If you're a web developer, you've probably worked a lot with XHTML, the markup language developed in 1999 to implement HTML as an XML format. Most people who use and promote XHTML do so because they think it's the “next version” of HTML, and they may have heard of some benefits here and there. But there is a lot more to it than you may realize, and if you're using it on your website, even if it validates, you are probably using it incorrectly.
I believe that XHTML has many good potential applications, and I hope it continues to thrive as a standard. This is precisely why I have written this article. The state of XHTML on the Web today is more broken than the state of HTML, and most people don't realize it because the major browsers are using classic HTML parsers that hide the problems. Even among the few sites that know how to trigger the XML parser, the authors tend to overlook some important issues. If you really hope for the XHTML standard to succeed, you should read this article carefully.
XHTML is a markup language originally hoped to someday replace HTML on the Web. For the most part, an XHTML 1.0 document differs from an HTML 4.01 document only in the lexical and syntactic rules: HTML is written in its own unique syntax defined by SGML, while XHTML is written in a different SGML-defined syntax called XML. The syntaxes differ in some of the characters that delimit tags and other constructs, whether or not certain types of shorthand markup may be used, and whether or not tag names or character entities are case sensitive, among other small differences.
The Document Type Definition (DTD, which is referenced by the doctype declaration) then defines which elements, attributes, and character entities exist in the language and where those elements may be placed. The DTDs of XHTML 1.0 and HTML 4.01 are nearly identical, meaning that as far as things like elements and attributes go, XHTML 1.0 and HTML 4.01 are basically the same language. The only added benefit of XHTML is that it's written in XML and shares the benefits XML has over HTML's syntax. I'll explain those benefits later in this article, but first I'd like to debunk some of the false benefits you may have heard.
There are many false benefits of XHTML promoted on the Web. Let's clear up some of them at a glance (with details and other pitfalls provided later):
XHTML does not promote separation of content and presentation any more than HTML does. XHTML has all of the same elements and attributes (including presentational ones) that HTML has, and it doesn't offer any additional CSS features. Semantic markup and separation of content and presentation is absolutely possible in HTML, and with equal ease. In terms of semantics, HTML 4.01 and XHTML 1.0 are exactly the same.
Most XHTML pages on the Web are not parsed as XML by today's web browsers. With typical server configurations, browsers will parse your XHTML as HTML “tag soup” instead. The vast majority of XHTML pages on the Web cannot be parsed as XML since they rely on HTML error handling. Even most valid XHTML pages end up disfigured when parsed as XML, since they were only tested in HTML-parsing browsers.
XML parsers do not typically check documents for validity. They only check them for well-formedness, which is a separate concept. If you leave out a required element, use deprecated or nonstandard elements or attributes, or put an element somewhere it isn't allowed, the XML parser will provide no indication of the error, and the browser will have to silently deal with the error like HTML parsers do.
HTML is not deprecated and is not being phased out at this time. In fact, the World Wide Web Consortium recently renewed the HTML working group, which is working to develop HTML 5. The developers of Firefox, Opera, and Safari have pushed very hard for the development of HTML 5 and have largely ignored the development of XHTML 2. The Safari development team has even opted to not take part in the XHTML 2 development process. The CTO of Opera said in an interview, “I don't think XHTML is a realistic option for the masses. HTML5 is it.”
XHTML 1.x is not “future-compatible”. XHTML 2, currently in the drafting stages, is not backwards-compatible with XHTML 1.x. XHTML 2 will have lots of major changes to the way documents are written and structured, and even if you already have your site written in XHTML 1.1, a complete site rewrite will usually be necessary in order to convert it to proper XHTML 2. A simple XSL transformation will not be sufficient in most cases, because some semantics won't translate properly.
HTML 4.01 is actually more future-compatible. A valid HTML 4.01 document written to modern support levels will be valid HTML 5, and HTML 5 is where the majority of attention is from browser developers and the W3C.
XHTML does not have good browser support. In typical setups, most browsers simply pretend that your XHTML pages are regular HTML (which presents a number of problems, as I'll explain later). Some major browsers like Firefox, Opera, and Safari may attempt to handle the page as proper XHTML if and only if you include a special HTTP header instructing them to do so. But when you do, Internet Explorer and a number of other user agents will choke on it and won't display a page at all. Even when handling the page as XHTML, the supporting browsers have a number of additional bugs, which I'll also discuss in this article.
Most browsers do not parse valid XHTML dramatically faster than valid HTML, even when they're parsing XHTML correctly. This is partly because most browsers only support a small subset of the HTML/SGML standard to begin with, so the real complexities of proper HTML parsing are mostly ignored anyway. The only major additional complexity of HTML that is well supported is tag omission, but most browsers use hardcoded rules specific to HTML in order to cheat through that with minimal performance impact. The browser can lose some minor shorthand logic with XML, but it now has to use extra logic to confirm that the document is well-formed. Although XHTML, when parsed with an XML parser, may be slightly faster to parse than typical HTML, the difference isn't very significant in most cases. And either way, download speed is usually the bottleneck when it comes to document parsing. Whether it's HTML or XHTML, by the time the page finishes downloading, the whole thing is already parsed. The users won't notice any speed difference.
XHTML is not extensible if you hope to support Internet Explorer or the number of other user agents that can't parse XHTML as XML. They will handle the document as HTML and you will have no extensibility benefit.
XHTML source is not necessarily any “cleaner” than HTML source. If you prefer using lower-case tag and attribute names, you can do so in HTML. If you prefer having quotes around all attribute values, you may do so in HTML. If you prefer making sure all of your non-empty elements have end tags, you may use end tags in HTML, too. In fact, these are considered best practice principles in HTML. The only real markup differences between an HTML document following best practices and an XHTML document following the legacy compatibility guidelines are the doctype choice, XHTML's extra required attributes on the html tag, and XHTML's extra / character in empty element tags.
Some argue that the availability of HTML's shorthand constructs is what makes HTML “unclean”. But the only HTML shorthand construct that is required is the omitted end tag on elements that have to be empty, and it's common practice in XHTML (alas, even required in many cases) to also use a shorthand construct on those elements: the so-called “self-closing tag” which originates from SGML's “null end tag” shorthand construct.
If you prefer to minimize your use of shorthand markup and would like the validator to enforce those restrictions in HTML as well, you can use Web Devout's HTML Good Practice Checker.
Using XHTML does not encourage better support by web browsers and it is not “a vote for a better Web” if you are still supporting Internet Explorer and various search engines and other user agents that require text/html. If you serve it with the typical text/html content type, you are giving all browsers a thumbs-up to treat it exactly like classic HTML, meaning absolutely no progress is made. Even if you use only application/xhtml+xml and shut out Internet Explorer and various other user agents entirely, it won't mean anything: Microsoft already plans to support real XHTML in an upcoming release of Internet Explorer; they just want to make sure they support it correctly from the initial launch. Even still, XHTML 1.x is a dead-end standard, since it's completely incompatible with XHTML 2.0 and all other future HTML/XHTML standards, as explained above, and since the majority of XHTML content on the Web today cannot be safely parsed as XML.
XML does have a number of improvements over HTML's syntax:
Although HTML's syntax allowed for a lot of shorthand markup and other flexibility, it proved too difficult to write a correct and fully-featured parser for it, since a truly correct parser would have to support the entire SGML standard. As a result, most user agents, including all of today's major web browsers, make many technically unsound assumptions about the lexical format of HTML documents and don't support a number of shorthand features like Null End Tags (<tag/Content/), unclosed start/end tags (<tag<tag>), and empty tags (<>). XML was designed to eliminate these extra features and restrict documents to a tight set of rules that are more straightforward for user agents to implement. In effect, XML defines the assumptions that user agents are allowed to make, while still resulting in a file that a theoretical fully-featured SGML user agent could parse once pointed to XML's SGML declaration.
It should be noted that an XML parser for the most part is not dramatically easier to write than the level of HTML support offered by most HTML parsers (which will be thoroughly specified in HTML 5). Most of the features that would make HTML more difficult to write a parser for, such as custom SGML declarations, additional marked sections, and most of the shorthand constructs, have negligible use on the Web anyway and generally have poor or absent support in major web browsers. The most significant difference is XML's lack of support for omitted start and end tags, which in theory could amount to complicated logic in HTML for elements not defined as empty. Even still, most browsers don't bother to implement real DTD-based parsing logic, so it isn't quite so complicated in practice.
In hopes of eliminating some error handling logic, XML user agents are told to not be flexible with errors: if a user agent comes upon a problem in the XML document, it will simply give up trying to read it, and the user will be presented with a “parse error” message instead of the webpage. This eliminates the compatibility issues with incorrectly-written markup and browser-specific error handling methods by requiring documents to be “well-formed”, while giving webpage authors immediate indication of the problem. This does, however, mean that a single minor issue like an unescaped ampersand (&) in a URL or a mismatched character encoding in a trackback message would cause the entire page to fail, and so most of today's public web applications can't safely be incorporated in a true XHTML page.
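The all-or-nothing behavior described above is easy to reproduce with any conforming XML parser. The following is a minimal sketch that uses Python's standard xml.etree.ElementTree module as a stand-in for a browser's XML parser (an assumption made purely for illustration; the markup and URL are made up). The two fragments differ only in how the ampersand in the URL is written.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments: the only difference is how the ampersand is written.
UNESCAPED = '<p><a href="search?q=xhtml&page=2">next</a></p>'
ESCAPED = '<p><a href="search?q=xhtml&amp;page=2">next</a></p>'


def is_well_formed(markup: str) -> bool:
    """Return True if the XML parser accepts the markup, False if it gives up."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


print(is_well_formed(UNESCAPED))  # the bare "&" is a fatal error for an XML parser
print(is_well_formed(ESCAPED))    # the escaped form parses normally
```

An HTML parser would quietly recover from the first fragment, which is why this kind of mistake goes unnoticed on pages served as text/html.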
While user agents are supposed to fail on any page that isn't well-formed (in other words, one that doesn't follow the generic XML grammar rules), they do not have to fail on a page that is well-formed but invalid. For example, although it is invalid to have a span element as an immediate child of the body element, most XML-supporting web browsers won't provide indication of the error because the page is still well-formed: the DTD is violated, but not the fundamental rules of XML itself. Some user agents may choose to be “validating” agents and will also fail on validity errors, but they aren't common. There is some worry that people may rely too heavily on the well-formedness checker and forget to also check for validity, which could lead to a higher occurrence of invalid pages even among the otherwise standards-conscious developers.
Despite popular assumption, even if an XML page is perfectly valid according to some validators, it still might not be well-formed. Well-formedness involves some requirements not present in the classic SGML definition of validity.
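The difference between well-formedness and validity can be seen with a non-validating parser. The sketch below again uses Python's xml.etree.ElementTree, which checks well-formedness only and never consults the DTD (the sample documents are hypothetical). The first document breaks the XHTML 1.0 Strict content model but is accepted; the second breaks XML's own grammar and is rejected.

```python
import xml.etree.ElementTree as ET

# Well-formed but invalid: XHTML 1.0 Strict does not allow span directly
# inside body, and the required head and title elements are missing.
INVALID_BUT_WELL_FORMED = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<body><span>hello</span></body>"
    "</html>"
)

# Not well-formed: the p element is never closed.
NOT_WELL_FORMED = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<body><p>hello</body>"
    "</html>"
)


def parse_outcome(markup: str) -> str:
    """Report whether a non-validating XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as error:
        return f"rejected: {error}"
    return "accepted"


print(parse_outcome(INVALID_BUT_WELL_FORMED))
print(parse_outcome(NOT_WELL_FORMED))
```

Catching the first kind of error still requires a separate validator.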
Unlike HTML's SGML-defined syntax, which was specifically made for HTML, XML is a common syntax used in many different languages. This means that a single relatively simple set of parsing logic can handle a number of different languages. It also paved the way for the Namespaces in XML standard, which allows multiple documents in different XML formats to be combined in a single XML document, so that you can have, for example, an XHTML page that contains one or more SVG images that use MathML inside them.
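As a rough illustration of what namespaces make possible, the sketch below parses a hypothetical compound document (an XHTML page with one inline SVG image) using a generic XML parser and groups the elements by the namespace they belong to. The document and the helper name elements_by_namespace are made up for this example; the namespace URIs are the standard ones for XHTML and SVG.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
SVG_NS = "http://www.w3.org/2000/svg"

# Hypothetical compound document: an XHTML page with an inline SVG image.
DOCUMENT = f"""\
<html xmlns="{XHTML_NS}">
  <head><title>Compound document</title></head>
  <body>
    <p>A circle:</p>
    <svg xmlns="{SVG_NS}" width="20" height="20">
      <circle cx="10" cy="10" r="8"/>
    </svg>
  </body>
</html>"""


def elements_by_namespace(markup: str) -> dict:
    """Group local element names by the namespace they belong to."""
    groups = {}
    for element in ET.fromstring(markup).iter():
        # ElementTree spells qualified names as "{namespace}localname".
        namespace, _, local_name = element.tag[1:].partition("}")
        groups.setdefault(namespace, []).append(local_name)
    return groups


for namespace, names in elements_by_namespace(DOCUMENT).items():
    print(namespace, names)
```

The same parser handles both vocabularies without knowing anything about either one; it is up to the user agent to decide what to do with elements from each namespace.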
The practicality of this on webpages is a subject of debate. Separation of content, presentation, and behavior is a defining characteristic of modern web development, as the modular setup provides many benefits over the mixed alternative. Those benefits also hold true when debating the idea of mixed XML formats. Since (X)HTML provides facilities for embedding other XML formats in a more modular fashion (using elements like object and link), it's usually better to use the modular approach rather than mixing the files in a single document. Moving the SVG or RSS data into files separate from the (X)HTML allows the user agent to cache them and improve performance while reducing bandwidth cost and easing maintainability.
When your website sends a document to the visitor's browser, it adds on a special content type header that lets the browser know what kind of document it's dealing with. For example, a PNG image has the content type image/png and a CSS file has the content type text/css. HTML documents have the content type text/html. Web servers typically send this content type whenever the file extension is .html, and server-side scripting languages like PHP also typically send documents as text/html by default.
XHTML does not have the same content type as HTML. The proper content type for XHTML is application/xhtml+xml. Currently, many web servers don't have this content type reserved for any file extension, so you would need to modify the server configuration files or use a server-side scripting language to send the header manually. Simply specifying the content type in a meta element will not work over HTTP.
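To make the "send the header manually" option concrete, here is a minimal sketch of a WSGI application in Python that labels its response as XHTML in the HTTP header itself. The page content and the name app are hypothetical; the only part that matters is the Content-Type value.

```python
# Hypothetical page; any well-formed XHTML document would do.
PAGE = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
    '<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">'
    "<head><title>Served as XML</title></head>"
    "<body><p>Hello</p></body></html>"
).encode("utf-8")


def app(environ, start_response):
    """Minimal WSGI app that declares the XHTML content type in the HTTP header."""
    headers = [
        ("Content-Type", "application/xhtml+xml; charset=utf-8"),
        ("Content-Length", str(len(PAGE))),
    ]
    start_response("200 OK", headers)
    return [PAGE]


# To try it locally (host and port are arbitrary choices):
#   from wsgiref.simple_server import make_server
#   make_server("localhost", 8000, app).serve_forever()
```

Browsers that support XHTML will run such a response through their XML parser. As discussed later in this article, Internet Explorer will offer it as a download instead.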
When a web browser sees the text/html content type, regardless of what the doctype says, it automatically assumes that it's dealing with plain old HTML. Therefore, rather than using the XML parsing engine, it treats the document like tag soup, expecting HTML content. Because HTML 4.01 and simple XHTML 1.0 are often very similar, the browser can still understand the page fairly well. Most major browsers consider things like the self-closing portion of a tag (as in <br />) as a simple HTML error and strip it out, usually ending up with the HTML equivalent of what the author intended.
However, when the document is treated like HTML, you get none of the benefits XHTML offers. The browser won't understand other XML formats like MathML and SVG that are included in the document, and it won't do the automatic well-formedness checking that XML parsers do. In order for the document to be treated properly, the server would need to send the application/xhtml+xml content type.
The problems go deeper. Comment markers are sometimes handled differently depending on the content type, and when you enclose the contents of a script or style element with basic SGML-style comments, it will cause your script and style information to be completely ignored when the document is treated like XML. Also, any special markup characters used in the inline contents of a style or script element will be parsed as markup instead of being treated as character data like in HTML. To solve these problems, you must use an elaborate escape sequence described in the article Escaping Style and Script Data, and even then there are situations in which it won't work.
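The XML side of these problems can be reproduced with a generic XML parser. The sketch below uses Python's xml.etree.ElementTree (as a stand-in for a browser's XML parser, which is an assumption) on three hypothetical inline scripts: one hidden in an SGML-style comment, one with bare markup characters, and one wrapped in a CDATA section. It only shows how an XML parser reads each one; it says nothing about how the same markup behaves when parsed as HTML, which is what makes the full escape sequence elaborate.

```python
import xml.etree.ElementTree as ET

# HTML-era habit: hide the script from old browsers inside an SGML-style comment.
COMMENTED = "<script><!--\nalert('hi');\n//--></script>"
# Bare markup characters in an inline script.
BARE = "<script>if (a < b && c) { run(); }</script>"
# The same script wrapped in a CDATA section.
CDATA = "<script><![CDATA[if (a < b && c) { run(); }]]></script>"


def script_text(markup: str):
    """Return the character data the XML parser finds, or None if parsing fails."""
    try:
        return ET.fromstring(markup).text or ""
    except ET.ParseError:
        return None


print(repr(script_text(COMMENTED)))  # the comment is discarded, so no script remains
print(repr(script_text(BARE)))       # "<" and "&" are read as markup and parsing fails
print(repr(script_text(CDATA)))      # the script survives as character data
```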
Furthermore, the CSS and DOM specifications have special provisions for HTML that don't apply to XHTML when it's treated as XML, so your page may look and behave in unexpected ways. The most common problem is a white gap around your page if you have a background on the body, no background on the html element, and any kind of spacing between the elements, such as a margin, padding, or a body height under 100% (browsers typically have some combination of these by default). In scripting, tag names are returned differently and document.write() doesn't work in XHTML treated as XML. Table structure in the DOM is different between the two parsing modes. These are only a select few of the many differences.
The following are some examples of differing behavior between XHTML treated as HTML and XHTML treated as XML. The anticipated results are based on the way Internet Explorer, Firefox, and Opera treat XHTML served as HTML. Some other browsers are known to behave differently. Also note that Internet Explorer doesn't recognize the application/xhtml+xml content type (see below for an explanation), so it will not be able to view the examples in the second column.
text/html | application/xhtml+xml |
---|---|
Example 1 | Example 1 |
Example 2 | Example 2 |
Example 3 | Example 3 |
Example 4 | Example 4 |
Example 5 | Example 5 |
Example 6 | Example 6 |
Example 7 | Example 7 |
Example 8 | Example 8 |
Example 9 | Example 9 |
When the XHTML 1.0 specification was first written, there were provisions that allowed an XHTML document to be sent as text/html as long as certain compatibility guidelines were followed. The idea was to ease migration to the new format without breaking old user agents. However, these provisions are now viewed by many as a mistake. The whole point of XHTML is to be an XML alternative to HTML, yet due to the allowance of XHTML documents to be sent as text/html, most so-called XHTML documents on the Web today would break if they were treated like XML (see the real-world examples below). This even includes many valid XHTML documents. Several prominent members of the W3C are now challenging the wisdom of the text/html provisions and advocating that this content type should never be allowed for XHTML.
Many authors incorrectly believe that following the HTML compatibility guidelines and validating the document will guarantee that the document is compatible with both the HTML and XHTML specifications. In reality, if you use even a single self-closing tag in the document (which includes any link, img, or br tag), you are already creating incompatibilities between the two specifications. The reason for this particular issue is explained below. In this article, I have already explained a number of other factors not covered in XHTML 1.0 Appendix C that will also cause the document to run into incompatibilities. The truth is that the HTML compatibility guidelines do not actually provide true compatibility between HTML and XHTML; they merely attempt to minimize the damage of using text/html for XHTML documents, and that damage control is very limited in effectiveness.
XHTML 1.x already makes no provision for the use of text/html when taking advantage of any XHTML features not present in HTML, and the current draft of XHTML 2 expressly forbids it.
Internet Explorer does not support XHTML. Like other web browsers, when a document is sent as text/html, it treats the document as if it was a poorly constructed HTML document. However, when the document is sent as application/xhtml+xml, Internet Explorer won't recognize it as a webpage; instead, it will simply present the user with a download dialog. This issue still exists in Internet Explorer 7.
Although all other major web browsers, including Firefox, Opera, Safari, and Konqueror, support XHTML, the lack of support in Internet Explorer, as well as in major search engines and web applications, is a strong reason to avoid it.
Content negotiation is the idea of sending different content depending on what the user agent supports. Many sites attempt to send XHTML as application/xhtml+xml to those who support it, and either XHTML as text/html or real HTML to those who don't.
There are two methods generally used to determine what the user agent supports, both based on the Accept HTTP header. Most often, sites use the incorrect method, where they simply look for the string “application/xhtml+xml” in the header value. Some sites use the correct method, where they actually parse the header value, supporting wildcards and ordering by q value.
Unfortunately, neither of these methods works reliably.
The first method doesn't work because not all XHTML-supporting user agents actually have the text “application/xhtml+xml” in the Accept header. Safari and Konqueror are two such browsers. The application/xhtml+xml content type is implied by a wildcard value instead. Meanwhile, not all HTML-supporting user agents have “text/html” in the header. Internet Explorer, for example, doesn't mention this content type. Like Safari and Konqueror, it implies this support by using a wildcard. Even among those user agents that support XHTML and mention application/xhtml+xml in the header, it may have a lower q value than text/html (or a matching wildcard), which implies that the user agent actually prefers text/html (in other words, its XHTML support may be experimental or broken).
The second method (the correct, 100% standards-compliant one) doesn't work because most major browsers have inaccurate Accept headers:
Firefox 2.0 and below have application/xhtml+xml listed with a higher q value than text/html, even though Mozilla has posted an official recommendation on its site saying that websites should use text/html for these versions if they can, for reasons described below.
Internet Explorer doesn't mention either text/html or application/xhtml+xml in its Accept header. Instead, both content types are covered by a single wildcard value (which implies that every content type in existence is supported equally well, which is obviously untrue). So Internet Explorer is saying that it supports both text/html and application/xhtml+xml equally, even though it actually doesn't support application/xhtml+xml at all. In the case that a user agent claims to support both equally, the site is supposed to use its own preference. A possible workaround is for the site to “prefer” sending text/html or, in a toss-up situation, only send application/xhtml+xml if it's actually mentioned explicitly in the header. However...
Safari and Konqueror also give text/html and application/xhtml+xml the same q value (in fact, like Internet Explorer, they also claim to support everything in existence equally well). But they don't mention application/xhtml+xml explicitly; it's implied by a wildcard. So if you use the above workaround, Safari and Konqueror will receive text/html even though they really do support application/xhtml+xml.
As disappointing as it may be, content negotiation simply isn't a reliable approach to this problem.
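The two detection methods can be sketched in a few lines of Python. This is a simplified illustration, not a complete implementation of HTTP content negotiation: the function names are made up, media type parameters other than q are ignored, and the two sample header values are assumptions modeled on the behaviors described above (one that lists application/xhtml+xml explicitly and ahead of text/html, and one that offers only a wildcard).

```python
def parse_accept(header: str) -> list:
    """Parse an Accept header value into a list of (media_range, q) pairs."""
    ranges = []
    for item in header.split(","):
        parts = [part.strip() for part in item.split(";")]
        media_range = parts[0].lower()
        if not media_range:
            continue
        q = 1.0  # q defaults to 1 when it is not given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        ranges.append((media_range, q))
    return ranges


def quality(media_type: str, header: str) -> float:
    """Return the q value that applies to media_type, preferring the most specific match."""
    type_, _, subtype = media_type.lower().partition("/")
    best_specificity, best_q = -1, 0.0
    for media_range, q in parse_accept(header):
        if media_range == f"{type_}/{subtype}":
            specificity = 2
        elif media_range == f"{type_}/*":
            specificity = 1
        elif media_range == "*/*":
            specificity = 0
        else:
            continue
        if specificity > best_specificity:
            best_specificity, best_q = specificity, q
    return best_q


def naive_supports_xhtml(header: str) -> bool:
    """The common but incorrect method: a plain substring search."""
    return "application/xhtml+xml" in header


# Illustrative header values (assumptions, simplified).
EXPLICIT_XHTML = "application/xhtml+xml,text/html;q=0.9,*/*;q=0.5"
WILDCARD_ONLY = "*/*"

for header in (EXPLICIT_XHTML, WILDCARD_ONLY):
    print(
        naive_supports_xhtml(header),
        quality("application/xhtml+xml", header),
        quality("text/html", header),
    )
```

With the wildcard-only header, the substring search reports no XHTML support while the q-value method reports equal support for both types. Neither answer tells the server whether it is talking to a browser that handles application/xhtml+xml or one that does not, which is exactly the ambiguity described above.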
In XHTML, all elements are required to be closed, either by an end tag or by adding a slash to the start tag to make it self-closing. Since giving empty elements like img or br an end tag would confuse browsers treating the page like HTML, self-closing tags tend to be promoted. However, XML self-closing tags directly conflict with a little-known and poorly supported HTML/SGML feature: Null End Tags.
A Null End Tag is a special shorthand form of a tag that allows you to save a few characters in the document. Instead of writing <title>My page</title>, you could simply write <title/My page/ to accomplish the same thing. Due to the rules of Null End Tags, a single slash in an empty element's start tag would close the tag right then and there, meaning <br/ is a complete and valid tag in HTML. As a result, if you have <br/> or <br />, a browser supporting Null End Tags would see that as a br element immediately followed by a simple > character. Therefore, an XHTML page treated as HTML could be littered with unwanted > characters.
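The effect can be simulated with a toy rewrite in Python. To be clear, this is not an SGML parser: it is a rough sketch that handles only a hand-picked list of empty elements with a regular expression, and the function name net_reading is made up. It rewrites each XML-style self-closing tag into the markup a Null End Tag-aware HTML parser would effectively see, namely the complete tag followed by a literal > character (written here as &gt;).

```python
import re

# A hand-picked subset of HTML 4.01's empty elements, for illustration only.
EMPTY_ELEMENTS = ("br", "hr", "img", "input", "link", "meta")

# Matches an XML-style self-closing tag for one of the empty elements.
# Slashes inside quoted attribute values do not count as the closing slash.
SELF_CLOSING = re.compile(
    r"<(?P<name>" + "|".join(EMPTY_ELEMENTS) + r")\b"
    r"""(?P<attrs>(?:[^<>"'/]|"[^"]*"|'[^']*')*)"""
    r"/>",
    re.IGNORECASE,
)


def net_reading(markup: str) -> str:
    """Rewrite self-closing tags the way a Null End Tag-aware parser reads them."""
    def rewrite(match):
        # The slash already ends the tag, so the ">" is left over as text.
        attrs = match.group("attrs").rstrip()
        return f"<{match.group('name')}{attrs}>&gt;"

    return SELF_CLOSING.sub(rewrite, markup)


print(net_reading("Line one<br />Line two"))
print(net_reading('<img src="images/logo.png" alt="" />'))
```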
This problem is often overlooked because most popular browsers today are lacking support for Null End Tags, as well as some other SGML shorthand features. However, there are still some smaller user agents that properly support Null End Tags. One of the more well-known user agents that support it is the W3C validator. If you send it a page that uses XHTML self-closing tags, but force it to parse the page as HTML/SGML like most user agents do for text/html pages, you can see the results in the outline: immediately after each of the self-closing elements, there is an unwanted > character that will be displayed on the page itself.
(It should be noted that the W3C Validator is unusual in that it generally determines the parsing mode from the doctype, rather than from the content type as most other user agents do. Therefore, an HTML doctype was used in the above example just so the validator would attempt to parse the page using the HTML syntax as all major browsers will for text/html pages regardless of the doctype. The Null End Tag rules are actually set in the SGML syntax definition, not the DTD, so this example is accurate to what you should expect in a fully compliant SGML user agent even with an XHTML doctype.)
Technically, a restricted and altered form of Null End Tags exists in XML and is frequently used: the self-closing portion of the start tag. While Null End Tags are defined as / ... / in HTML's syntax, they are specially defined as / ... > in XML with the added restriction that it must close immediately after it is opened, meaning the element must have no content. This was designed to look similar to a regular start tag for web developers who are unfamiliar with typical Null End Tags. However, in the process it creates inherent incompatibility with HTML's syntax for all empty elements.
In summary, although this issue doesn't show in most popular web browsers, a user agent that more fully supports SGML would see unwanted > characters all over XHTML pages that are sent with the text/html content type. If the goal of using XHTML is to help promote standards, then it's quite counterproductive to cause unnecessary problems for user agents that more correctly comply with the SGML standard.
Although Firefox supports the parsing of XHTML documents as XML when sent with the application/xhtml+xml content type, its performance in versions 2.0 and below is actually worse than with HTML. When parsing a page as HTML, Firefox will begin displaying the page while the content is being downloaded. This is called incremental rendering. However, when it's parsing XML content, Firefox 2.0 and below will wait until the entire page is downloaded and checked for well-formedness before any of the content is displayed. This means that, although in theory XML is supposed to be faster to parse than HTML, in reality these versions of Firefox usually display HTML content to the user much faster than XHTML/XML content. Thankfully, this issue is expected to be resolved in Firefox 3.0.
However, there are also issues in other browsers, such as certain HTML-specific provisions in the CSS and DOM standards being mistakenly applied to XHTML content parsed as XML. For example, if there is a background set on the body element and none on the html element, Opera will apply the background to the html element as it would in HTML. So even when dealing exclusively with XHTML parsed as XML, you still run into a number of the same problems that you do when trying to serve XHTML either way.
All in all, true XHTML support in major user agents is still very weak. Because a key user agent — namely, Internet Explorer — has made no visible effort to support XHTML, other major user agents have continued to see it as a relatively low priority and so these bugs have lingered. HTML is recommended over XHTML by both Mozilla and Safari and is generally better supported than XHTML by all major browsers.
XHTML is a very good thing, and I certainly hope to see it gain widespread acceptance in the future. However, it simply isn't widely supported in its proper form. XHTML is an XML format, and to force a web browser to treat it like HTML is going against the whole purpose of XHTML and also inevitably causes other complications. Assuming you don't want to dramatically limit access to your information, XHTML can only be used incorrectly, be interpreted as invalid markup by most user agents, cause unwanted results in others, and offer no added benefit over HTML. HTML 4.01 Strict is still what most user agents and search engines are most accustomed to, and there's absolutely nothing wrong with using it if you don't need the added benefits of XML. HTML 4.01 is still a W3C Recommendation, and the W3C has even announced plans to further develop HTML alongside XHTML in the future.
[I]f we tried to support real XHTML in IE 7 we would have ended up using our existing HTML parser (which is focused on compatibility) and hacking in XML constructs. It is highly unlikely we could support XHTML well in this way [...] I would much rather take the time to implement XHTML properly after IE 7, and have it be truly interoperable
Serving valid HTML 4.01 as text/html ensures the widest browser and search engine support.
On today's web, the best thing to do is to make your document HTML4 all the way. Full XHTML processing is not an option, so the best choice is to stick consistently with HTML4.
I don't think XHTML is a realistic option for the masses. HTML5 is it.
I'm an advocate of using XHTML only in the correct way, which basically means you have to use HTML. Period.
Authors intending their work for public consumption should stick to HTML 4.01
What we needed most is the acknowledgement that the Web is based on HTML 4, CSS, JavaScript and a few other technologies. That is now done, the W3C working on a successor to HTML 4 based on the work done by the WHAT-WG. XHTML 2 is not the future of the Web.
The following are just a few of the countless sites that use an XHTML doctype but, as of this moment of writing, completely fail to load or otherwise work improperly when parsed as XML, thus missing the whole point of XHTML. The authors of most of these sites are quite prominent in the web standards community — many are involved in the Web Standards Project (WaSP) — yet they have still fallen victim to the pitfalls of current use of XHTML. In fact, I have found that nearly all XHTML websites owned by WaSP members have problems when parsed as XML.
You could consider this a “shame list” of sorts. These are the same people who are supposed to be teaching others how to use web standards properly, yet they have written markup that basically depends on browsers treating it incorrectly. But the main point of this list isn't to pick on individuals; it's to reinforce the fact that even so-called experts at web standards have trouble juggling the different ways XHTML will inevitably be handled on the Web. And what benefit does it bring? None of the following sites make use of anything XHTML offers over HTML.
The following “View as application/xhtml+xml” links allow you to see how the pages would look when sent with the proper XHTML content type. This script adds a base element so that relative URLs work properly, but no other modifications are made to the markup. Alternatively, you can test the original unaltered page's XHTML rendering in Firefox using the Force Content-type extension and setting the new content-type to application/xhtml+xml.
These links were last checked 2007-09-23.
The following are some significant sites relevant to web standards that continue to use HTML rather than XHTML.