About

My name is Rob Cherny and I'm a professional Web Developer with 16 years of experience creating Web sites and Web-based applications. This site is where I write about my work, and random other things...

Web Development

(X)HTML 5 and the WHATWG Against XHTML Served as HTML

So there’s been a lot of discussion lately around the new W3C HTML Working Group, its new charter, its chair, and so forth. There have also been posts about some of the new attributes. The W3C says it will work with the WHATWG (which is really a group of browser vendors, minus Microsoft) to move forward on XHTML and HTML 5, commonly referred to as (X)HTML5 or Web Applications 1.0. What a lot of people don’t realize is that the specs moving forward are intended to enhance both languages, especially with XHTML 2 all but considered dead on the vine. They reference both HTML 5 and XHTML5 (never mind 2, 3, and 4 -- not that there will be any).

Appendix C of XHTML

Several things are worth discussing in the current working drafts… but one thing I noticed while looking over the WHATWG’s current specs is that, moving forward, XHTML that uses the new features cannot be served as text/html:

“XHTML documents (XML documents using elements from the HTML namespace) that use the new features described in this specification and that are served over the wire (e.g. by HTTP) must be sent using an XML MIME type such as application/xml or application/xhtml+xml and must not be served as text/html.”
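To make "served as" concrete, here is a minimal sketch of what sending a document with an XML MIME type looks like from the server side. It assumes a bare WSGI application; the names `XHTML_DOC`, `xhtml_app`, and `call` are made up for illustration and come from neither spec.

```python
# A tiny WSGI app that labels its response as application/xhtml+xml
# rather than text/html. All names here are illustrative assumptions.

XHTML_DOC = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<head><title>Example</title></head>'
    '<body><p>Hello<br /></p></body></html>'
)


def xhtml_app(environ, start_response):
    """Return the document with an XML MIME type in Content-Type."""
    body = XHTML_DOC.encode("utf-8")
    headers = [
        ("Content-Type", "application/xhtml+xml; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


def call(app):
    """Invoke a WSGI app directly and capture its status, headers, and body."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = dict(headers)

    body = b"".join(app({}, start_response))
    return captured, body


response, payload = call(xhtml_app)
print(response["headers"]["Content-Type"])
```

Swapping the header value for text/html is the only difference between the two ways of serving the same bytes, which is why the argument is about labeling and parsing rather than about the markup itself. To try it in a browser, the app could be handed to `wsgiref.simple_server.make_server` from the standard library.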

Traditionally, following Appendix C of the XHTML spec, you could serve XHTML Strict as text/html to HTML “User Agents”. Now a lot of people disagree about that, of course, and that’s not where I’m going.

XHTML Compatibility is Invalid HTML

Truth be told, if you’re serving XHTML within the guidelines of Appendix C, what you’re actually serving up is invalid HTML 4.01.

The space added before the closing slash of self-closing tags in XHTML, for the benefit of Web browsers, isn’t required by XML; to an HTML browser the “/” is just a mis-interpreted, unknown HTML element attribute. It might as well say “blah”, which is tossed out.
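A rough way to see the "tossed out" behavior is to run both spellings through a lenient HTML parser and compare what comes out. The sketch below uses Python's standard `html.parser` as a stand-in for a browser, which it is not; browsers follow their own rules. `StartTagCollector` and `start_tags` are hypothetical helper names.

```python
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Record every start tag and its attributes as the parser sees them."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # Also called for <br /> style tags via the default handle_startendtag.
        self.start_tags.append((tag, attrs))


def start_tags(markup):
    """Return the list of (tag, attrs) pairs found in the markup."""
    collector = StartTagCollector()
    collector.feed(markup)
    collector.close()
    return collector.start_tags


html_style = '<p>Line one<br>Line two <img src="a.png" alt=""></p>'
xhtml_style = '<p>Line one<br />Line two <img src="a.png" alt="" /></p>'

# The lenient parser reports the same tags and attributes for both spellings.
print(start_tags(html_style) == start_tags(xhtml_style))
```

In this parser, at least, the trailing slash adds no attribute and changes nothing about the tags that get reported, which is the practical effect Appendix C relies on.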

In the end, you’re either a Tantek Celik or an Ian Hickson. Both have worked on building standards compliant browsers and both have worked with the W3C.

Religious War

Now, this has become a religious war online between so-called purists on both sides, and it’s something I try to stay away from – honestly I think both sides have a point.

Honestly though, somewhere inside I’ve believed in Appendix C, based on an interpretation I don’t hear very often. If I’m wrong, please let me know:

You’ve created a document.
It’s labeled as XHTML.
It’s served as HTML. So what, that’s what browsers support.
It is rendered and understood in an “HTML User Agent” (browser) as HTML in memory at run time at that moment, exploiting HTML’s generous error handling.
It’s valid XHTML (XML) based on its structure, its label, and content.
That document may even be dynamic, coming out of a CMS, so you want the fragments of what’s pushed into it to be valid XHTML.
That document and the fragments from the CMS may not only be used in a Web browser.
Parts of the content may need to be parsed by other systems from the file system, or with other tools, which may demand XML compliance and rules.
This may happen today, tomorrow, next week, or even next year.

So there you go. It’s not only XHTML and XML for the browser, it’s XML for other applications, uses, and purposes. The fact that people focus on the way it’s served to the browser seems short-sighted to me. It’s not the only place this document winds up, or may end up over time. Let’s hear it for a flexible specification.
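As a sketch of that "other applications" scenario, here is what a non-browser tool might do with a CMS fragment, assuming the fragment really is well-formed XHTML. It uses Python's standard `xml.etree.ElementTree`; the fragment, `extract_links`, and `is_well_formed` are invented for illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical fragment as it might be stored in a CMS, written to XHTML rules.
fragment = (
    '<div xmlns="http://www.w3.org/1999/xhtml">'
    '<p>See the <a href="/specs/">specs</a>.<br /></p>'
    '<img src="logo.png" alt="Logo" />'
    '</div>'
)


def extract_links(xhtml):
    """Parse the fragment with a strict XML parser and collect link targets."""
    root = ET.fromstring(xhtml)
    return [a.get("href") for a in root.iter("{%s}a" % XHTML_NS)]


def is_well_formed(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


print(extract_links(fragment))
print(is_well_formed(fragment))
# The same content written HTML-style, with an unclosed <br>, is rejected.
print(is_well_formed('<div><p>Unclosed break<br></p></div>'))
```

The strict parser is the whole point here: the XHTML-style fragment can be reused by any XML tool chain, while the HTML-style one cannot without some conversion step in between.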

Browser Manufacturers and XHTML

However, browser manufacturers got together and formed the WHATWG, and those very same browser manufacturers don’t believe in Appendix C. They don’t want you to serve your documents that way; they say it’s broken and stupid. Maybe it is, but guys… browsers aren’t the only tools using these documents. Maybe there are other approaches, maybe there is other software that can change XHTML to HTML and back again, but there’s extra overhead there, and the way I see it, the marketplace and the industry are in a phase of transition. It takes time.

Backwards Compatibility and WHATWG

The WHATWG was started because its founders claimed the W3C wasn’t in tune with the people. And honestly, the W3C was not, and the WHATWG was right. But one of the mantras repeated over and over in WHATWG discussions is to retain backwards compatibility.

I say they’re already breaking that notion with concepts such as predefined classes that drive various types of functionality within the browser – which says to me they’re going to break a lot of existing apps.

Also, what about Microformats? Bottom line, there are people setting up ways of doing things online without these standards bodies, and sometimes I think both the W3C and the WHATWG need to listen a little bit more. Not everyone out here with an opinion has tons of R&D time and can join those mailing lists. I think they need better ways of communicating with the masses and soliciting feedback.

Fortunately the specs aren’t set yet, and now’s the time to start to talk about it.

Update 2007-01-25 AM: Apologies to Tantek Celik for initially having his name in here as "Celik Tantek" -- I have *no* idea what happened there, that was some sort of editor snafu as I was linking things up...

Jan 25, 12:49 PM in Web Development (filed under Web-Standards, XHTML)

Ian Hickson Jan 24, 11:13 PM #

The problem is most “XHTML” sent as text/html isn’t well-formed (let alone valid), so it won’t work when you use XML tools.

With HTML5/XHTML5, you have the choice to use either HTML5 or XHTML5, either is fine. If you want to use XML tools, use XHTML5, and it’ll work fine. If you want to use HTML tools, use HTML5, and it’ll work fine.

Also, because of the way HTML5 is defined, you can now just stick an HTML5 parser on the front of your XML tool chain and get all the benefits of XML processing, while using HTML5, and not having to rely on any error handling.

Also, note that trailing “/” characters are actually valid in HTML5. So you can basically write one document that works as both XHTML5/XML and HTML5, if you really want to. You have to jump through hoops to do it, but those are the same hoops you had to jump through when using Appendix C anyway.

HTH. -Hixie.

rob Jan 25, 12:16 PM #

Thanks for commenting and making those clarifications, Ian, I appreciate it.

For one, I didn’t realize (I confess I haven’t read the whole spec) that the trailing “/” was valid in HTML5. I probably shouldn’t have spoken so critically without reading the whole thing.

One question: what happens to XHTML 1.0 docs served as text/html going forward? If the DTD isn’t changed, does the User Agent just follow the same rules it does today? That seems to be the case with stricter MIME type rules coming along and different DTDs for (X)HTML 5.

I guess another thing I’m foggy on is, what is an HTML5 parser, and who’s making them (not talking about Web browsers, obviously)? Ultimately with the spec defined are these parsers things that the WG hopes will start to spring up? I don’t know many HTML 4 parsers out there, but I’m not as “in” on server-side tools and parsing tools such as those today.

Are there any examples you can point to? A concern of mine is that most CMSs don’t necessarily have an architecture that can support random tools such as these being introduced into the content delivery chain.

If you can use the trailing slashes in HTML5 it seems the WG is really attempting to blur the lines between the two technologies, which is probably the way to go. Being able to move back and forth between XML, XHTML5, and HTML5 sounds like an exciting prospect, but I suppose only time will tell.

jgraham Jan 25, 03:57 PM #

Well, one such tool is html5lib, a Python library for parsing HTML documents which can produce several different tree formats to make it as easy as possible to interface with your existing code (full disclaimer – I’m one of the authors). The existence of the WHATWG spec made it possible to produce this tool with much less effort than you might expect given the complexity of the problem (parse arbitrary ill-formed markup into a useful DOM tree) and, better yet, we can expect to produce a DOM tree that interoperates with other implementations, which will hopefully include most desktop web browsers.

As an example of the kind of application that benefits from this, html5lib has been incorporated into Planet Venus to replace the SGMLlib+ad-hoc error handling code. This enables a collection of invalid feeds to be turned into a well-formed application/xhtml+xml river of news.

Ian Hickson Jan 26, 05:54 PM #

As James points out, there already are parsers being written, such as the html5lib parser written in Python that he mentions. As he further mentions, Sam Ruby’s aggregator is a good example of a live application of this parser.

The HTML5 spec basically defines processing of any content sent as text/html. This includes all old versions of HTML (2, 3, 4), any XHTML sent as text/html, and unlabelled tag soup. The processing is compatible with existing browsers; there are detailed rules for error handling that were reverse engineered out of existing implementations, looking at how they handle all this content today. There are no “new” rules for new content; the same rules are applied to old content and new HTML5 content. The rules were designed to make it possible to do that without anything breaking.

So XHTML 1.0 sent as text/html, in an HTML5 world, is just treated like anything else (like today), and it continues “working”.

HTML5 actually doesn’t have any special DOCTYPEs; in XML you don’t need to give a DOCTYPE at all for XHTML5, and in HTML you just say “<!DOCTYPE HTML>” (and that’s just to trigger standards mode in browsers).

HTH. Please let me know (by e-mail) if you have any other questions. Thanks!
