Introduction
Let's say you want to make a website. You're big now, in fact you're massive, so you want to try and code it yourself. Heck, you've even pressed F12 a bunch of times on your favorite pages to see the machinations and mechanisms at work. That's when it hit you: it's actually not that hard. Indeed, HTML is a markup language: it formats text, and you have probably used one already. If you've ever used Discord and sent something like:
*look out my window*, and it came out in italics like this:
“look out my window”, you've effectively used Markdown[1], which is just one of many markup languages out there.
Let's say you made a few experiments with HTML and copy-pasted the swagger of your favorite sites, edited a bit and had fun. You may have noticed the first line usually is <!DOCTYPE html>[2].
But then you saw it:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
Or even worse:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
And here you thought you knew it all… So what's XHTML?
Well, first we need to understand HTML.
HTML
You may have noticed the word “Transitional” in the doctype declaration for HTML4 above, and the word “Strict” in the one for XHTML. Simply put, there are different ways for your browser to interpret documents depending on their type, and this is what the doctype was for. It so happens that HTML 4.01 came in multiple variants[3], namely Strict, Transitional and Frameset. In this day and age, HTML5 only has one doctype, the classic <!DOCTYPE html>, but browsers can still read older versions of HTML.
The purpose
So we need a little history lesson on the purpose of the doctype. Back then, HTML was based on SGML, an old standard for defining tag-based markup languages. The lines from earlier are in fact SGML[4]:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
Here the SGML declaration DOCTYPE is used to specify how to read the file, i.e. according to the specification of HTML 4.01 Transitional. The first quoted string is a public identifier that names that specification. The second is the address of its DTD (Document Type Definition), in our case http://www.w3.org/TR/html4/loose.dtd, which declares the set of tags, attributes and entities allowed in HTML 4.01 Transitional, so the stuff that can be parsed by your browser. That's why there are doctypes for MathML and SVG as well: these are applications of XML (itself a simplified subset of SGML), each with its own set of tags.
A doctype ensures the validity of a written document and that it is parsed the same way across browsers and machines. In our case, a doctype in an HTML document tells the browser the HTML was written following the standards specified in the doctype, and therefore it will be parsed following those standards. The browser then parses the document in a “standards-compliant mode”, simply called standards mode.
You might ask: browser modes??? Fair question.
Browser modes
Without a doctype, HTML would be parsed without any standards for reference, so the browser would assume “quirks mode” and emulate a legacy web engine from before the W3C standards were created. Back when our glorious choices were Internet Explorer or Netscape Navigator, people had double the work, as the two web engines worked differently and weren't compatible with one another. When those standards came out, so did the browser modes[5], otherwise it would have meant the death of anything written before.
Quirks mode would handle the legacy web made the old-fashioned way, and standards mode would make use of the doctypes and DTDs: a single way to process HTML and ensure compatibility across browsers, from the newer versions of Internet Explorer and Firefox to what we know today.
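To make the doctype-to-mode idea concrete, here is a toy sketch of the decision in Python. It only knows the handful of doctypes quoted in this post; real browsers check a much longer list of legacy identifiers, and they also have a third, in-between mode (called “limited-quirks” or “almost standards”) that this post doesn't otherwise cover. The function name `rendering_mode` and the tuple format are made up for illustration.

```python
# Toy subset of doctype sniffing. Assumption: any public identifier that is
# not listed here is treated as "standards", which is NOT true of very old
# doctypes (real browsers send those to quirks mode).
NEEDS_SYSTEM_ID = (
    "-//w3c//dtd html 4.01 transitional//",
    "-//w3c//dtd html 4.01 frameset//",
)
ALWAYS_LIMITED_QUIRKS = (
    "-//w3c//dtd xhtml 1.0 transitional//",
    "-//w3c//dtd xhtml 1.0 frameset//",
)


def rendering_mode(doctype):
    """Return the mode for a doctype.

    `doctype` is None when the page has no doctype at all, otherwise a
    (public_id, system_id) tuple where either part may be None.
    """
    if doctype is None:
        return "quirks"
    public_id, system_id = doctype
    public = (public_id or "").lower()  # identifiers are compared case-insensitively
    if public.startswith(NEEDS_SYSTEM_ID):
        # Without the DTD address these doctypes fall back to quirks mode.
        return "quirks" if system_id is None else "limited-quirks"
    if public.startswith(ALWAYS_LIMITED_QUIRKS):
        return "limited-quirks"
    return "standards"


print(rendering_mode(None))          # no doctype at all
print(rendering_mode((None, None)))  # <!DOCTYPE html>
print(rendering_mode(("-//W3C//DTD XHTML 1.0 Strict//EN",
                      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd")))
```

The takeaway is that the browser never downloads or validates against the DTD; it only looks at the identifiers to pick a mode.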
So HTML is a markup language telling browsers how to behave and display cool content like this site. But I would sound like a tech god if I said I used [X]HTML, so let's see…
XHTML
XHTML is “A Reformulation of HTML 4 in XML 1.0”. So XHTML is HTML + XML: an HTML document parsed as XML, which therefore has to respect the XML syntax[6]. HTML allows room for errors and the browser tries to work around them for you. Not XML: you'll just get an error message instead. Because an XML parser stops at the first error and has no recovery work to do, it is also said to be faster than an HTML parser. It's honestly great. I used it for a long time to teach myself HTML; it's very logical and you cannot learn bad habits from it, lest it blows up. I recommend it, 10/10. Make sure the file extension is .xhtml, use the doctype above and, to make sure your file is parsed as XML, set the MIME type to application/xhtml+xml, and you'll have a draconian version of HTML4 to practice with.
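The difference in error handling is easy to see outside a browser too. Below is a minimal sketch using Python's standard library, assuming a snippet with a `<p>` that is never closed. Note that Python's `html.parser` is only a tokenizer, not a browser engine, so it merely illustrates the forgiving attitude; the helper names `parse_as_xml` and `parse_as_html` are made up for this example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# The <p> element is never closed: fine for HTML, fatal for XML.
BROKEN = "<html><body><p>Hello World!</body></html>"


class TagLogger(HTMLParser):
    """Records every start and end tag the lenient HTML tokenizer sees."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def parse_as_xml(markup):
    """Return "ok" if the markup is well-formed XML, else the parser's error."""
    try:
        ET.fromstring(markup)
        return "ok"
    except ET.ParseError as err:
        return f"error: {err}"


def parse_as_html(markup):
    """Feed the markup to the HTML tokenizer and return the tags it reported."""
    logger = TagLogger()
    logger.feed(markup)
    logger.close()
    return logger.events


print(parse_as_xml(BROKEN))   # the XML parser gives up with an error message
print(parse_as_html(BROKEN))  # the HTML tokenizer just carries on
```

The XML side refuses the whole document because of one missing `</p>`, while the HTML side reports the tags it found and leaves the cleanup to whoever builds the page.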
Syntax
It should look like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="content-type" content="application/xhtml+xml; charset=utf-8" />
<title>XHTML file</title>
</head>
<body>
<h1>Hello World!</h1>
</body>
</html>
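The doctype tells the browser to use standards mode. The meta tag declares the content type of the document as application/xhtml+xml, but on its own that is only a label: what actually makes the browser use its XML parser is the MIME type the file is served with (or the .xhtml extension when you open it locally).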
If you press F12 and view the source code for the XHTML file just created, the console won't display it in colors. That's because it's being parsed as an XML file. Parsing as XML used to mean possible compatibility issues: I saw concerns about Internet Explorer 7 and earlier versions not supporting XHTML at all, so XHTML ended up being served as text/html and parsed like any other “tag soup”, as people used to call sloppy markup back in the 2000s. Luckily those browsers are long gone now. Heck, I don't even know if I was born when IE7 came out. In the end XHTML was simply abandoned in favor of HTML4's successor. So that raises the question: why even choose XHTML aside from practicing?
The purpose
Here's what I found on Stack Overflow[7]:
“What the fundamental point that most people seem to be missing, is the purpose behind XHTML. One of the major reasons for developing the XHTML specification was to de-emphasise presentation-related tags in the markup, and to defer presentation to CSS. Whilst this separation can be achieved with plain HTML, this behaviour isn't promoted by the specification.”
I don't know about 2009, when it was posted, but I can put all the CSS I want in my XHTML (served as text/html) and I can defer all I want in any HTML file. So it all comes down to choice. Granted, bad practice should be avoided, but both achieve the same goal in the exact same way.
And every time I read that “XHTML was not developed for use by itself, but by use with a variety of other technologies”, I have yet to see a concrete example of XHTML used with “other technologies”. It was conceived with interoperability in mind, but no one knows what, when or how. Don't ask me how to embed XML in HTML, I don't know anything about XML. Even if I did, I'm just writing a blog; I don't think I'll ever need to write XML in the middle of my web page.
The only example that comes to mind is writing MathML on a webpage, but HTML5 supports MathML nowadays. And even that example doesn't really hold up: MathML has its own doctype, so just use that instead.
Realisation
So the main CRITICAL difference that is relevant to casual static website enjoyers is the syntax. That's it. And nothing prevents you from writing well-formed HTML with properly nested tags, although I still think XHTML is the way to go for beginners.
XHTML vs HTML was a hot debate in the early 2000s because XHTML was supposed to become the new specification of the era. Then W3C decided to expand HTML 4.01 instead and develop HTML5. In retrospect, this has all been as petty as the console wars…
HTML5
Wake up, it's 2008: the first draft of HTML5 just came out and it's built to last… forever. It has no DTDs and no alternative versions, just the new and final doctype[8]:
<!DOCTYPE html>
If it could, HTML5 wouldn't have a doctype at all, as it is no longer based on SGML. But browser modes are still around, and to ensure we are using standards mode, we're stuck with that first line forever. So to answer the question: why use a doctype? It's to tell the browser to parse a document following standard specifications (XHTML, HTML 4, SVG, etc.), as said earlier. In the case of HTML5, the doctype only tells the browser to use standards mode: it's a cross-browser consistency thing more than a way of checking against a spec.
Since HTML5 is its own markup language, an .html file will now be read with the default MIME type text/html.
Let's Read
Let's read some code. I made a simple page in both HTML5 and XHTML, so let's compare the syntax.
Here for XHTML:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="content-type" content="application/xhtml+xml; charset=utf-8" />
<title>XHTML file</title>
</head>
<body>
<h1>Hello World!</h1>
</body>
</html>
And HTML5:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset=UTF-8>
<title>HTML file</title>
</head>
<body>
<h1>Hello World!</h1>
</body>
</html>
My point is: the doctype, html and meta tags take fewer characters. The first file is 344B whereas the second is only 138B. On a bigger file, the difference would hardly matter of course. But there are many other ways to shorten HTML while still keeping it valid, e.g. each line of an ordered list <ol> section is a “list item” <li>, which I never close because as soon as your browser sees a new <li> tag starting, it closes the previous one automatically[9].
You may object, but this is as valid as the XHTML written above:
<!DOCTYPE html><html lang="en"><title>HTML</title><p>Hello World!</html>
Well, the validator would complain that no character encoding is specified, but that's only a detail. So the HTML parser of your browser can “autocomplete” files, permitting shortcuts. Shorter files mean faster client/server transfer, and I can type less while still having valid files.
All in all, whether you want to optimize the transfer or the parsing, on today's internet it comes down to shaving off a few milliseconds.
By the Way
Did you know that HTML5 can be set to use the XML parser and therefore do everything XHTML does?
XHTML5 looks like this: (although it is still referred to as HTML5)
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>XHTML file</title>
</head>
<body>
<h1>Hello World!</h1>
</body>
</html>
Amazingly it still calls the XHTML namespace from 1999 and it does it with today's HTML.
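You can check that the XHTML5 file above really is plain XML by feeding it to any XML parser. Here is a small sketch with Python's standard library; the variable names and the `x` prefix are arbitrary choices for the example.

```python
import xml.etree.ElementTree as ET

# The XHTML5 document from above, unchanged.
XHTML5 = """<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>XHTML file</title>
</head>
<body>
<h1>Hello World!</h1>
</body>
</html>"""

root = ET.fromstring(XHTML5)  # would raise ParseError if not well formed

# Every element lives in the 1999 XHTML namespace declared on <html>.
NS = {"x": "http://www.w3.org/1999/xhtml"}
heading = root.find("x:body/x:h1", NS)

print(root.tag)      # {http://www.w3.org/1999/xhtml}html
print(heading.text)  # Hello World!
```

The short HTML5 doctype is accepted as-is by the XML parser, and the namespace from 1999 is what tells it (and the browser) that these tags mean HTML.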
Checking the file's validity on W3C tells us it's using the preset for XHTML + SVG 1.1 + MathML 3.0 + RDFa 1.1. So basically it covers most of the W3C markup languages in one go, which renders the doctype for XHTML 1.0 useless.
It is a .xhtml file, served with the application/xhtml+xml MIME type, and it follows the XML syntax, so no shortcuts. The doctype doesn't change and the meta tag can be omitted altogether. So it's still shorter than XHTML 1.0; the best of both worlds so far.
This is a better training ground for beginners. Now you can freely switch parsing methods just by editing the html tag and changing the file extension back to .html (well, XML-style self-closing tags such as <div /> won't validate as HTML, only void elements like <br /> get away with it, but whatever).
Well it seems that HTML5 is the absolute choice when it comes to web development.
HOLD IT!
Why aren't the HTML5 docs hosted on W3? Who are the new ringleaders[10]? For privacy/anonymity diehards that's enough proof that the internet is doomed; in fact it's been over for a long time. W3C wanted, as with every other version of HTML, a finished version of HTML5: a final version, before moving on to, probably, HTML6. Other contributors wanted a rolling release of HTML that could be worked on every day, forever.
So with an ongoing version of something, you'll never know when whoever's in charge will start fooling around like foolish fools, ruining everything for everyone. People using older versions of HTML are using static, fully finished versions of HTML.
So I guess if you think HTML5 is being maintained by the devil, don't use it. The internet was never shut down since it came out, so the old stuff is still compatible and will be forever, probably. That's actually the main problem in all this, as well as its solution. Since web devs are making sure everything stays compatible (forward-compatible and backward-compatible), the specification has become a behemoth of history that, at the end of the day, no one needs to care about anymore.
The Solution
The problem never was the language itself, it's the user agents. A well-designed page ought to be strictly parsed in standards mode[11]. It turns out that HTML5, HTML4 Strict and XHTML 1.0 Strict will do that with their declared doctypes.
My documentation has been a trip back to the past, so what about today?
Practically all user agents today accept both text/html and application/xhtml+xml, and more. So the question of using what, when and how is irrelevant nowadays. Quirks mode is a legacy thing now, the remnant of the old browser war between Netscape and Internet Explorer. It's useless, and no one is too lazy to add a doctype (please just do it). HTML5 supports MathML, SVG, etc. now, so all other doctypes can be considered obsolete.
Sure, since XHTML has to be well formed, parsing XML is supposedly faster than parsing HTML, and we know XHTML5 exists. Yadda yadda. Have you heard of caching? The real optimization of a website lies in the web server configuration.
So it would seem the most logical choice to make for anyone today is HTML5. It was obvious from the beginning but my curiosity is now satisfied.
HTML5 lore | “Copyright © WHATWG (Apple, Google, Mozilla, Microsoft).”