Hacker News

"if only for the fact that a surprising amount of people adamantly refuse to believe that they are valid... And even the people who are eventually convinced that it’s valid still think that it is technically incorrect in some unspecified way."

Speaking from my personal experience, if your idea of "valid HTML" was created in the late 1990s or early 2000s, it's worth a spin through the current HTML standard. HTML has always de facto been permissive, but de jure it had certain requirements. However, HTML 5 essentially works by reifying a very, very well-specified algorithm for how to handle HTML "loosely" (even though it is very strictly specified), and then refactors away effectively every requirement it possibly can and defers them to that algorithm instead.

Technically speaking, as long as you put down the correct doctype, you can elide almost anything nowadays and get a functional document; for instance, "<!DOCTYPE html><title>Hello</title>" is fully standards compliant now (push it through [1]). Only thing the validator gives is a warning that you might like to specify a language in the doctype. It isn't just "browsers will pretty much do the 'right thing'" with that, which has been true for a long time... that's actually standards-compliant HTML now.

What a lot of old hands don't understand is that HTML 5 was a seismic shift in how HTML is specified. Instead of specifying a rigid language and then pretending the world is complying and it's super naughty of them not to, it defines a standard for extracting a DOM tree from effectively any soup of characters you can throw at it, compliance is loosened as much as is practical, and even when things don't comply there's a specification on exactly how to pick up the pieces. HTML 5 has a completely different philosophy than HTML 4 and before.

(Relatedly, the answer to the frequently-asked question "What is the BeautifulSoup equivalent for $LANGUAGE", at least as far as parsing, is effectively now "Find an HTML 5-compliant parser", which they all have now. Beautiful Soup's parsing philosophy was enshrined into the standard.)

[1]: https://validator.w3.org/nu/#textarea

It’s fair to point out the big difference in parsing philosophy between HTML 2–4 and HTML 5, but what I’m talking about happened before HTML5 as well. Some people can’t handle the fact that HTML intentionally has implied elements.

> <!DOCTYPE html><title>Hello</title>" is fully standards compliant now

Sure, but switch the doctype and put a <p> on the end, and it’s fully standards compliant HTML 4.01 Strict too. And yet so many people are adamant that it can’t be. That it’s invalid (even though a validator says it’s valid). That it’s relying on error handling (even though the spec. says otherwise). That some browsers parse it wrong (but they can never name one). That the DOM ends up broken (when browser dev tools show a normal DOM). That you need <html> and <body> elements (even though it already has both). That there’s something wrong with it at a technical level (even though they cannot describe what).

The concept “This is correct HTML that works everywhere with no error handling” is very difficult for some people to grasp, to a genuinely surprising degree.