Hacker News

The fact that you advocate using a BOM with UTF-8 tells me that you run Windows. Any long-term Unix user has probably seen this error message before (copy and pasted from an issue report I filed just 3 days ago):

    bash: line 1:  #!/bin/bash: No such file or directory

If you've got any experience with Linux, you probably suspect the problem already. If your only experience is with Windows, you might not realize the issue. There's an invisible U+FEFF lurking before the `#!`. So instead of that shell script starting with the `#!` character pair that tells the Linux kernel "The application after the `#!` is the application that should parse and run this file", it actually starts with `<FEFF>#!`, which has no meaning to the kernel. The way this script was invoked meant that Bash did end up running the script, with only one error message (because the line did not start with `#` and therefore it was not interpreted as a Bash comment) that didn't matter to the actual script logic.

This is one of the more common problems caused by putting a BOM in UTF-8 files, but there are others. The issue is that adding a BOM, as can be seen here, *breaks the promise of UTF-8*: that a UTF-8 file that contains only codepoints below U+007F can be processed as-is, and legacy logic that assumes ASCII will parse it correctly. The Linux kernel is perfectly aware of UTF-8, of course, as is Bash. But the kernel logic that looks for `#!`, and the Bash logic that look for a leading `#` as a comment indicator to ignore the line, do *not* assume a leading U+FEFF can be ignored, nor should they (for many reasons).

What should happen is that these days, every application should assume UTF-8 if it isn't informed of the format of the file, unless and until something happens to make it believe it's a different format (such as reading a UTF-16 BOM in the first two bytes of the file). If a file fails to parse as UTF-8 but there are clues that make another encoding sensible, reparsing it as something else (like Windows-1252) might be sensible.

But putting a BOM in UTF-8 causes more problems than it solves, because it *breaks* the fundamental promise of UTF-8: ASCII compatibility with Unicode-unaware logic.

mikelabatt 3 days ago [ - ]

I like your answer, and the others too, but I suspect I have an even worse problem than running Windows: I am an Amiga user :D

The Amiga always used all 8 bits (ISO-8859-1 by default), so detecting UTF-8 without a BOM is not so easy, especially when you start with an empty file, or in some scenario like the other one I mentioned.

And it's not that Macs and PCs don't have 8-bit legacy or coexistence needs. What you seem to be saying is that compatibility with 7-bit ASCII is sacred, whereas compatibility with 8-bit text encodings is not important.

Since we now have UTF-8 files with BOMs that need to be handled anyway, would it not be better if all the "Unicode-unaware" apps at least supported the BOM (stripping it, in the simplest case)?

rmunn 3 days ago [ - ]

"... would it not be better if all the "Unicode-unaware" apps at least supported the BOM (stripping it, in the simplest case)?"

What that question means is that the Unicode-unaware apps would have to become Unicode-aware, i.e. be rewritten. And that would entirely defeat the purpose of backwards-compatibility with ASCII, which is the fact that you don't have to rewrite 30-year-old apps.

With UTF-16, the byte-order mark is necessary so that you can tell whether uppercase A will be encoded 00 41 or 41 00. With UTF-8, uppercase A will always be encoded 41 (hex, or 65 decimal) so the byte-order mark serves no purpose except to signal "This is a UTF-8 file". In an environment where ISO-8859-1 is ubiquitous, such as the Web fifteen years ago, the signal "Hey, this is a UTF-8 file, not ISO-8859-1" was useful, and its drawbacks (BOM messing up certain ASCII-era software which read it as a real character, or three characters, and gave a syntax error) cost less then the benefits. But now that more than 99% of files you'll encounter on the Web are UTF-8, that signal is useful less than 1% of the time, and so the costs of the BOM are now more expensive than the benefits (in fact, by now they are a lot more expensive than the benefits).

As you can see from the paragraph above, you're not reading me quite right when you say that I "seem to be saying is that compatibility with 7-bit ASCII is sacred, whereas compatibility with 8-bit text encodings is not important". Compatibility with 8-bit text encodings WAS important, precisely because they were ubiquitous. It IS no longer important in a Web context, for two reasons. First, because they are less than 1% of documents and in the contexts where they do appear, there are ways (like HTTP Content-Encoding headers or HTML charset meta tags) to inform parsers of what the encoding is. And second, because UTF-8 is stricter than those other character sets and thus should be parsed first.

Let me explain that last point, because it's important in a context like Amiga, where (as I understand you to be saying) ISO-8859-1 documents are still prevalent. If you have a document that is actually UTF-8, but you read it as ISO-8859-1, it is 100% guaranteed to parse without the parser throwing any "this encoding is not valid" errors, BUT there will be mistakes. For example, å will show up as Ã¥ instead of the å it should have been, because å (U+00E5) encodes in UTF-8 as 0xC3 0xA5. In ISO-8859-1, 0xC3 is Ã and 0xA5 is ¥. Or ç (U+00E7), which encodes in UTF-8 as 0xC3 0xA7, will show up in ISO-8859-1 as Ã§ because 0xA7 is §.

(As an aside, I've seen a lot of UTF-8 files incorrectly parsed as Latin-1 / ISO-8859-1 in my career. By now, if I see Ã followed by at least one other accented Latin letter, I immediately reach for my "decode this as Latin-1 and re-encode it as UTF-8" Python script without any further investigation of the file, because that Ã, 0xC3, is such a huge clue. It's already rare in European languages, and the chances of it being followed by ¥ or § or indeed any other accented character in any real legacy document are so vanishingly small as to be nearly non-existent. This comment, where I'm explicitly citing it as an example of misparsing, is actually the only kind of document where I would ever expect to see the sequence Ã§ as being what the author actually intended to write).

Okay, so we've established that a file that is really UTF-8, but gets incorrectly parsed as ISO-8859-1, will NOT cause the parser to throw out any error messages, but WILL produce incorrect results. But what about the other way around? What about a file that's really ISO-8859-1, but that you incorrectly try to parse as UTF-8? Well, NEARLY all of the time, the ISO-8859-1 accented characters found in that file will NOT form a correct UTF-8 sequence. In 99.99% (and I'm guessing you could end up with two or three more nines in there) of actual ISO-8859-1 files designed for human communication (as opposed to files deliberately designed to be misparsed), you won't end up with a combination of accented Latin characters that just happen to match a valid UTF-8 sequence, and it's basically impossible for ALL the accents in an ISO-8859-1 document to just so happen to be valid UTF-8 sequences. In theory it could happen, but your chances of being struck by a 10-kg meteorite while sitting at your computer are better than of that happening by chance. (Again, I'm excluding documents deliberately designed with malice aforethought, because that's not the main scenario here). Which means that if you parse that unknown file as UTF-8 and it wasn't UTF-8, your parser will throw out an error message.

So when you encounter an unknown file, that has a 90% chance of being ISO-8859-1 and a 10% chance of being UTF-8, you might think "Then I should try parsing it in ISO-8859-1 first, since that has a 90% chance of being right, and if it looks garbled then I'll reparse it". But "if it looks garbled" needs human judgment. There's a better way. Parse it in UTF-8 first, in strict mode where ANY encoding error makes the entire parse be rejected. Then if the parse is rejected, re-parse it in ISO-8859-1. If the UTF-8 parser parses it without error, then either it was an ISO-8859-1 file with no accents at all (all characters 0x7F or below, so that the UTF-8 encoding and the ISO-8859-1 encoding are identical and therefore the file was correctly parsed), or else it was actually a UTF-8 file and it was correctly parsed. If the UTF-8 parser rejects the file as having invalid byte sequences, then parse it as the 8-bit encoding that is most likely in your context (for you that would be ISO-8859-1, for the guy in Japan who commented it would likely be Shift-JIS that he should try next, and so on).

That logic is going to work nearly 100% of the time, so close to 100% that if you find a file it fails on, you had better odds of winning the lottery. And that logic does not require a byte-order mark; it just requires realizing that UTF-8 is a rather strict encoding with a high chance of failing if it's asked to parse files that are actually from a different legacy 8-bit encoding. And that is, in fact, one of UTF-8's strengths (one guy elsewhere in this discussion thought that was a weakness of UTF-8) precisely because it means it's safe to try UTF-8 decoding first if you have an unknown file where nobody has told you the encoding. (E.g., you don't have any HTTP headers, HTML meta tags, or XML preambles to help you).

NOW. Having said ALL that, if you are dealing with legacy software that you can't change which is expecting to default to ISO-8859-1 encoding in the absence of anything else, then the UTF-8 BOM is still useful in that specific context. And you, in particular, sound like that's the case for you. So go ahead and use a UTF-8 BOM; it won't hurt in most cases, and it will actually help you. But MOST of the world is not in your situation; for MOST of the world, the UTF-8 BOM causes more problems than it solves. Which is why the default for ALL new software should be to try parsing UTF-8 first if you don't know what the encoding is, and try other encodings only if the UTF-8 parse fails. And when writing a file, it should always be UTF-8 without BOM unless the user explicitly requests something else.

mikelabatt 3 days ago [ - ]

Even the Amiga with its 8-bit text encoding was 40 years ago. Are you saying that for some radical reason modern apps on any platform should refuse to process a BOM? Parsing (skipping) a simple BOM header isn't the same as becoming fully Unicode-aware. I did not invent the BOM for UTF-8, it's there in the wild. We better be able to read it, or else we will have this religious debate (and technical issues porting and parsing texts across platforms) for the next 40 years.

rmunn 3 days ago [ - ]

That's not what I'm saying at all, I'm saying that in the absence of a BOM header a Unicode-aware app should guess UTF-8 first and then guess other likely encodings second, because the chance of false positives on the "is this UTF-8?" guess is practically indistinguishable from zero. If it isn't UTF-8, the UTF-8 parsing attempt is nearly guaranteed to fail, so it's safe to do first.

I'm also saying that apps should not create a BOM header any more (in UTF-8 only, not in UTF-16 where it's required), because the costs of dealing with BOM headers are higher than they're worth. Except in certain specific circumstances, like having to deal with pre-Unicode apps that default to assuming 8-bit encodings.

mikelabatt 2 days ago [ - ]

Makes sense, thank you. The observation about false positives for UTF-8 tending to zero helps understand. So I will vote for UTF-8 without BOM from now on (while encouraging parsers to deal with it, if present).

3036e4 3 days ago [ - ]

Also some XML parsers I used choked on UTF-8 BOMs. Not sure if valid XML is allowed to have anything other than clean ASCII in the first few characters before declaring what the encoding is?

rmunn 3 days ago [ - ]

My search also turned this up:

https://x.com/jbogard/status/1111328911609217025

If that link doesn't work, then try:

https://xcancel.com/jbogard/status/1111328911609217025

Source (which will explain the joke for anyone who didn't get it immediately):

https://www.jimmybogard.com/the-curious-case-of-the-json-bom...

rmunn 3 days ago [ - ]

Not ALL of the 20th-century Internet has bit-rotted and fallen apart yet. (Just most of it).