UTF-8 is great and I wish everything used it (looking at you JavaScript). But it does have a wart in that there are byte sequences which are invalid UTF-8 and how to interpret them is undefined. I think a perfect design would define exactly how to interpret every possible byte sequence even if nominally "invalid". This is how the HTML5 spec works and it's been phenomenally successful.

For security reasons, the correct answer on how to process invalid UTF-8 is (and needs to be) "throw away the data like it's radioactive, and return an error." Otherwise you leave yourself wide open to validation-bypass attacks at many layers of your stack.
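As an illustration (a minimal Python sketch; the input bytes are made up around the classic overlong encoding of "/", not taken from anything in this thread):

    import sys

    # 0xC0 0xAF is an overlong encoding of "/" and therefore invalid UTF-8;
    # overlong sequences like this have been used in real directory-traversal
    # exploits to slip past naive path filters.
    payload = b"..\xc0\xaf..\xc0\xafetc\xc0\xafpasswd"

    try:
        path = payload.decode("utf-8")  # strict decoding is Python's default
    except UnicodeDecodeError as exc:
        # Refusing the data outright means no later layer can quietly
        # reinterpret the overlong bytes as "/" after validation has run.
        print(f"rejected invalid UTF-8: {exc}", file=sys.stderr)
        raise SystemExit(1)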

This is rarely the correct thing to do. Users don't particularly like it if you refuse to process a document because it has an error somewhere in there.

Even for identifiers, you probably want to do all kinds of normalization even beyond the level of UTF-8, so things like overlong sequences and other encoding errors are really not an inherent security issue.
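For instance, Python NFKC-normalizes source identifiers, which collapses distinctions far above the byte level (a small sketch; the "ﬁle" identifier with the fi ligature is just an illustrative example):

    import unicodedata

    # Two code-point sequences that should usually count as the same
    # identifier: one uses the U+FB01 "fi" ligature, the other plain "fi".
    a = "\ufb01le"   # "ﬁle"
    b = "file"

    print(a == b)                                   # False
    print(unicodedata.normalize("NFKC", a) == b)    # True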

This is only true because the interpretation is not defined, so different implementations do different things.

That's not true. You're just not allowed to interpret them as characters.

> This is how the HTML5 spec works and it's been phenomenally successful.

Unicode does have a completely defined way to interpret invalid UTF-8 byte sequences: replace them with U+FFFD, the "replacement character". You'll see it used (for example) in browsers all the time.

Mandating acceptance of every invalid input works well for HTML because HTML is meant to be consumed (primarily) by humans. It's not done for UTF-8 because in some situations it's much more useful to detect and report errors than to make an automatic correction that can't be detected after the fact.
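Concretely (a small Python sketch, with an arbitrary Latin-1 byte standing in for the invalid input):

    # errors="replace" substitutes U+FFFD for each invalid sequence, which is
    # the behaviour browsers show when they hit bad UTF-8.
    bad = b"caf\xe9"    # Latin-1 "café", not valid UTF-8
    print(bad.decode("utf-8", errors="replace"))    # caf\ufffd

    # The cost: once replaced, the result is indistinguishable from input that
    # genuinely contained U+FFFD, so the correction cannot be detected or
    # undone downstream.
    literal_fffd = "caf\ufffd".encode("utf-8")
    assert literal_fffd.decode("utf-8", errors="replace") == \
           bad.decode("utf-8", errors="replace")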

> But it does have a wart in that there are byte sequences which are invalid UTF-8 and how to interpret them is undefined.

This is not a wart. And how to interpret them is not undefined -- you're just not allowed to interpret them as _characters_.

There is currently a discussion[0] about adding a garbage-in/garbage-out mode to jq/jaq/etc that allows them to read and output JSON with invalid UTF-8 strings representing binary data in a way that round-trips. I'm not in favor of making that the default for jq, and you have to be very careful to make sure that all the tools you use to handle such "JSON" round-trip the data. But the clever thing is that the proposed changes indeed do not interpret invalid byte sequences as character data, so they stay within the bounds of Unicode as long as your terminal (if these binary strings end up there) and other tools also do the same.

[0] https://github.com/01mf02/jaq/issues/309
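(Not how the linked proposal is implemented, but as an analogy: Python's "surrogateescape" error handler is one existing way to round-trip arbitrary bytes through a string without ever treating them as real characters.)

    # Each invalid byte is mapped to a lone surrogate in U+DC80..U+DCFF, so
    # the original bytes come back exactly on re-encoding.
    raw = b'{"data": "\xff\xfe binary \x80"}'

    s = raw.decode("utf-8", errors="surrogateescape")
    assert s.encode("utf-8", errors="surrogateescape") == raw

    # The string s still isn't valid character data; printing it or passing
    # it to a tool that insists on well-formed UTF-8 will fail or garble it,
    # which is why every tool in the pipeline has to play the same game.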