I specialize in protocol design, unfortunately. A while ago I had to write some Unicode conversion routines from scratch, and I must say I absolutely admire UTF-8. Unicode per se is a dumpster fire, likely for objective reasons, and dealing with multiple Unicode encodings is a minefield. I even made an angry write-up back then: https://web.archive.org/web/20231001011301/http://replicated...

UTF-8 made it all relatively neat back in the day. There are still ways to throw a wrench into the gears, though. For example, how do you handle UTF-8-encoded surrogate pairs (the CESU-8 pattern)? But at least one can filter those out as suspicious/malicious input; see the sketch below.
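Roughly what I mean by filtering, in plain Python (the helper name is made up, and it only flags offsets rather than repairing anything):

    def find_utf8_encoded_surrogates(data: bytes) -> list[int]:
        # Surrogate code points (U+D800..U+DFFF) smuggled into UTF-8
        # always take the 3-byte form ED A0..BF 80..BF. Strict UTF-8
        # forbids them; seeing them usually means CESU-8, WTF-8, or
        # deliberately malformed input.
        offsets = []
        for i in range(len(data) - 2):
            if (data[i] == 0xED
                    and 0xA0 <= data[i + 1] <= 0xBF
                    and 0x80 <= data[i + 2] <= 0xBF):
                offsets.append(i)
        return offsets

    # U+1F600 encoded CESU-8 style as a surrogate pair: two hits.
    print(find_utf8_encoded_surrogates(b'ok\xed\xa0\xbd\xed\xb8\x80'))  # [2, 5]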

> For example, how do you handle UTF-8 encoded surrogate pairs?

Surrogate pairs aren’t applicable to UTF-8. That part of the Unicode code space is simply invalid in UTF-8 and should be treated as such (a parsing error, replacement characters, etc.).
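CPython’s built-in UTF-8 codec, for one, does exactly that: strict rejection by default, replacement characters on request:

    raw = b'\xed\xa0\x80'  # U+D800, a lone high surrogate, in UTF-8 form

    try:
        raw.decode('utf-8')  # strict decoding: a hard parsing error
    except UnicodeDecodeError as e:
        print(e.reason)      # invalid continuation byte

    # Or degrade to replacement characters instead of failing:
    print(raw.decode('utf-8', errors='replace'))  # '\ufffd\ufffd\ufffd'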

In theory, yes. In practice, there are throngs of parsers and converters that handle such cases differently. https://seriot.ch/projects/parsing_json.html

I mean hopefully not, but the linked example is about JSON parsing, not UTF-8.

A big chunk of the bugs there are Unicode-related; that is my point. When people parse JSON, they don’t realize that they are also parsing Unicode.
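Concretely, JSON’s \uXXXX escapes force every JSON parser to make Unicode decisions. Using CPython’s json module as the example (other parsers behave differently, which is exactly the problem):

    import json

    # A correctly paired \u escape must be combined into one code point:
    print(json.loads('"\\ud83d\\ude00"'))  # U+1F600

    # A lone escaped surrogate is accepted by CPython's json module and
    # produces a Python string that can't round-trip through UTF-8.
    # Other parsers reject it or substitute U+FFFD -- hence the bug zoo.
    s = json.loads('"\\ud800"')
    try:
        s.encode('utf-8')
    except UnicodeEncodeError:
        print('lone surrogate survived JSON parsing')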

> Unicode per se is a dumpster fire

Maybe when it comes to emoji, but otherwise, no, Unicode is not a dumpster fire. Unicode is elegant, and the things people complain about in Unicode are actually problems in the human scripts themselves.