UTF-8 is as good a design as could be expected, but Unicode has scope creep issues. What should be in Unicode?

Coming at it naively, people might think the scope is something like "all sufficiently widespread distinct, discrete glyphs used by humans for communication that can be printed". But that's not true, because

* It's not discrete. Some code points are for combining with other code points.

* It's not distinct. Some glyphs can be written in multiple ways. Some glyphs which (almost?) always display the same have different code points and meanings.

* It's not all printable. Control characters are in there - they pretty much had to be, for compatibility with ASCII, but Unicode has added plenty of its own.
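The "not discrete" and "not distinct" points can be seen directly in Python's standard library; a minimal sketch:

```python
import unicodedata

# "é" as one precomposed code point vs. "e" plus a combining accent.
precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
combining = "e\u0301"    # U+0065 + U+0301 COMBINING ACUTE ACCENT

# Both render identically, but the code point sequences differ.
print(precomposed == combining)  # False

# Normalization (NFC here) folds the two spellings together.
print(unicodedata.normalize("NFC", combining) == precomposed)  # True
```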

I'm not aware of any Unicode code points that are animated - at least what's printable is printable on paper and not just on screen; there are no marquee or blink control characters, thank God. But who knows when that invariant will fall too.

By the way, I know of one UTF encoding the author didn't mention: UTF-7. Like UTF-8, but assuming the high bit wasn't safe to use (apparently a sensible precaution over networks in the 80s). My boss managed to send me an email encoded in UTF-7 once; that's how I know what it is. I don't know how he managed to send it, though.
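Python still ships a UTF-7 codec, so the scheme is easy to poke at. Non-ASCII runs are base64-encoded UTF-16 between `+` and `-`, keeping every byte within 7-bit ASCII:

```python
text = "héllo"

# 'é' (U+00E9) can't be sent directly, so it becomes the shifted
# base64 run +AOk-; everything else passes through as plain ASCII.
encoded = text.encode("utf-7")
print(encoded)  # b'h+AOk-llo'

assert max(encoded) < 0x80            # no byte has the high bit set
assert encoded.decode("utf-7") == text  # round-trips losslessly
```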

Indeed, one pain point of Unicode is CJK (Han) unification. https://heistak.github.io/your-code-displays-japanese-wrong/

The fact that there is seemingly no interest in fixing this, and that if you want Chinese and Japanese in the same document you're just fucked, forever, is crazy to me.
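To make the problem concrete, here's a small sketch: a unified ideograph is a single code point, even though its preferred glyph shape differs between Chinese and Japanese typography, so the code point alone can't tell a renderer which shape to draw.

```python
import unicodedata

# U+76F4 is one unified code point shared across Chinese and
# Japanese, despite the glyph being drawn differently in each.
char = "\u76f4"
print(unicodedata.name(char))  # CJK UNIFIED IDEOGRAPH-76F4
```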

They should add separate code points for each variant and at least make it possible to avoid the problem in new documents. I've heard the arguments against this before, but the longer you wait, the worse the problem gets.

AFAIK there are some language hints nowadays, but it's kind of a hack

What happens if you want both single-storey "a" and double-storey "a" in the same document? You use a different font.

I won't even touch the fact that what you're talking about is just a stylistic difference rather than a language-based one, and will instead say this: what if you want the Cyrillic letter А and the Latin letter A, which are not just similar glyphs but literally visually identical, in the same document? Oh wait, both of those have separate Unicode code points. But if you want Chinese and Japanese characters, which do not look identical, in the same document, you have to resort to changing fonts? What if you're using an encoding that doesn't support specifying fonts? Your non-response doesn't solve anything and helps no one
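The Cyrillic/Latin point is easy to demonstrate: the two letters look the same in most fonts but are fully distinct at the code point level.

```python
import unicodedata

latin = "A"      # U+0041
cyrillic = "А"   # U+0410, usually rendered identically

print(latin == cyrillic)           # False: distinct code points
print(unicodedata.name(latin))     # LATIN CAPITAL LETTER A
print(unicodedata.name(cyrillic))  # CYRILLIC CAPITAL LETTER A
```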

Some fonts include both alternative glyphs

Why is the language tag not used to signal a variant?

That doesn't help in a mixed Chinese-Japanese document

Why not? You're not limited to a single tag per document; you can tag every mixed part with the appropriate language

That's not the only granularity of mixed text. A Chinese textbook about the Japanese language will have sentences where the languages are mixed

You still haven't explained what the issue is

Chinese textbook: <ch>Chinese <jp>Mixed Japanese</jp> continue Chinese.</ch>

UTF-7 is/was mostly for email, which is not an 8-bit clean transport. It is obsolete and can't encode supplemental planes (except via surrogate pairs, which were meant for UTF-16).

There is also UTF-9, from an April Fools RFC, meant for use on hosts with 36-bit words such as the PDP-10.

I meant to specify: the aim of UTF-7 is better served by using UTF-8 with `Content-Transfer-Encoding: quoted-printable`
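A quick sketch of that alternative with Python's `quopri` module: the UTF-8 bytes stay as-is where they're ASCII, and each high byte becomes an `=XX` escape, so the result is 7-bit safe just like UTF-7 output.

```python
import quopri

raw = "héllo".encode("utf-8")
encoded = quopri.encodestring(raw)

# The 'é' (UTF-8 bytes C3 A9) shows up as =C3=A9.
print(encoded)

assert max(encoded) < 0x80               # 7-bit clean
assert quopri.decodestring(encoded) == raw  # lossless round trip
```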

The proposed solution is the problem here. Add obscure stuff to the standard, and not everything will support it well. We got something decent in the end: different languages' scripts will mostly show up fine on all sorts of computers. Apple's stuff, like every possible combination of skin tone and gender in family emoji, might not.

Unicode wanted the ability to losslessly round-trip every other encoding, in order to be easy to adopt partially in a world where other encodings were still in use. It merged a bunch of different incomplete encodings that used competing approaches. That's why there are multiple ways of encoding the same characters, and no overall consistency to it. It's hard to say whether that was a mistake. This level of interoperability may have been necessary for Unicode to actually win, and not become another episode of https://xkcd.com/927

Why did Unicode want codepointwise round-tripping? One codepoint in a legacy encoding becoming two in Unicode doesn't seem like it should have been a problem. In other words, why include precomposed characters in Unicode?
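One answer, sketched with Latin-1 as the legacy encoding: one-to-one code point mapping means a converted string keeps the same character count, so length and indexing assumptions in partially converted software survive.

```python
legacy = b"\xe9"              # 'é' in Latin-1: one byte, one character
u = legacy.decode("latin-1")  # maps to the single precomposed U+00E9
assert u == "\u00e9"

# Had Unicode only offered "e" + combining accent, this one legacy
# character would have become two code points after conversion.
print(len(u))  # 1

assert u.encode("latin-1") == legacy  # lossless round trip back
```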

> * It's not discrete. Some code points are for combining with other code points.

This isn't "scope creep". It's a reflection of reality. People were already constructing compositions like this is real life. The normalization problem was unavoidable.