Hacker News

tempodox 12 hours ago [ - ]

The tragic thing is that you can't do `fgetwc()` on a `FILE *` produced by `fopencookie()` on Linux. glibc will crash your program deliberately as soon as there is a non-ASCII char in that stream (because, reasons?). But it does work with `funopen()` on a BSD, like macOS. I'm using that to read wide characters from UTF-8 streams.

o11c 12 hours ago [ - ]

Wide characters are best avoided even on platforms where it doesn't mean UTF-16. It's better to stay in UTF-8 mode, and only verify that it's well-formed.

tempodox 12 hours ago [ - ]

But at some point you'll want to know whether that code point you read `iswalpha()` or whatever, so you'll have to decode UTF-8 anyway.

thechao 10 hours ago [ - ]

At the parser-level, though; not down in the lexer. I intern unique user-defined strings (just with a hashcons or whatever the cool kids call it, these days). That defers the determination of correctness of UTF-kness to "someone else".