Hacker News

Theoretically yes. Practically there is character escaping.

That kills any non-allocation dreams. Moment you have "Hi \uxxxx isn't the UTF nice?" you will probably have to allocate. If the source is read-only, you have to allocate. If the source is mutable, you have to spend CPU rewriting the string.

deaddodo 3 days ago

I'm confused why this would be a problem. UTF-8 and UTF-16 (the only two common Unicode encodings) are a maximum of 4 bytes wide per code point (and, most commonly, 1 or 2 in English text). The ASCII representation you gave is 6 bytes wide. I don't know of many ASCII escape representations that have a smaller byte width than their native Unicode encoding.

The same goes for other characters such as \n, \0, \t, \r, etc. All are half the width in their native byte representation.
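To make the width argument concrete, here is a small sketch that compares the byte length of a few JSON-style escapes with the bytes they decode to. The sample set and the `widths` helper are illustrative assumptions, not taken from any parser. Note that for UTF-16 a two-character escape such as \n is the same width as its decoded form rather than wider.

```python
# Compare the byte length of a JSON-style escape with the bytes it decodes to.
# The samples and the helper name are hypothetical, chosen for illustration.
samples = {
    r"\n": "\n",
    r"\t": "\t",
    r"\u00e9": "\u00e9",            # e-acute, 2 bytes in UTF-8
    r"\u20ac": "\u20ac",            # euro sign, 3 bytes in UTF-8
    r"\ud83d\ude00": "\U0001F600",  # surrogate pair, 4 bytes in UTF-8
}


def widths(escaped: str, decoded: str) -> tuple:
    """Return (escape length, UTF-8 length, UTF-16 length), all in bytes."""
    return (
        len(escaped.encode("ascii")),
        len(decoded.encode("utf-8")),
        len(decoded.encode("utf-16-le")),
    )


for escaped, decoded in samples.items():
    esc, u8, u16 = widths(escaped, decoded)
    # The decoded form never needs more room than the escape it replaces.
    assert u8 < esc and u16 <= esc
    print(f"{escaped:>14}  escape={esc:2d}  utf-8={u8}  utf-16={u16}")
```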

lelanthran 4 days ago

> Moment you have "Hi \uxxxx isn't the UTF nice?" you will probably have to allocate.

Depends on what you are doing with it. If you aren't displaying it (and typically you are not in a server application), you don't need to unescape it.

mpyne 3 days ago

And this is indeed something that the C++ Glaze library supports, to allow for parsing into a string_view pointing into the original input buffer.
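As a rough illustration of the idea (not Glaze's actual API, which is C++), the sketch below scans a JSON string and returns a zero-copy slice of the original buffer, plus a flag saying whether any escapes were seen. The function name `string_body` and its return shape are assumptions made for this example, and it assumes well-formed input. Python's `memoryview` plays the role that `string_view` plays in C++.

```python
def string_body(buf: memoryview, start: int):
    """Scan a JSON string whose opening quote sits at buf[start].

    Returns (body, has_escapes, end): body is a slice of the original
    buffer between the quotes (no bytes are copied), has_escapes tells
    the caller whether unescaping would be needed before display, and
    end is the index just past the closing quote.
    """
    i = start + 1
    has_escapes = False
    while i < len(buf):
        c = buf[i]
        if c == 0x22:  # closing quote
            return buf[start + 1:i], has_escapes, i + 1
        if c == 0x5C:  # backslash: also skip the character it escapes
            has_escapes = True
            i += 1
        i += 1
    raise ValueError("unterminated string")


data = b'"abc" "a\\nb"'
view = memoryview(data)
body, has_escapes, end = string_body(view, 0)
print(bytes(body), has_escapes, end)
```

A caller that only routes or compares the value can keep the raw slice and skip unescaping entirely; only a consumer that needs the decoded text has to pay for it.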

_3u10 3 days ago

It’s just two pointers: the current place to write and the current place to read. Escapes are always more characters than they represent, so there’s no danger of the write pointer overtaking the read pointer. If you support compression this can become somewhat of an issue, but you simply support a max block size, which is usually defined by the compression algorithm anyway.
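A minimal sketch of that two-pointer scheme, assuming a mutable buffer that holds well-formed JSON string contents. The names `unescape_in_place` and `SIMPLE` are made up for this example. Python itself allocates small temporaries here, so this only illustrates the index arithmetic; in C the same loop would run over a `char` buffer with no allocation at all.

```python
# Single-character JSON escapes mapped to the byte they stand for.
SIMPLE = {
    ord("n"): 0x0A, ord("t"): 0x09, ord("r"): 0x0D, ord("b"): 0x08,
    ord("f"): 0x0C, ord('"'): 0x22, ord("\\"): 0x5C, ord("/"): 0x2F,
}


def unescape_in_place(buf: bytearray, start: int, end: int) -> int:
    """Decode JSON escapes in buf[start:end] in place; return the new end.

    Assumes well-formed input. The write index w never passes the read
    index r, because every escape is longer than the bytes it decodes to.
    """
    r = w = start
    while r < end:
        c = buf[r]
        if c != 0x5C:  # ordinary byte: copy it down
            buf[w] = c
            r += 1
            w += 1
            continue
        kind = buf[r + 1]
        if kind == ord("u"):
            cp = int(buf[r + 2:r + 6], 16)
            r += 6
            # Combine a high surrogate with a following \uXXXX low surrogate.
            if 0xD800 <= cp < 0xDC00 and buf[r:r + 2] == b"\\u":
                low = int(buf[r + 2:r + 6], 16)
                if 0xDC00 <= low < 0xE000:
                    cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00)
                    r += 6
            out = chr(cp).encode("utf-8", "surrogatepass")
            buf[w:w + len(out)] = out  # same-length slice write, no resize
            w += len(out)
        else:
            buf[w] = SIMPLE[kind]
            r += 2
            w += 1
    return w


text = bytearray(b"Hi \\u00e9\\n\\ud83d\\ude00!")
new_end = unescape_in_place(text, 0, len(text))
print(bytes(text[:new_end]).decode("utf-8"))
```

The bytes between the returned index and the old end are leftover garbage, so the caller has to track the new length instead of relying on the original closing quote.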

Ygg2 2 days ago

If you have a place to write, then it's not zero allocation. You did an allocation.

And usually if you want maximum performance, buffered read is the way to go, which means you need a write slab allocation.

lelanthran 2 days ago

> If you have a place to write, then it's not zero allocation. You did an allocation.

Where did that allocation happen? You can write into the buffer you're reading from, because the replacement data is shorter than the original data.

Ygg2 2 days ago

You have a read buffer and somewhere where you have to write to.

Even if we pretend that the read buffer is not allocating (plausible), you will have to allocate for the write destination in the general case (think GiB or TiB of XML or JSON).

lelanthran 2 days ago

> You have a read buffer and somewhere where you have to write to.

The "somewhere you have to write to" is the same buffer you are reading from.

Ygg2 2 days ago

Not if you are doing buffered reads, where you replace slow file access with fast memory access. This buffer is cleared every X bytes processed.

Writing to it would be pointless because clears obliterate anything written; or inefficient because you are somehow offsetting clears, which would sabotage the buffered reading performance gains.
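The hazard can be sketched with a single reusable read buffer: any zero-copy slice into it (and anything unescaped in place inside it) is overwritten by the next refill. The 8-byte chunk size and the `io.BytesIO` source standing in for a file are assumptions chosen so the effect is easy to see.

```python
import io

source = io.BytesIO(b'"hello" "world"')  # stands in for a file or socket
chunk = bytearray(8)                     # the one reusable read buffer
window = memoryview(chunk)

source.readinto(chunk)                   # chunk now holds b'"hello" '
first = window[1:6]                      # zero-copy slice over b'hello'
before = bytes(first)                    # explicit copy, i.e. an allocation

source.readinto(chunk)                   # refill overwrites the same memory
after = bytes(first)                     # the same slice now reads new data

print(before, after)
```

Keeping a value past the next refill therefore means copying it out first, which is the allocation being argued about.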

lelanthran a day ago

Maybe I missed it, but ITT we were talking about C buffers, not buffered reads.

Ygg2 a day ago

I thought we were talking about high-performance parsing, of which buffered reads are one approach. The other is loading the entire document into mutable memory, which also has limitations.

topspin 3 days ago

> Practically there is character escaping

The voice of experience appears. Upvoted.

It is conceivable to deal with escaping in-place, and thus remain zero-alloc. It's hideous to think about, but I'll bet someone has done it. Dreams are powerful things.