Of course. You can do it in a single pass: just parse the token stream. There are various implementations, like https://zserge.com/jsmn/
It requires manual allocation of an array of tokens. So it needs a backing "stack vector" of sorts.
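A minimal sketch of that pattern with jsmn (the caller-provided token array is the "stack vector"; each token is just a pair of offsets into the input, so nothing is copied):

#include <stdio.h>
#include <string.h>
#include "jsmn.h"

int main(void)
{
    const char *js = "{\"name\": \"widget\", \"count\": 3}";
    jsmn_parser p;
    jsmntok_t tokens[16];            /* fixed backing array, no malloc */

    jsmn_init(&p);
    int n = jsmn_parse(&p, js, strlen(js), tokens, 16);
    if (n < 0)
        return 1;                    /* JSMN_ERROR_NOMEM / _INVAL / _PART */

    for (int i = 0; i < n; i++)      /* each token is (start, end) into js */
        printf("token %d: %.*s\n", i,
               tokens[i].end - tokens[i].start, js + tokens[i].start);
    return 0;
}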
And what about escapes?
For escapes you can mutate the raw buffer in place, since an escape sequence always decodes to fewer bytes than it occupies in the source.
Yes, you can do it with minimal allocations, provided that the source buffer is read-only, or is mutable but not used again directly by the caller. If the buffer is mutable, any un-escaping can be done in place, because the un-escaped string is always shorter. All the substrings you want are already in the source buffer; you just need a growable array of pointer/length pairs to know where tokens start.
Yes! The JSON library I wrote for the Zephyr RTOS does this. Say, for instance, you have the following struct:
struct SomeStruct {
    char *some_string;
    int some_number;
};

You would need to declare a descriptor, linking each field to how it's spelled in the JSON (e.g. the some_string member could be "some-string" in the JSON), the byte offset of the field from the beginning of the struct (using the offsetof() macro), and the type.

The parser is then able to go through the JSON and initialize the struct directly, as if you had reflection in the language. It'll validate the types as well. All this without having to allocate a node type, perform copies, or things like that.
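For a flavor of what that looks like, here is a sketch from memory (based on Zephyr's public json.h; exact macro names, signatures, and field types may differ between versions):

#include <zephyr/data/json.h>

struct SomeStruct {
    char *some_string;
    int some_number;
};

static const struct json_obj_descr some_struct_descr[] = {
    /* JSON key "some-string" -> SomeStruct.some_string; the macro records
     * the offsetof() and the expected token type. */
    JSON_OBJ_DESCR_PRIM_NAMED(struct SomeStruct, "some-string",
                              some_string, JSON_TOK_STRING),
    JSON_OBJ_DESCR_PRIM(struct SomeStruct, some_number, JSON_TOK_NUMBER),
};

int parse_some_struct(char *buf, size_t len, struct SomeStruct *out)
{
    /* A negative return means a parse or type error; otherwise the return
     * value is a bitmask of the fields that were decoded.  Decoded strings
     * point into buf itself, so nothing is copied. */
    return json_obj_parse(buf, len, some_struct_descr,
                          ARRAY_SIZE(some_struct_descr), out) < 0 ? -1 : 0;
}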
This approach has its limitations, but it's pretty efficient -- and safe!
Someone wrote a nice blog post (and even made a video) about it a while back: https://blog.golioth.io/how-to-parse-json-data-in-zephyr/
The opposite is true, too -- you can use the same descriptor to serialize a struct back to JSON.
I've been maintaining it outside Zephyr for a while, although with different constraints (I'm not using it for an embedded system where memory is golden): https://github.com/lpereira/lwan/blob/master/src/samples/tec...
It depends on what you intend to do with the parsed data, where the input comes from, and where the output will go. There are situations where allocations can be reduced or avoided, but not all situations are like that. In some cases you do not need full parsing: e.g. to split an array, you can track whether you are inside a string and what the nesting level is, and then find the commas outside of any array other than the outermost one to split on. And if the input is in memory, you can also consider whether you can modify that memory for parsing, which is sometimes suitable and sometimes not.
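A hypothetical sketch of that array-splitting recipe (names invented for illustration; trimming leading whitespace in the slices is left to the caller):

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Split a JSON array into its top-level elements without full parsing:
 * track string state and nesting depth, and cut at commas that sit
 * directly inside the outermost array.  Elements are printed as slices
 * of the original buffer; nothing is allocated. */
static void split_array(const char *js, size_t len)
{
    int depth = 0;
    bool in_str = false;
    size_t start = 0;

    for (size_t i = 0; i < len; i++) {
        char c = js[i];

        if (in_str) {
            if (c == '\\')
                i++;                     /* skip the escaped character */
            else if (c == '"')
                in_str = false;
        } else if (c == '"') {
            in_str = true;
        } else if (c == '[' || c == '{') {
            if (depth++ == 0)
                start = i + 1;           /* just past the outermost '[' */
        } else if (c == ']' || c == '}') {
            if (--depth == 0 && i > start)
                printf("element: %.*s\n", (int)(i - start), js + start);
        } else if (c == ',' && depth == 1) {
            printf("element: %.*s\n", (int)(i - start), js + start);
            start = i + 1;
        }
    }
}

int main(void)
{
    const char *js = "[1, \"a,b\", [2, 3], {\"k\": \"]\"}]";
    split_array(js, strlen(js));         /* commas inside strings and
                                            nested containers are ignored */
    return 0;
}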
However, for many applications, it will be better to use a binary format (or in some cases, a different text format) rather than JSON or XML.
(For the PostScript binary format, there is no escaping, and the structure does not need to be parsed and converted ahead of time; items in an array are consecutive and fixed-size, and the data they reference (strings and other arrays) is given by an offset, so you can avoid most of the parsing. However, note that key/value lists in the PostScript binary format are nonstandard (even though PostScript does have that type, it has no standard representation in the binary object format), and that PostScript has a better string type than JavaScript but a worse numeric type.)
Yes, you can first validate the buffer, to know it contains valid JSON, and then work with pointers to the beginnings of individual syntactic parts of the JSON, with functions that decide what type the current element is, move to the next element, etc. Even string work (comparisons with other escaped or unescaped strings, etc.) can be done on escaped strings directly, without unescaping them into a buffer first.
Ergonomically, it's pretty much the same as parsing the JSON into some AST first, and then working on the AST. And it can be much faster than dumb parsers that use malloc for individual AST elements.
You can even do JSON path queries on top of this, without allocations.
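The core "move to the next element" primitive can be a single function over the validated buffer. A hypothetical sketch (json_next is an invented name; it assumes validation already succeeded, so there is no error handling):

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Skip one already-validated JSON value, returning a pointer just past it. */
static const char *json_next(const char *p)
{
    while (isspace((unsigned char)*p))
        p++;

    if (*p == '"') {                       /* string */
        for (p++; *p != '"'; p++)
            if (*p == '\\')
                p++;                       /* skip escaped character */
        return p + 1;
    }
    if (*p == '{' || *p == '[') {          /* object or array */
        int depth = 1;
        for (p++; depth > 0; p++) {
            if (*p == '"')
                p = json_next(p) - 1;      /* reuse the string case */
            else if (*p == '{' || *p == '[')
                depth++;
            else if (*p == '}' || *p == ']')
                depth--;
        }
        return p;
    }
    while (*p && !strchr(",]} \t\r\n", *p))  /* number/true/false/null */
        p++;
    return p;
}

int main(void)
{
    const char *js = "[1, \"a]b\", {\"k\": [true, null]}, 4]";
    const char *p = js + 1;               /* just past the opening '[' */
    int count = 0;

    while (*p != ']') {
        p = json_next(p);
        count++;
        while (isspace((unsigned char)*p) || *p == ',')
            p++;
    }
    printf("%d elements\n", count);       /* prints: 4 elements */
    return 0;
}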
E.g. https://xff.cz/git/megatools/tree/lib/sjson.c
Yep, no problem. In-place parsing only requires a stack, whose length is the maximum JSON nesting allowed. I have a C dialect exactly like that.
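To make the "only a stack" point concrete, a sketch (MAX_DEPTH is an arbitrary illustrative limit; real validation checks much more than bracket matching):

#include <stdbool.h>
#include <string.h>

#define MAX_DEPTH 32                     /* bounds the allowed nesting */

/* Check bracket nesting with a fixed stack: no heap, just MAX_DEPTH bytes. */
static int check_nesting(const char *js, size_t len)
{
    char stack[MAX_DEPTH];
    int top = 0;
    bool in_str = false;

    for (size_t i = 0; i < len; i++) {
        char c = js[i];

        if (in_str) {
            if (c == '\\')
                i++;
            else if (c == '"')
                in_str = false;
        } else if (c == '"') {
            in_str = true;
        } else if (c == '[' || c == '{') {
            if (top == MAX_DEPTH)
                return -1;               /* nested too deeply */
            stack[top++] = c;
        } else if (c == ']' || c == '}') {
            if (top == 0)
                return -1;               /* close without open */
            if ((c == ']') != (stack[--top] == '['))
                return -1;               /* mismatched bracket types */
        }
    }
    return (top == 0 && !in_str) ? 0 : -1;
}

int main(void)
{
    const char *js = "{\"a\": [1, [2, {\"b\": 3}]]}";
    return check_nesting(js, strlen(js)) == 0 ? 0 : 1;
}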
Theoretically yes. Practically there is character escaping.
That kills any non-allocation dreams. The moment you have "Hi \uxxxx isn't the UTF nice?" you will probably have to allocate. If the source is read-only, you have to allocate. If the source is mutable, you have to spend CPU rewriting the string.
I'm confused why this would be a problem. UTF-8 and UTF-16 (the only two common Unicode encodings) are at most 4 bytes per character (and for English text, 1 byte in UTF-8 or 2 in UTF-16). The escaped representation you gave is 6 bytes wide: \u00e9, for example, is six bytes in the source, while U+00E9 is two bytes in UTF-8 (0xC3 0xA9). I don't know of many ASCII escape representations that take fewer bytes than their native Unicode encoding.
The same goes for other escapes such as \n, \0, \t, \r, etc. Each is two bytes in the source and one byte in its native representation.
> Moment you have "Hi \uxxxx isn't the UTF nice?" you will probably have to allocate.
Depends on what you are doing with it. If you aren't displaying it (and typically you are not in a server application), you don't need to unescape it.
And this is indeed something that the C++ Glaze library supports, to allow for parsing into a string_view pointing into the original input buffer.
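In C, the same idea might look like comparing against the escaped bytes in the source directly. A hypothetical sketch (json_str_eq is an invented name; \uXXXX handling is omitted):

#include <stdbool.h>
#include <stdio.h>

/* Compare an escaped JSON string (esc points just past the opening quote,
 * into the source buffer) against a plain C string, decoding simple
 * escapes on the fly instead of into a scratch buffer. */
static bool json_str_eq(const char *esc, const char *plain)
{
    while (*esc != '"') {
        char c = *esc++;

        if (c == '\\') {
            switch (*esc++) {
            case 'n':  c = '\n'; break;
            case 't':  c = '\t'; break;
            case 'r':  c = '\r'; break;
            case '"':  c = '"';  break;
            case '\\': c = '\\'; break;
            default:   return false;   /* \uXXXX etc. omitted in sketch */
            }
        }
        if (c != *plain++)
            return false;
    }
    return *plain == '\0';
}

int main(void)
{
    const char *src = "\"line\\none\"";                /* token: "line\none" */
    printf("%d\n", json_str_eq(src + 1, "line\none")); /* prints 1 */
    return 0;
}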
It's just two pointers: the current place to write and the current place to read. Escapes are always more characters than they represent, so there's no danger of overwriting the read pointer. If you support compression this can become somewhat of an issue, but you simply support a max block size, which is usually defined by the compression algorithm anyway.
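A sketch of that two-pointer in-place unescape (simple escapes only; \uXXXX would also fit, since its 6 or 12 source bytes always cover the at-most-4-byte UTF-8 result):

#include <stdio.h>

/* r (read) always stays at or ahead of w (write), because every escape
 * shrinks when decoded, so the string can be rewritten in place. */
static char *unescape_in_place(char *s)
{
    char *r = s, *w = s;

    while (*r) {
        if (*r == '\\') {
            r++;
            switch (*r++) {
            case 'n':  *w++ = '\n'; break;
            case 't':  *w++ = '\t'; break;
            case 'r':  *w++ = '\r'; break;
            case '"':  *w++ = '"';  break;
            case '\\': *w++ = '\\'; break;
            default:   break;        /* \uXXXX etc. omitted in sketch */
            }
        } else {
            *w++ = *r++;             /* plain byte: copy (a no-op until
                                        the first escape is seen) */
        }
    }
    *w = '\0';
    return s;
}

int main(void)
{
    char buf[] = "first\\nsecond";   /* mutable buffer with an escape */
    puts(unescape_in_place(buf));    /* prints two lines */
    return 0;
}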
If you have a place to write, then it's not zero allocation. You did an allocation.
And usually, if you want maximum performance, buffered reads are the way to go, which means you need a write slab allocation.
> If you have a place to write, then it's not zero allocation. You did an allocation.
Where did that allocation happen? You can write into the buffer you're reading from, because the replacement data is shorter than the original data.
You have a read buffer and somewhere where you have to write to.
Even if we pretend that the read buffer is not an allocation (plausible), you will have to allocate the write destination for the general case (think GiB or TiB of XML or JSON).
> You have a read buffer and somewhere where you have to write to.
The "somewhere you have to write to" is the same buffer you are reading from.
Not if you are doing buffered reads, where you replace slow file access with fast memory access. This buffer is cleared every X bytes processed.
Writing to it would either be pointless, because the clears obliterate anything written, or inefficient, because you would somehow have to offset the clears, which would sabotage the performance gains of buffered reading.
Maybe I missed it, but ITT we were talking about C buffers, not buffered reads.
I thought we were talking about high-performance parsing, of which buffered reads are one approach. The other is loading the entire document into mutable memory, which also has its limitations.
> Practically there is character escaping
The voice of experience appears. Upvoted.
It is conceivable to deal with escaping in-place, and thus remain zero-alloc. It's hideous to think about, but I'll bet someone has done it. Dreams are powerful things.
> Can you do parsing of JSON and XML without allocating?
If the source JSON/XML is in a writeable buffer, with some helper functions you can do it. I've done it for a few small-memory systems.