UTF-8 is a regular language (as a subset of all octet strings), so that doesn’t feel like much of a benchmark? Something like TIFF or PECOFF would seem to be a more reasonable standard. (PDF is probably too much to ask, seeing as understanding the structure requires a full Deflate decoder among other things.)
You can handle Deflate with Kaitai, or write custom handlers in Python.
https://doc.kaitai.io/user_guide.html#process https://github.com/kaitai-io/kaitai_compress/blob/master/pyt...
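For a feel of what the custom-handler route involves, here is a minimal sketch in plain Python. It is not Kaitai's API: the container layout (a 4-byte little-endian length followed by that many bytes of zlib-wrapped Deflate data) and the name `read_compressed_block` are assumptions made up for illustration.

```python
import struct
import zlib


def read_compressed_block(buf: bytes, offset: int = 0) -> tuple:
    """Return (decompressed payload, offset just past the block)."""
    # Hypothetical layout: u4le compressed length, then the zlib stream.
    (length,) = struct.unpack_from("<I", buf, offset)
    start = offset + 4
    end = start + length
    if end > len(buf):
        raise ValueError("block length runs past end of buffer")
    return zlib.decompress(buf[start:end]), end


# Build a sample blob: one compressed block followed by unrelated bytes.
payload = b"hello " * 10
packed = zlib.compress(payload)
blob = struct.pack("<I", len(packed)) + packed + b"trailing"

data, next_offset = read_compressed_block(blob)
```

The point is only that the structure after the compressed block cannot be located without the length field, and the structure inside it cannot be seen without running the decompressor.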
As binary formats go, UTF-8 is extremely tame. Some of the complexities that binary formats love to throw at you:
* Things may be non-byte-aligned bitstreams.
* Arrays of structures that go "read until id is 5, but if id is 5, nothing else of the structure is emitted."
* Fields that may be optional if some parent of the current record has some weird value.
* Files may be composed of records at arbitrary, random offsets that essentially require seeking to make any sense of it.
* The metadata of your structure may depend on some early parameter (for example, is this field big-endian or little-endian?)
and so on.
File formats like ELF (supporting ELF32, ELF64, and both little-endian and big-endian, all in a single format definition) or Java class files (long and double entries in the constant pool take up two slots, not one) are a better guideline for how powerful the format is in handling weirder idiosyncrasies.
I found their ELF format specification to have decent coverage, even if not completely exhaustive (e.g. some debug info isn't broken down after a certain point, but that might just be an incomplete spec rather than a limitation of the tool).
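To make the ELF point concrete, here is a rough hand-written Python sketch (not the Kaitai spec, and the function name `parse_elf_start` is made up) of how two early bytes in `e_ident` decide the byte order and field width of everything after them. It reads only the first few header fields.

```python
import struct


def parse_elf_start(buf: bytes) -> dict:
    """Read the first ELF header fields, whose layout depends on e_ident."""
    if buf[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    ei_class, ei_data = buf[4], buf[5]  # 1/2 = 32/64-bit, 1/2 = little/big
    if ei_class not in (1, 2) or ei_data not in (1, 2):
        raise ValueError("unsupported EI_CLASS or EI_DATA")
    endian = "<" if ei_data == 1 else ">"
    # e_type and e_machine are 16-bit fields at offsets 16 and 18.
    e_type, e_machine = struct.unpack_from(endian + "HH", buf, 16)
    # e_entry sits at offset 24: 4 bytes in ELF32, 8 bytes in ELF64.
    entry_fmt = "I" if ei_class == 1 else "Q"
    (e_entry,) = struct.unpack_from(endian + entry_fmt, buf, 24)
    return {
        "bits": 32 if ei_class == 1 else 64,
        "endian": "little" if ei_data == 1 else "big",
        "e_type": e_type,
        "e_machine": e_machine,
        "e_entry": e_entry,
    }


# Synthetic headers for illustration only (not taken from real binaries).
ident64be = b"\x7fELF" + bytes([2, 2, 1]) + bytes(9)
hdr64be = ident64be + struct.pack(">HHIQ", 2, 62, 1, 0x400000)
ident32le = b"\x7fELF" + bytes([1, 1, 1]) + bytes(9)
hdr32le = ident32le + struct.pack("<HHII", 3, 3, 1, 0x8048000)

info64 = parse_elf_start(hdr64be)
info32 = parse_elf_start(hdr32le)
```

A declarative format description has to express the same dependency: later field definitions are parameterised by values read earlier in the same stream.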
> Things may be non-byte-aligned bitstreams.
* https://doc.kaitai.io/user_guide.html#_bit_sized_integers
> Arrays of structures that go "read until id is 5, but if id is 5, nothing else of the structure is emitted."
* https://doc.kaitai.io/user_guide.html#_repetitions
> Fields that may be optional if some parent of the current record has some weird value.
* https://doc.kaitai.io/user_guide.html#do-nothing
> Files may be composed of records at arbitrary, random offsets that essentially require seeking to make any sense of it.
* https://doc.kaitai.io/user_guide.html#_relative_positioning
> The metadata of your structure may depend on some early parameter (for example, is this field big-endian or little-endian?)
* https://doc.kaitai.io/user_guide.html#param-types
* https://doc.kaitai.io/user_guide.html#switch-advanced
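For reference, the "read until id is 5, but the terminator has no body" case quoted above looks roughly like this when written by hand in Python. The record layout (a one-byte id, then a one-byte length and payload for every id except 5) and the name `read_records` are assumptions for illustration, not any real format.

```python
def read_records(buf: bytes, offset: int = 0) -> tuple:
    """Return (list of (id, payload) records, offset just past the terminator)."""
    records = []
    while True:
        if offset >= len(buf):
            raise ValueError("stream ended before terminator record")
        rec_id = buf[offset]
        offset += 1
        if rec_id == 5:
            # The terminator carries an id only: no length, no payload.
            break
        if offset >= len(buf):
            raise ValueError("record is missing its length byte")
        length = buf[offset]
        payload = buf[offset + 1 : offset + 1 + length]
        if len(payload) != length:
            raise ValueError("record payload is truncated")
        records.append((rec_id, payload))
        offset += 1 + length
    return records, offset


# Two ordinary records, the bare terminator, then unrelated trailing data.
stream = bytes([1, 2]) + b"ab" + bytes([7, 0]) + bytes([5]) + b"rest"
records, end = read_records(stream)
```

The awkward part for a declarative description is that the loop condition and the conditional body both hang off the same field, which is what the repetition and conditional features linked above are meant to cover.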
There are ID3 tags used for MP3 and other files. Old players may not know about them and may misread their data as MPEG frames. To prevent this, a tag may break up byte sequences that look like an MPEG frame sync by inserting a 00 byte after an FF byte. Or it may not, because most players are now aware of tags. So there is a preference at two levels: a default, and one for a single tag. Not too hard to program, but rather unfriendly to a grammar-based approach.
(Another example is checksums.)
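To show why the ID3 case is "not too hard to program", here is a minimal Python sketch of the byte-stuffing step as I understand ID3v2 unsynchronisation: a 00 byte is inserted after any FF that would otherwise be followed by a byte with its top three bits set, by 00, or by the end of the data. The function names are made up, and the per-tag and per-frame flags that decide whether the transform was applied are deliberately left out.

```python
def apply_unsync(data: bytes) -> bytes:
    """Insert 0x00 after each 0xFF that could be read as a false frame sync."""
    out = bytearray()
    for i, byte in enumerate(data):
        out.append(byte)
        if byte == 0xFF:
            nxt = data[i + 1] if i + 1 < len(data) else None
            # Stuff when the next byte has its top three bits set, is 0x00
            # (so decoding stays unambiguous), or when 0xFF is the last byte.
            if nxt is None or nxt >= 0xE0 or nxt == 0x00:
                out.append(0x00)
    return bytes(out)


def remove_unsync(data: bytes) -> bytes:
    """Undo apply_unsync by dropping the 0x00 that follows each 0xFF."""
    return data.replace(b"\xff\x00", b"\xff")


sample = bytes([0xFF, 0xE0, 0x00, 0xFF, 0x00, 0x41, 0xFF])
encoded = apply_unsync(sample)
decoded = remove_unsync(encoded)
```

The transform itself is a few lines; what a grammar struggles with is that it is a whole-buffer preprocessing pass, applied or skipped depending on flags read earlier.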