Is this able to represent any binary format? How do things like relative offsets work and such? (basically any non-rigid parts of the format)
It can represent a UTF-8 string, so it can probably represent anything.
UTF-8 is a regular language (as a subset of all octet strings), so that doesn’t feel like much of a benchmark? Something like TIFF or PECOFF would seem to be a more reasonable standard. (PDF is probably too much to ask, seeing as understanding the structure requires a full Deflate decoder among other things.)
You can handle Deflate with Kaitai, or write custom handlers in Python.
https://doc.kaitai.io/user_guide.html#process https://github.com/kaitai-io/kaitai_compress/blob/master/pyt...
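To make "custom handler" a bit more concrete, here is a minimal sketch of what one boils down to, assuming (from my reading of the user guide) that the Python side only needs an object with a `decode(data)` method. The class name `RawDeflate` and the helper `raw_deflate` are made up for illustration and are not part of kaitai_compress.

```python
import zlib


class RawDeflate:
    """Hypothetical handler: bytes in, decompressed bytes out."""

    def decode(self, data: bytes) -> bytes:
        # wbits=-15 selects a raw Deflate stream (no zlib or gzip header).
        return zlib.decompress(data, wbits=-15)


def raw_deflate(payload: bytes) -> bytes:
    """Helper that produces raw Deflate input to try the handler on."""
    compressor = zlib.compressobj(wbits=-15)
    return compressor.compress(payload) + compressor.flush()
```

Something like `RawDeflate().decode(raw_deflate(b"abc"))` should hand back the original bytes; how the handler gets wired into a format definition is covered at the `#process` link above.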
As binary formats go, UTF-8 is extremely tame. Some of the complexities that binary formats love to throw at you:
* Things may be non-byte-aligned bitstreams.
* Arrays of structures that go "read until id is 5, but if id is 5, nothing else of the structure is emitted."
* Fields that may be optional if some parent of the current record has some weird value.
* Files may be composed of records at arbitrary, random offsets that essentially require seeking to make any sense of it.
* The metadata of your structure may depend on some early parameter (for example, is this field big-endian or little-endian?)
and so on.
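To show why a couple of these are awkward for a purely declarative description, here is a hand-written sketch of a made-up format that combines two of them: the byte order is chosen by the first byte, and the record array ends at id 5, where the terminator carries an id but no body. The layout (flag byte, `u2` id, `u4` value) and the name `parse_records` are assumptions for illustration, not taken from any real format.

```python
import struct


def parse_records(buf: bytes) -> list[tuple[int, int]]:
    # Byte 0 is a hypothetical endianness flag: 1 = little-endian, 2 = big-endian.
    prefix = {1: "<", 2: ">"}[buf[0]]
    pos = 1
    records = []
    while True:
        (rec_id,) = struct.unpack_from(prefix + "H", buf, pos)
        pos += 2
        if rec_id == 5:
            # Terminator: the id is present, the rest of the record is not.
            break
        (value,) = struct.unpack_from(prefix + "I", buf, pos)
        pos += 4
        records.append((rec_id, value))
    return records
```

The point is that the shape of every later field depends on something read earlier, which is exactly what a format description language has to be able to express.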
File formats like ELF (supporting ELF32, ELF64, and both little-endian and big-endian, all in a single format definition) or Java class files (long and double entries in the constant pool take up two slots, not one) are a better guideline for how powerful the format is in handling weirder idiosyncrasies.
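For the Java class file quirk, a rough sketch of the constant pool walk looks like this. It only records which tag sits at which index and skips the payloads; the function name `constant_pool_tags` is made up, and the payload sizes are the ones I remember from the JVM spec, so treat it as an illustration rather than a complete parser.

```python
import struct

# Payload size in bytes after the one-byte tag, for fixed-size entries.
FIXED_SIZES = {3: 4, 4: 4, 5: 8, 6: 8, 7: 2, 8: 2, 9: 4, 10: 4, 11: 4,
               12: 4, 15: 3, 16: 2, 17: 4, 18: 4, 19: 2, 20: 2}
TWO_SLOT_TAGS = {5, 6}  # CONSTANT_Long, CONSTANT_Double


def constant_pool_tags(buf: bytes, pos: int = 8) -> dict[int, int]:
    """Map constant pool index -> tag; pos points at the u2 constant_pool_count."""
    (count,) = struct.unpack_from(">H", buf, pos)
    pos += 2
    tags = {}
    index = 1  # index 0 is unused
    while index < count:
        tag = buf[pos]
        pos += 1
        if tag == 1:  # CONSTANT_Utf8: u2 length, then that many bytes
            (length,) = struct.unpack_from(">H", buf, pos)
            pos += 2 + length
        else:
            pos += FIXED_SIZES[tag]
        tags[index] = tag
        # Long and Double occupy two indices; the second one is unusable.
        index += 2 if tag in TWO_SLOT_TAGS else 1
    return tags
```

The array length in the header is therefore not the number of entries you will actually read, which is the part a simple "repeat N times" construct cannot express on its own.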
I found their ELF format specification to have decent coverage, even if it is not completely exhaustive (e.g. some debug info isn't broken down past a certain point, but that might just be incompleteness rather than a limitation).
> Things may be non-byte-aligned bitstreams.
* https://doc.kaitai.io/user_guide.html#_bit_sized_integers
> Arrays of structures that go "read until id is 5, but if id is 5, nothing else of the structure is emitted."
* https://doc.kaitai.io/user_guide.html#_repetitions
> Fields that may be optional if some parent of the current record has some weird value.
* https://doc.kaitai.io/user_guide.html#do-nothing
> Files may be composed of records at arbitrary, random offsets that essentially require seeking to make any sense of it.
* https://doc.kaitai.io/user_guide.html#_relative_positioning
> The metadata of your structure may depend on some early parameter (for example, is this field big-endian or little-endian?)
* https://doc.kaitai.io/user_guide.html#param-types
* https://doc.kaitai.io/user_guide.html#switch-advanced
There are ID3 tags used for MP3 and other files. Old players may not know about them and may misread their data as MPEG frames. To prevent this, a tag may break up byte sequences that look like an MPEG frame sync (an FF byte followed by a byte with its top three bits set) by inserting a 00 byte after the FF. Or it may not, because now most players are aware of tags. So there is a preference, at two levels, default and for a single tag. Not too hard to program, but rather unfriendly to a grammar-based approach.
(Another example is checksums.)
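For the ID3 case, here is a minimal sketch of the transform in both directions, based on my reading of the ID3v2 unsynchronisation scheme: a 00 byte is inserted after any FF that is followed by a byte with its top three bits set (a false frame sync) or by 00 (so the decoder can tell inserted bytes from real ones). The function names are made up, and the handling of a tag that ends in FF is left out.

```python
def unsynchronise(data: bytes) -> bytes:
    """Insert a 00 after each FF that would otherwise look like a frame sync."""
    out = bytearray()
    for i, byte in enumerate(data):
        out.append(byte)
        if byte == 0xFF and i + 1 < len(data):
            nxt = data[i + 1]
            # Top three bits set means FF + nxt reads as an MPEG sync;
            # FF 00 must also be escaped so that decoding is unambiguous.
            if nxt >= 0xE0 or nxt == 0x00:
                out.append(0x00)
    return bytes(out)


def resynchronise(data: bytes) -> bytes:
    """Undo the transform: every FF 00 pair collapses back to FF."""
    return data.replace(b"\xff\x00", b"\xff")
```

Whether `resynchronise` has to run at all depends on a flag read earlier in the tag, so the byte-level content of a field is conditional on metadata, which is the part that is easy in code and awkward in a grammar.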
https://doc.kaitai.io/user_guide.html#_relative_positioning