It can represent a UTF-8 string, so it can probably represent anything.

UTF-8 is a regular language (as a subset of all octet strings), so that doesn’t feel like much of a benchmark? Something like TIFF or PECOFF would seem to be a more reasonable standard. (PDF is probably too much to ask, seeing as understanding the structure requires a full Deflate decoder among other things.)

You can handle Deflate with Kaitai's built-in processing, or write custom handlers in Python.

https://doc.kaitai.io/user_guide.html#process https://github.com/kaitai-io/kaitai_compress/blob/master/pyt...
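For the custom-handler route, a minimal Python sketch might look like the following. The class name `RawDeflate` and the `decode(data)` method shape are assumptions made for illustration, not something copied from the Kaitai docs; the only real API used is the standard library's `zlib`.

```python
import zlib


class RawDeflate:
    """Hypothetical custom handler that inflates a raw Deflate stream."""

    def decode(self, data: bytes) -> bytes:
        # Negative wbits tells zlib to expect no zlib/gzip header or trailer.
        return zlib.decompress(data, -zlib.MAX_WBITS)


# Build a raw Deflate stream to feed the handler.
compressor = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS)
raw = compressor.compress(b"hello hello hello") + compressor.flush()
restored = RawDeflate().decode(raw)
```

The point is only that the decompression step is a plain bytes-in, bytes-out function, so it slots in wherever a format wraps a substructure in Deflate.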

As binary formats go, UTF-8 is extremely tame. Some of the complexities that binary formats love to throw at you:

* Things may be non-byte-aligned bitstreams.

* Arrays of structures that go "read until id is 5, but if id is 5, nothing else of the structure is emitted."

* Fields that may be optional if some parent of the current record has some weird value.

* Files may be composed of records at arbitrary, random offsets that essentially require seeking to make any sense of it.

* The metadata of your structure may depend on some early parameter (for example, is this field big-endian or little-endian?)

and so on.
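To make two of those concrete (the terminator record and the early endianness switch), here is a hand-written sketch for a made-up layout: byte 0 picks the endianness, then records follow as a one-byte id plus a 16-bit value, except that id 5 ends the list and carries no value. The layout and the name `parse_records` are invented for this example.

```python
import struct


def parse_records(buf: bytes) -> list[tuple[int, int]]:
    # Hypothetical layout: byte 0 selects endianness (0 = little, anything else = big).
    endian = "<" if buf[0] == 0 else ">"
    pos = 1
    records = []
    while True:
        rec_id = buf[pos]
        pos += 1
        if rec_id == 5:
            # Terminator: the rest of the structure is simply not present.
            break
        (value,) = struct.unpack_from(endian + "H", buf, pos)
        pos += 2
        records.append((rec_id, value))
    return records


little = parse_records(bytes([0, 1, 0x34, 0x12, 5]))
big = parse_records(bytes([1, 2, 0x12, 0x34, 5]))
```

Trivial in imperative code, but a declarative description needs both a conditional field and a way to carry the endianness choice down into every record.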

File formats like ELF (supporting ELF32, ELF64, and both little-endian and big-endian, all in a single format definition) or Java class files (long and double entries in the constant pool take up two slots, not one) are a better guideline for how powerful a format description language is in handling weirder idiosyncrasies.
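The constant pool quirk can be sketched as a small index-assignment loop. This only models the slot numbering, not the full class file parse; `assign_slots` is a name made up for the example, and the input is just the list of entry tags in file order.

```python
CONSTANT_LONG = 5
CONSTANT_DOUBLE = 6


def assign_slots(tags: list[int]) -> list[tuple[int, int]]:
    """Return (constant pool index, tag) pairs for entries in file order."""
    slots = []
    index = 1  # constant pool indices start at 1, not 0
    for tag in tags:
        slots.append((index, tag))
        # Long and double entries occupy two indices; the second one is unusable.
        index += 2 if tag in (CONSTANT_LONG, CONSTANT_DOUBLE) else 1
    return slots


slots = assign_slots([1, 5, 1, 6, 7])
```

So the index of an entry depends on the tags of everything before it, which is awkward for a description that treats the pool as a plain array.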

I found their ELF format specification to have decent coverage, even if it is not completely exhaustive (e.g. some debug info isn't broken down past a certain point, but that might just be an incomplete spec rather than a limitation of the language).

> Things may be non-byte-aligned bitstreams.

* https://doc.kaitai.io/user_guide.html#_bit_sized_integers

> Arrays of structures that go "read until id is 5, but if id is 5, nothing else of the structure is emitted."

* https://doc.kaitai.io/user_guide.html#_repetitions

> Fields that may be optional if some parent of the current record has some weird value.

* https://doc.kaitai.io/user_guide.html#do-nothing

> Files may be composed of records at arbitrary, random offsets that essentially require seeking to make any sense of it.

* https://doc.kaitai.io/user_guide.html#_relative_positioning

> The metadata of your structure may depend on some early parameter (for example, is this field big-endian or little-endian?)

* https://doc.kaitai.io/user_guide.html#param-types

* https://doc.kaitai.io/user_guide.html#switch-advanced

There are ID3 tags used for MP3 and other files. Old players may not know about them and may misread their data as MPEG frames. To prevent this, a tag may break up byte sequences that look like an MPEG frame sync by inserting a 00 byte after the FF byte ("unsynchronisation"). Or it may not, because most players are now aware of tags. So there is a preference, at two levels: a default, and one for a single tag. Not too hard to program, but rather unfriendly to a grammar-based approach.
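Roughly, the byte-stuffing looks like this in Python. This is a sketch of the scheme as I read the ID3v2 spec (a 00 byte is inserted after any FF that is followed by a byte with the top three bits set, or by 00, or that ends the data); the function names are mine, and the per-tag flag handling is left out.

```python
def unsynchronise(data: bytes) -> bytes:
    out = bytearray()
    for i, byte in enumerate(data):
        out.append(byte)
        if byte == 0xFF:
            nxt = data[i + 1] if i + 1 < len(data) else None
            # FF followed by 111xxxxx would look like frame sync; FF 00 must be
            # escaped too so that decoding stays unambiguous.
            if nxt is None or nxt >= 0xE0 or nxt == 0x00:
                out.append(0x00)
    return bytes(out)


def resynchronise(data: bytes) -> bytes:
    # Every FF 00 pair in stuffed data came from a single FF.
    return data.replace(b"\xff\x00", b"\xff")


stuffed = unsynchronise(b"\xff\xe0\x01\xff\x00")
```

A parser has to know whether to run `resynchronise` before it can even find field boundaries, which is the part that does not fit neatly into a grammar.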

(Checksums are another example.)
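As a sketch of why checksums are awkward, here is a check for a made-up layout where a payload is followed by a 4-byte big-endian CRC-32 of that payload. The layout and the name `split_and_verify` are assumptions for the example; the validity of the trailer field depends on a computation over other bytes, not on its own shape.

```python
import struct
import zlib


def split_and_verify(blob: bytes) -> bytes:
    # Hypothetical layout: payload, then a 4-byte big-endian CRC-32 of the payload.
    payload, trailer = blob[:-4], blob[-4:]
    (stored,) = struct.unpack(">I", trailer)
    if zlib.crc32(payload) != stored:
        raise ValueError("checksum mismatch")
    return payload


good = b"abc" + struct.pack(">I", zlib.crc32(b"abc"))
payload = split_and_verify(good)
```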