Kaitai is absolutely one of my favorite projects. I use it for work (parsing scientific formats, prototyping and exploring those formats, etc.) as well as for fun (reverse engineering games, formats for DOSBox core dumps, etc.).

I gave a guest lecture in a friend's class last week where we used Kaitai to back out the file format used in "Where in Time Is Carmen Sandiego?" and it was a total blast. (For me. Not sure that the class agreed? Maybe.) The Web IDE made this super easy: https://ide.kaitai.io/

(On my YouTube page I've got recordings of streams where I work with Kaitai to do projects like these, but somehow I am not able to work up the courage to link them here.)

I'm curious, how do you use it for game RE?

Not the author, but also in RE.

RE, especially of older and more specialized software, involves dealing with odd undocumented binary formats, which you may have to dissect carefully with a hex editor and a decompiler so that you can get at the data inside.

Kaitai lets you prototype a parser for formats like that on the go, quick and easy.

A shot in the dark, but maybe you could give me a hint. Recently, I was interested in extracting sprites from an old game. I was able to reverse the file format of the data archive, which contained the game assets as files. However, I got stuck because the image files were obviously compressed. By chance, I found an open source reimplementation of the game and realised it was LZ77+Huffman compressed, but how would one detect the type of compression and its parameters with only the file? That seems like a pretty hard problem. Or are there good heuristics for detecting it?

Some simpler cases, like various RLE-type encodings, can be figured out with your pattern-recognizing brain - by staring at them really really hard.
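For a sense of what "simple" looks like once you've spotted it, here's a minimal sketch of a decoder for PackBits-style RLE. PackBits is just one common variant that I'm picking for illustration; the scheme in any particular game may well use different control bytes, and the function name is made up.

```python
def unpackbits(data: bytes) -> bytes:
    """Decode PackBits-style RLE (one common variant, assumed here)."""
    out = bytearray()
    i = 0
    while i < len(data):
        n = data[i]  # control byte
        i += 1
        if n < 128:
            # 0..127: copy the next n + 1 bytes verbatim
            count = n + 1
            out += data[i:i + count]
            i += count
        elif n > 128:
            # 129..255: repeat the next byte 257 - n times
            count = 257 - n
            out += data[i:i + 1] * count
            i += 1
        # n == 128 is a no-op in PackBits
    return bytes(out)


sample = bytes([0xFE, 0xAA, 0x02, 0x80, 0x00, 0x2A])
decoded = unpackbits(sample)
print(decoded.hex())
```

In a hex editor the tell is usually a small control byte followed by either one byte (a run) or a short stretch of "normal looking" data (literals), repeating over and over.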

For harder cases? You take the binaries that read or write your compressed files, load them in your tool (typically Ghidra nowadays), and track down the code that does it.

Then you either recognize what that code does (by staring at it really really hard), or try to re-implement it by hand while reading up on various popular compression algos in hope that doing this enlightens you.

Third option now: feed the decompiled or reimplemented code to the best LLM you have access to, and ask it. Those things are downright superhuman at pattern matching known algorithms, so use them. There's no such thing as "cheating" in RE.

The "hard mode" is compression implemented in hardware, with neither a software encoder nor a software decoder available. In which case you'd better be ready for a lot of "feed data to the magic registers, see results, pray they give you a useful hint" type of blind hardware debugging. Sucks ass.

The "impossible" case is when you have just the compressed binaries, with no encoder or decoder or plaintext data available to you at all. Better hope it's something common or simple enough, or it's fucking hopeless. Solving that kind of thing is a cryptanalysis level of mind fuck and I am neither qualified enough nor insane enough to advise on that.
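Checking for the "something common" case is cheap, at least. Here's a rough sketch that throws Python's stock decompressors at a blob at the first few offsets and reports anything that decodes. The function name, the offset range, and the minimum-output threshold are all arbitrary choices of mine, and raw deflate in particular can produce false positives, so treat hits as hints rather than proof. A custom LZ77+Huffman variant will not show up here at all.

```python
import bz2
import lzma
import zlib

# Stock decoders to try; each takes bytes and returns decoded bytes or raises.
DECODERS = {
    "zlib": lambda d: zlib.decompressobj().decompress(d),
    "raw-deflate": lambda d: zlib.decompressobj(-15).decompress(d),
    "gzip": lambda d: zlib.decompressobj(31).decompress(d),
    "bz2": lambda d: bz2.BZ2Decompressor().decompress(d),
    "lzma/xz": lambda d: lzma.LZMADecompressor().decompress(d),
}


def try_common_codecs(blob: bytes, max_offset: int = 16, min_output: int = 16):
    """Return (codec, offset, decoded_length) for every attempt that decodes."""
    hits = []
    for offset in range(min(max_offset, len(blob))):
        chunk = blob[offset:]
        for name, decode in DECODERS.items():
            try:
                out = decode(chunk)
            except (zlib.error, OSError, lzma.LZMAError, EOFError, ValueError):
                continue  # not this codec at this offset
            if len(out) >= min_output:
                hits.append((name, offset, len(out)))
    return hits


# Hypothetical asset: a 4-byte header followed by a zlib stream.
blob = b"HDR!" + zlib.compress(b"sprite data " * 50)
for hit in try_common_codecs(blob):
    print(hit)
```

Scanning a few offsets matters because game formats often stick a small header (sizes, flags) in front of the compressed stream.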

Another thing. For practical RE? ALWAYS CHECK PRIOR WORK FIRST. You finding an open source reimplementation? Good job, that's what you SHOULD be doing, no irony, that's what you should be doing ALWAYS. Always check whether someone has been there and done that! Always! Check whether someone has worked on this thing, or the older version of it, or another game in the same engine - anything at all. Can save you literal months of banging your head against the wall.

Thanks for your reply and advice! I guess what you describe as "impossible" is the case I am mostly interested in, though more for non-executable binary data. If I am not mistaken, this goes under the term "file fragment classification", but I have been wondering if practitioners might have figured out some better ways than what one can find in scholarly articles.
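One baseline feature that I'm fairly sure shows up both there and in practice is plain byte-level Shannon entropy. A minimal sketch follows, with the caveat that it only tells you "looks compressed or encrypted" versus "looks structured"; it does not identify the algorithm or its parameters. The function name is my own.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Byte-level Shannon entropy in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    n = len(data)
    counts = Counter(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


flat = bytes(range(256)) * 4   # every byte value equally likely
runs = b"\x00" * 1024          # a single repeated value
print(shannon_entropy(flat), shannon_entropy(runs))
```

Values close to 8 bits per byte suggest entropy-coded or encrypted data, while noticeably lower values point towards raw data or something simple like RLE. Computing it over a sliding window can also show where a header ends and the compressed payload begins.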

And yes, searching for the reimplementation beforehand would have saved me some hours :D

It's not about the data being executable. It's about having access to whatever reads or writes this data.

Whatever reads or writes this data has to be able to compress or decompress it. And with any luck, you'll be able to take the compression magic sauce from there.

I understood "binaries" in "compressed binaries" as "executables", e.g. a packed executable, but I see that you do indeed mean a binary file (as opposed to e.g. a text file).

Reread that just now, sorry for not making it clearer. I kind of just used "binaries" in both senses? Hope the context clears it up.