A shot in the dark, but maybe you could give me a hint. Recently, I was interested in extracting sprites from an old game. I was able to reverse the file format of the data archive, which contained the game assets as files. However, I got stuck because the image files were obviously compressed. By chance, I found an open source reimplementation of the game and realised it was LZ77+Huffman compressed, but how would one detect the type of compression and parameters with only the file? That seems a pretty hard problem or are there good heuristics to detect that?
Some simpler cases like various RLE-type encodings can be figured out with that pattern recognizing brain - by staring at them really really hard.
For harder cases? You take the binaries that read or write your compressed files, load them in your tool (typically Ghidra nowadays), and track down the code that does it.
Then you either recognize what that code does (by staring at it really really hard), or try to re-implement it by hand while reading up on various popular compression algos in hope that doing this enlightens you.
Third option now: feed the decompiled or reimplemented code to the best LLM you have access to, and ask it. Those things are downright superhuman at pattern matching known algorithms, so use them, no such thing as "cheating" in RE.
The "hard mode" is compression implemented in hardware, with neither a software encoder or a software decoder available. In which case you better be ready for a lot of "feed data to the magic registers, see results, pray they give you a useful hint" type of blind hardware debugging. Sucks ass.
The "impossible" is when you have just the compressed binaries, with no encoder or decoder or plaintext data available to you at all. Better hope it's something common or simple enough or it's fucking hopeless. Solving that kind of thing is cryptoanalysis level of mind fuck and I am neither qualified enough nor insane enough to advise on that.
Another thing. For practical RE? ALWAYS CHECK PRIOR WORK FIRST. You finding an open source reimplementation? Good job, that's what you SHOULD be doing, no irony, that's what you should be doing ALWAYS. Always check whether someone has been there and done that! Always! Check whether someone has worked on this thing, or the older version of it, or another game in the same engine - anything at all. Can save you literal months of banging your head against the wall.
Thanks for your reply and advice! I guess what you describe as "impossible" is the case I am mostly interested in, though more for non-executable binary data. If I am not mistaken, this goes under the term "file fragment classification", but I have been wondering if practitioners might have figured out some better ways than what one can find in scholarly articles.
And yes, searching for the reimplementation beforehand would have saved me some hours :D
It's not about the data being executable. It's about having access to whatever reads or writes this data.
Whatever reads or writes this data has to be able to compress or decompress it. And with any luck, you'll be able to take the compression magic sauce from there.
I understood "binaries" in "compressed binaries" as "executables", e.g. like a packed executable, but I see that you mean indeed a binary file (and not e.g. a text file).
Reread that just now, sorry for not making it clearer. I kind of just used "binaries" in both senses? Hope the context clears it up.