Okay, so I know back in the day you could choke scanning software (i.e. email attachment scanners) by throwing a zip bomb at it. I believe the software has gotten smarter these days, so it won't simply crash when that happens. But how is this done? How does one detect a zip bomb?
I don't understand the code itself, but here's Debian's patch to detect overlapping zip bombs in `unzip`:
https://sources.debian.org/patches/unzip/6.0-29/23-cve-2019-...
So effectively it seems as though it just keeps track of which parts of the zip file have already been 'used', and if a new entry in the zip file starts in a 'used' section, it fails.
I wonder if this has actually been used for backups in real use cases (think of how LVM or ZFS do snapshotting)?
I.e. an advanced compressor could abuse the zip file format to share base data for files which only incrementally change (get appended to, for instance).
And then this patch would disallow such a practice.
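Here's a minimal sketch of the overlap bookkeeping such a check might use (hypothetical Python, not the actual `unzip` patch, which is in C). Note that it rejects any sharing of bytes between entries, so the incremental-sharing idea above would indeed be refused as well:

```python
# Hypothetical sketch of overlap detection for zip entries (not the real
# unzip patch): each entry claims a byte range [start, end) in the archive,
# and any entry that overlaps an already-claimed range is rejected.

def check_no_overlap(entries):
    """entries: list of (name, start_offset, compressed_size) tuples."""
    claimed = []  # (start, end) ranges already used by earlier entries
    for name, start, size in entries:
        end = start + size
        for (s, e) in claimed:
            if start < e and s < end:  # the two half-open ranges intersect
                raise ValueError(f"entry {name!r} overlaps previously used data")
        claimed.append((start, end))
    return True

# Example: two entries pointing at the same compressed stream (bomb-style)
# check_no_overlap([("a", 100, 5000), ("b", 100, 5000)])  # raises ValueError
```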
For any compression algorithm in general, you keep track of A = {uncompressed bytes processed} and B = {compressed bytes processed} while decompressing, and bail out when either of the following occurs (rough sketch below):
1. A exceeds some unreasonable threshold
2. A/B exceeds some unreasonable threshold
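A codec-agnostic version of that bookkeeping might look something like this in Python; the class name and thresholds are made up for illustration:

```python
# Hypothetical, codec-agnostic bookkeeping for the two checks above.
class BombGuard:
    def __init__(self, max_output=100 * 1024 * 1024, max_ratio=1000.0):
        self.max_output = max_output   # threshold for A
        self.max_ratio = max_ratio     # threshold for A / B
        self.a = 0                     # A: uncompressed bytes produced so far
        self.b = 0                     # B: compressed bytes consumed so far

    def update(self, produced, consumed):
        """Call once per decompression step; raises if a threshold is exceeded."""
        self.a += produced
        self.b += consumed
        if self.a > self.max_output:
            raise ValueError(f"decompressed output exceeds {self.max_output} bytes")
        if self.b and self.a / self.b > self.max_ratio:
            raise ValueError(f"compression ratio exceeds {self.max_ratio}:1")
```

Each decompression step then calls `update()` with the bytes it just produced and consumed.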
In practice one of the things that happens very often is that you compress a file filled with null bytes. Such files compress extremely well, and would trigger your A/B threshold.
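For a rough sense of scale, a zlib round-trip on a megabyte of zeros looks something like this (exact numbers depend on the zlib version and compression level):

```python
import zlib

data = b"\x00" * 1_000_000            # a megabyte of null bytes
compressed = zlib.compress(data)      # default compression level
print(len(compressed))                # around a kilobyte on typical zlib builds
print(len(data) / len(compressed))    # expansion ratio on the order of 1000:1
```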
On the other hand, the zip bomb described in this blog post relies on decompressing the same data multiple times, so it wouldn't necessarily trigger your A/B heuristic.
Finally, A just means "you can't compress more than X bytes with my file format", right? Not a desirable property to have. If the DEFLATE authors had had this idea when they designed the algorithm, I bet files larger than an "unreasonable" 16 MB would be forbidden.
> In practice one of the things that happens very often is that you compress a file filled with null bytes. Such files compress extremely well, and would trigger your A/B threshold.
Sure, if you expect to decompress files with high compression ratios, then you'll want to adjust your knobs accordingly.
> On the other hand, the zip bomb described in this blog post relies on decompressing the same data multiple times, so it wouldn't necessarily trigger your A/B heuristic.
If you decompress the same data multiple times, then you increment A multiple times. The accounting still works regardless of whether the data is the same or different. Perhaps a better description of A and B in my post would be {number of decompressed bytes written} and {number of compressed bytes read}, respectively.
> Finally, A just means "you can't compress more than X bytes with my file format", right? Not a desirable property to have. If the DEFLATE authors had had this idea when they designed the algorithm, I bet files larger than an "unreasonable" 16 MB would be forbidden.
The limitation is imposed by the application, not by the codec itself. The application doing the decompression is supposed to process the input incrementally (in the case of DEFLATE, reading one block at a time and inflating it), updating A and B on each iteration, and aborting if a threshold is violated.
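A hedged sketch of that kind of incremental loop using Python's `zlib` module (the function name, chunk sizes, and thresholds are my own, and this is not `unzip`'s actual code):

```python
import zlib

def inflate_guarded(src, dst, max_out=100 * 1024 * 1024, max_ratio=1000.0):
    """Inflate a zlib stream from file object src into file object dst,
    updating A (bytes written) and B (bytes read) on every iteration and
    aborting as soon as a threshold is violated. Illustrative sketch only."""
    d = zlib.decompressobj()
    a_out = 0              # A: decompressed bytes written so far
    b_in = 0               # B: compressed bytes read so far
    buf = b""
    while not d.eof:
        if not buf:
            buf = src.read(16 * 1024)          # read the input incrementally
            if not buf:
                raise ValueError("truncated stream")
            b_in += len(buf)
        piece = d.decompress(buf, 64 * 1024)   # emit at most 64 KiB per step
        buf = d.unconsumed_tail                # input the codec hasn't used yet
        a_out += len(piece)
        dst.write(piece)
        if a_out > max_out or (b_in and a_out / b_in > max_ratio):
            raise ValueError(f"aborting: {a_out} bytes out from {b_in} bytes in")
```

The same accounting carries over to any codec that exposes a streaming interface; only the decompress call changes.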
Embarrassingly simple for a scanner too: you just mark the file as suspicious when this happens. You can be wrong sometimes, and that's expected.