Hacker News

Many Deflate encoders, including zlib/gzip, are greedy encoders that work forwards, either for speed or to support streaming compression. They emit matches as they find them while scanning forward, with limited lookahead so that a longer match starting one byte later can preempt the shorter match found just before it. There is also an "optimal parse" strategy that maximizes the runs found by processing the entire file backwards.

If you repack the plaintext using zopfli, it does encode the text as you suggest.

Dylan16807 3 days ago

Being greedy is fine here, though. It's the exact same match but somehow cut one byte short.

Also I just tested something. If you stick a random extra letter onto the start of the string, the mistake goes away and the output shrinks by a byte. Is it possibly an issue with finding matches that start at byte 0...?

More testing: Removing the Ts that fail to match to get TOBEORNOTOBEOROBEORNOT shrinks the output by 2 bytes. Changing the first character to ZOBEORNOTOBEOROBEORNOT stays at the same shrunken size. Then removing that Z to get OBEORNOTOBEOROBEORNOT makes it balloon back up as now it fails to match the O twice.
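
For anyone who wants to repeat the size comparison, here is a minimal sketch using Python's zlib module. The choice of a raw Deflate stream (wbits=-15) and compression level 9 is my assumption, not necessarily the setup used for the numbers above, and the helper name raw_deflate is made up for this example. It only prints the compressed sizes so the three strings can be compared directly.

```python
import zlib


def raw_deflate(data: bytes, level: int = 9) -> bytes:
    """Compress data as a raw Deflate stream (no zlib or gzip header)."""
    # wbits=-15 selects raw Deflate with a 32 KiB window.
    compressor = zlib.compressobj(level, zlib.DEFLATED, -15)
    return compressor.compress(data) + compressor.flush()


# The three test strings from the comment above.
samples = [
    b"TOBEORNOTOBEOROBEORNOT",
    b"ZOBEORNOTOBEOROBEORNOT",
    b"OBEORNOTOBEOROBEORNOT",
]

for sample in samples:
    packed = raw_deflate(sample)
    print(f"{sample.decode():<24} {len(sample):>3} bytes in, {len(packed):>3} bytes out")
```

Sizes may differ by zlib version and settings, so treat the output as something to compare against rather than a fixed result.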

If I take the original string and start prepending unique letters, the first one shrinks and fixes the matching, and then each subsequent letter adds one byte. So it's not that matching needs some particular alignment, it's only failing to match the very first byte.

A new test: XPPPPPPPP appears to encode as X, P, then a copy command. And PPPPPPPPP encodes as P, P, then a copy. Super wrong.
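
To make "super wrong" concrete, here is a toy greedy LZ77 parse that is allowed to match against position 0. This is an illustrative sketch, not zlib's algorithm: greedy_parse and its token format are made up for this example, it does a brute-force search instead of hash chains, and it only borrows Deflate's minimum and maximum match lengths (3 and 258) and 32 KiB window.

```python
def greedy_parse(data: bytes, min_len: int = 3, max_len: int = 258,
                 window: int = 32768):
    """Greedy LZ77 parse into ("lit", char) and ("copy", distance, length)."""
    tokens = []
    i = 0
    n = len(data)
    while i < n:
        best_len, best_dist = 0, 0
        # Try every earlier start inside the window, including position 0.
        for start in range(max(0, i - window), i):
            length = 0
            # The source may overlap the bytes being produced (start + length >= i),
            # which is how a run of one repeated byte becomes a single copy.
            while (length < max_len and i + length < n
                   and data[start + length] == data[i + length]):
                length += 1
            # On ties, later starts win, so the nearest match is kept.
            if length > 0 and length >= best_len:
                best_len, best_dist = length, i - start
        if best_len >= min_len:
            tokens.append(("copy", best_dist, best_len))
            i += best_len
        else:
            tokens.append(("lit", chr(data[i])))
            i += 1
    return tokens


print(greedy_parse(b"P" * 9))         # [('lit', 'P'), ('copy', 1, 8)]
print(greedy_parse(b"X" + b"P" * 8))  # [('lit', 'X'), ('lit', 'P'), ('copy', 1, 7)]
```

With the leading X, the toy parse agrees with the X, P, copy sequence above. Without it, a parse that can reference byte 0 needs only one literal P before the copy, which is the byte the real encoders appear to be wasting.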

ack_complete 3 days ago

Interesting, I can also reproduce this. I wonder if it's an artifact of zlib's sliding window setup. The odd part is that if I try various libs with advzip, both the libdeflate and 7z modes show the same artifact; only the zopfli mode is able to avoid it. It doesn't seem to be a format violation, as gzip -t doesn't complain about a copy at position 0.