What does it compress the full 1GB file to? http://prize.hutter1.net/

I tried it on a 100 MB slice of enwik9 and was able to compress it to 20 MB plus a 900 KB transformer, so about 21 MB in total.

I know the top submission was able to get it to 13 MB.

Still trying some ideas to get better compression.

Since you know the size of the file beforehand, you may be able to overfit some kind of text diffusion model instead of a transformer? That might let you partially correct the model output using some other method and then fill in the blanks that were wrong in previous generations.

Oh, sounds interesting. I hadn't considered using a diffusion model for this. My current approach generates the document byte by byte with an autoregressive transformer, so I'm curious how a diffusion model would improve memorization or reconstruction quality.
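For context, one common way to turn a byte-by-byte predictor into a compressor is to entropy-code each byte with the probability the model assigned to it, so the ideal output size is the sum of -log2 p over all bytes. That is an assumption about the general technique, not necessarily what is being done here. A minimal sketch, with a trivial adaptive byte-frequency model standing in for the transformer (the function name is hypothetical):

```python
import math

def ideal_code_length_bits(data: bytes) -> float:
    """Ideal entropy-coded size of `data` in bits under an adaptive order-0 model.

    The model is a stand-in for a real predictor: it only counts how often
    each byte value has been seen so far (with add-one smoothing).
    """
    counts = [1] * 256   # add-one smoothing: every byte value starts with count 1
    total = 256
    bits = 0.0
    for byte in data:
        p = counts[byte] / total   # probability assigned before seeing the byte
        bits += -math.log2(p)      # ideal cost of coding this byte
        counts[byte] += 1          # update the model after coding
        total += 1
    return bits
```

A better predictor assigns higher probability to the bytes that actually occur, which is the only thing that shrinks the total.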

Can you point me to something that I can read? I really want to try this approach; a diffusion model does sound interesting for compression.

Which slice? The large text compression benchmark uses enwik8 for a "smaller" input that is easily reproducible. The predictability of enwik9 can vary significantly depending on where in the file you are, as shown by Matt Mahoney https://www.mattmahoney.net/dc/textdata.html
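A quick way to see how much the choice of slice matters is to compress each slice separately and compare ratios. A rough sketch, using zlib purely as a cheap stand-in for a real compressor; `slice_ratios` and `demo` are hypothetical names, and the demo uses synthetic data rather than enwik9:

```python
import os
import zlib

def slice_ratios(data: bytes, slice_size: int) -> list[float]:
    """Return compressed/original size for each consecutive slice of `data`."""
    ratios = []
    for start in range(0, len(data), slice_size):
        chunk = data[start:start + slice_size]
        ratios.append(len(zlib.compress(chunk, 9)) / len(chunk))
    return ratios

def demo() -> list[float]:
    # Synthetic stand-in for a real file: a repetitive slice, then a random one.
    data = b"a" * 10_000 + os.urandom(10_000)
    return slice_ratios(data, 10_000)
```

On a real file you would read the bytes from disk and pass a slice size such as 100 MB; the point is only that results from different slices are not directly comparable.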

Thanks for the link!

Maybe everyone should compress the 1st 100MB worth of digits of pi, for an apples-to-apples comparison?

Edit: oh wait, that's too easy. We'd need to generate/publish random digits so everyone can use them.

Compressor: Output an empty file.

Decompressor: Take any old algorithm for finding digits of pi, find the first 100M of them, print them.

Compression ratio of 0! :0
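To make the "decompressor" concrete, here is a minimal sketch using Gibbons' unbounded spigot algorithm. It is only an illustration that the whole program fits in a few lines; this particular algorithm would be far too slow for 100M digits, and the helper name `first_digits` is made up for the example.

```python
from itertools import islice

def pi_digits():
    """Yield the decimal digits of pi one at a time (Gibbons' unbounded spigot)."""
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            # The next digit is settled: emit it and rescale the state.
            yield n
            q, r, n = 10 * q, 10 * (r - n * t), (10 * (3 * q + r)) // t - 10 * n
        else:
            # Not enough precision yet: consume one more term of the series.
            q, r, t, k, n, l = (
                q * k,
                (2 * q + r) * l,
                t * l,
                k + 1,
                (q * (7 * k + 2) + r * l) // (t * l),
                l + 2,
            )

def first_digits(count: int) -> str:
    """Return the first `count` digits of pi as a string, without the decimal point."""
    return "".join(str(d) for d in islice(pi_digits(), count))
```

The "compressed file" carries no information at all; everything lives in the decompressor.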

Random digits aren't compressible though?

Random digits are compressible though.

Being random does not mean the data cannot match a pattern in your dictionary, for example.

No, they're not. Do you understand random (the apparent or actual lack of definite patterns or predictability[0]) or compression (reduces bits by identifying and eliminating statistical redundancy[1])?

[0]: https://en.wikipedia.org/wiki/Randomness

[1]: https://en.wikipedia.org/wiki/Data_compression
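This is easy to check empirically. A small sketch, assuming zlib as a representative general-purpose compressor and `os.urandom` as the source of random bytes (the function names are hypothetical):

```python
import os
import zlib

def compressed_ratio(data: bytes) -> float:
    """Compressed size divided by original size (below 1.0 means it shrank)."""
    return len(zlib.compress(data, 9)) / len(data)

def random_ratio(size: int = 1_000_000) -> float:
    # Bytes from the OS random source: no statistical redundancy to remove.
    return compressed_ratio(os.urandom(size))

def repetitive_ratio(repeats: int = 50_000) -> float:
    # Highly redundant text for comparison.
    return compressed_ratio(b"the quick brown fox " * repeats)
```

In practice the random input comes out about the same size or slightly larger because of container overhead, while the repetitive input shrinks to a small fraction of its original size.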

I could write a program to generate the first 100MB of pi in a couple kilobytes. That certainly counts as “data compression” but isn’t useful outside this particular problem instance.

Yeah, because the digits of pi aren’t random.

Over infinite runs, you can't compress random data, but that doesn't mean any finite string of random digits is incompressible.
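The usual counting (pigeonhole) argument makes this precise. A sketch, assuming a fixed lossless compressor on binary strings, with $n$ the input length in bits and $k$ the number of bits saved (both symbols are introduced here for illustration): there are only so many shorter outputs to map to.

```latex
\[
\#\{\text{binary strings of length} \le n-k\}
  = \sum_{i=0}^{n-k} 2^{i}
  = 2^{\,n-k+1} - 1,
\qquad
\frac{2^{\,n-k+1} - 1}{2^{n}} < 2^{\,1-k}.
\]
```

So a given compressor can shorten some particular random strings, but fewer than 1 in $2^{k-1}$ of all $n$-bit strings by $k$ bits or more.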

By this definition, a random dataset could present no apparent patterns while still containing non-apparent ones.

Sounds like presenting no patterns, apparently or otherwise, would be a pattern in itself.

That's like teenage "I am very smart" thinking. I mean, sure, we can look at some string of random bits and say "that looks random", but you can't just generate any old string of random bits to replace it (which would be the only 'pattern' that could be leveraged for compression here). If it's encrypted it'll also appear random, and therefore not be compressible, but you have to encode every byte exactly or the message won't be decryptable.
