Hacker News

What does it compress the full 1GB file to? http://prize.hutter1.net/

I tried it on a 100 MB slice of enwik9 and was able to compress it to 20 MB plus a 900 KB transformer, so about 21 MB in total.

I know the top submission was able to get it to 13 MB.

Still trying some ideas to get better compression.

Since you know the size of the file beforehand, you may be able to overfit some kind of text diffusion model instead of a transformer. That might let you partially correct the model output using some other method, then fill in the blanks that earlier generations got wrong.

spidy__ 3 days ago [ - ]

Oh, sounds interesting. I hadn't considered using a diffusion model for this. My current approach generates the document byte by byte with an autoregressive transformer, so I'm curious how a diffusion model would improve memorization or reconstruction quality.

Can you point me to something I can read? I really wanna try this approach; a diffusion model does sound interesting for compression.
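
For anyone following along: one common way to turn a byte-by-byte predictive model into a compressor is to feed its probabilities to an arithmetic coder, in which case the output size is roughly the sum of -log2(p) over the bytes that actually occurred. Whether the transformer here is used that way is an assumption, not something stated in the thread. A minimal sketch of that size estimate, with a simple adaptive byte-frequency model standing in for the transformer (the function name is made up for illustration):

```python
import math


def ideal_code_length_bits(data: bytes) -> float:
    """Bits an ideal arithmetic coder would spend under an adaptive order-0 byte model.

    The frequency model is only a stand-in for a learned predictor.
    """
    counts = [1] * 256  # every byte value starts with count 1, so no probability is zero
    total = 256
    bits = 0.0
    for b in data:
        p = counts[b] / total  # probability the model gave to the byte that occurred
        bits += -math.log2(p)  # ideal cost of coding that byte
        counts[b] += 1         # update after coding, exactly as a decoder could
        total += 1
    return bits


sample = b"abracadabra " * 1000
bits = ideal_code_length_bits(sample)
print(f"{len(sample)} bytes -> about {bits / 8:.0f} bytes")
```

A better predictor assigns higher probability to the next byte, which directly shrinks the total. The model itself still has to be shipped to the decoder, which is why the 900 KB gets added to the 20 MB.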

atiedebee 3 days ago [ - ]

Which slice? The large text compression benchmark uses enwik8 for a "smaller" input that is easily reproducible. The predictability of enwik9 can vary significantly depending on where in the file you are, as shown by Matt Mahoney https://www.mattmahoney.net/dc/textdata.html
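
A rough way to see that kind of variation is to compress consecutive slices separately and compare the ratios. The sketch below uses zlib purely as a cheap proxy for predictability and runs on synthetic data; `slice_ratios` and the sample data are made up for illustration, and nothing here reproduces Mahoney's analysis.

```python
import random
import zlib


def slice_ratios(data: bytes, slice_size: int) -> list[float]:
    """Return compressed/original size for each consecutive slice of `data`."""
    ratios = []
    for start in range(0, len(data), slice_size):
        chunk = data[start:start + slice_size]
        ratios.append(len(zlib.compress(chunk, 9)) / len(chunk))
    return ratios


# Synthetic stand-in for a file whose character changes partway through:
# a highly repetitive first part followed by a more varied second part.
rng = random.Random(0)
words = [b"alpha", b"beta", b"gamma", b"delta", b"epsilon", b"zeta", b"eta", b"theta"]
repetitive = b"the quick brown fox " * 5000
varied = b" ".join(rng.choice(words) for _ in range(20000))
data = repetitive + varied

ratios = slice_ratios(data, 50_000)
print([round(r, 3) for r in ratios])
```

To try it on real data, read the bytes from a local copy of the file and pass them in with a larger slice size, such as 100 MB.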

purple-leafy 3 days ago [ - ]

Thanks for the link!

cellular 3 days ago [ - ]

Maybe everyone should compress the 1st 100MB worth of digits of pi, for an apples-to-apples comparison?

Edit: oh wait, that's too easy. We'd need to generate/publish random digits so everyone can use them.

branc116 3 days ago [ - ]

Compressor: Output an empty file.

Decompressor: Take any old algorithm for finding digits of pi, find the first 100M of them, print them.

Compression ratio of 0! :0

saulpw 3 days ago [ - ]

Random digits aren't compressible, though?

SV_BubbleTime 3 days ago [ - ]

Random digits are compressible though.

Data being random does not mean it can't match a pattern in your dictionary, for example.

gnabgib 3 days ago [ - ]

No.. they're not. Do you understand random (the apparent or actual lack of definite patterns or predictability[0]) or compression (reduces bits by identifying and eliminating statistical redundancy[1])?

[0]: https://en.wikipedia.org/wiki/Randomness

[1]: https://en.wikipedia.org/wiki/Data_compression
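
This is easy to check empirically with a general-purpose compressor. A small sketch, assuming Python 3.9+ for `randbytes`, with zlib as the example compressor and the sizes and sample text picked arbitrarily:

```python
import random
import zlib

rng = random.Random(42)                  # fixed seed so the run is repeatable
random_bytes = rng.randbytes(1_000_000)  # no statistical redundancy for zlib to find
repetitive = b"to be or not to be, that is the question. " * 25_000

for name, data in [("random", random_bytes), ("repetitive", repetitive)]:
    packed = zlib.compress(data, 9)
    print(f"{name}: {len(data)} -> {len(packed)} bytes")
```

The random input should come out about the same size as it went in, while the repetitive input shrinks dramatically. Note that these "random" bytes come from a seeded generator, so a short description does exist (the seed plus the generator); zlib simply has no way to find it.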

gcr 3 days ago [ - ]

I could write a program to generate the first 100MB of pi in a couple kilobytes. That certainly counts as “data compression” but isn’t useful outside this particular problem instance.
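
For scale, here is a sketch of such a generator using Gibbons' unbounded spigot algorithm, a well-known streaming method for pi digits. It is an illustration only: the big integers grow as it runs, so it would be far too slow for 100M digits, and a practical version would use a faster algorithm.

```python
from itertools import islice


def pi_digits():
    """Yield the decimal digits of pi one at a time (Gibbons' spigot algorithm)."""
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            # The next digit is settled: emit it and rescale the state.
            yield n
            q, r, n = 10 * q, 10 * (r - n * t), (10 * (3 * q + r)) // t - 10 * n
        else:
            # Not enough precision yet: absorb another term of the series.
            q, r, t, k, n, l = (
                q * k,
                (2 * q + r) * l,
                t * l,
                k + 1,
                (q * (7 * k + 2) + r * l) // (t * l),
                l + 2,
            )


print("".join(str(d) for d in islice(pi_digits(), 20)))
```

The whole "decompressor" is a couple hundred bytes of source, which is the point: the digits of pi have a very short description even though they look statistically random.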

echoangle 2 days ago [ - ]

Yeah, because the digits of pi aren’t random.

IncreasePosts 3 days ago [ - ]

Over infinite runs, you can't compress random data, but that doesn't mean every finite string of random digits is incompressible.
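
The standard counting (pigeonhole) argument puts a number on this: a lossless code maps distinct inputs to distinct outputs, and there are only 2^(m+1) - 1 binary strings of length m or less, so only a small share of n-bit strings can be shortened by many bits. A sketch of that bound (the function name is made up for illustration, and it assumes saved_bits <= n_bits):

```python
from fractions import Fraction


def max_fraction_compressible(n_bits: int, saved_bits: int) -> Fraction:
    """Upper bound on the share of n-bit strings that any lossless code
    can shorten by at least `saved_bits` bits."""
    m = n_bits - saved_bits           # longest output length allowed
    short_outputs = 2 ** (m + 1) - 1  # number of binary strings of length 0..m
    return Fraction(short_outputs, 2 ** n_bits)


# At most about 1 in 128 of all 64-bit strings can be shortened by a full byte.
print(float(max_fraction_compressible(64, 8)))
```

So some finite random strings are compressible, but the chance of drawing one that compresses by k bits is below 2^(1-k), whatever the compressor.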

thin_carapace 3 days ago [ - ]

By this definition, a random dataset could present no apparent patterns while still containing non-apparent ones.

ufocia 3 days ago [ - ]

Sounds like presenting no patterns, apparently or otherwise, would be a pattern in itself.

saulpw 3 days ago [ - ]

That's like teenage "I am very smart" thinking. I mean, sure, we can look at some string of random bits and say "that looks random", but you can't just generate any old string of random bits to replace it (which would be the only 'pattern' that could be leveraged for compression here). If it's encrypted it'll also appear random, and therefore not be compressible, but you have to encode every byte exactly or the message won't be decryptable.
