I tried it on a 100 MB slice of enwik9 and was able to compress it to 20 MB plus a 900 KB transformer, so about 21 MB total.
I know the top submission was able to get it to 13 MB.
I'm still trying some ideas to get better compression.
Since you know the size of the file beforehand, you may be able to overfit some kind of text diffusion model instead of a transformer. That may allow you to partially correct the model output using some other method and then fill in the blanks that were wrong from previous generations.
Oh, sounds interesting. I hadn't considered using a diffusion model for this. My current approach generates the document byte by byte with an autoregressive transformer, so I'm curious how a diffusion model would improve memorization or reconstruction quality.
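For anyone following along, here is a rough sketch of why byte-by-byte prediction maps to compressed size: an arithmetic coder driven by a predictive model spends roughly -log2(p) bits on each byte, where p is the probability the model gave to the byte that actually occurred. The adaptive order-0 count model below is only a stand-in assumption, not the transformer described above, and `adaptive_code_length_bits` is a made-up name for illustration.

```python
import math

def adaptive_code_length_bits(data: bytes) -> float:
    """Ideal code length in bits of `data` under an adaptive order-0 byte model.

    Stand-in for any autoregressive predictor: each byte costs -log2(p),
    where p is the probability assigned to it before it was seen.
    """
    counts = [1] * 256  # start every byte value at count 1 (Laplace smoothing)
    total = 256
    bits = 0.0
    for b in data:
        bits += -math.log2(counts[b] / total)
        counts[b] += 1  # update the model only after "coding" the byte
        total += 1
    return bits

sample = b"abracadabra " * 200
bits = adaptive_code_length_bits(sample)
print(f"{bits / 8:.0f} bytes estimated vs {len(sample)} raw "
      f"({bits / len(sample):.2f} bits per byte)")
```

A better predictor lowers the sum, which is why the model's own size (the 900 KB above) has to be added to the total for a fair comparison.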
Can you point me to something that I can read? I really want to try this approach; a diffusion model does sound interesting for compression.
Which slice? The large text compression benchmark uses enwik8 for a "smaller" input that is easily reproducible. The predictability of enwik9 can vary significantly depending on where in the file you are, as shown by Matt Mahoney https://www.mattmahoney.net/dc/textdata.html
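One quick way to check how much a given slice matters is to compress consecutive slices with a general-purpose compressor and compare the ratios. This is a minimal sketch using zlib as a crude proxy for predictability; `slice_ratios` is a hypothetical helper, and the demo input is synthetic rather than real enwik9 data.

```python
import os
import zlib

def slice_ratios(data: bytes, slice_size: int) -> list[float]:
    """Compressed/raw size ratio for each consecutive slice of `data`."""
    ratios = []
    for start in range(0, len(data), slice_size):
        chunk = data[start:start + slice_size]
        ratios.append(len(zlib.compress(chunk, 9)) / len(chunk))
    return ratios

# Synthetic stand-in: a highly repetitive half followed by a random half.
# For a real run, read the file instead (the path here is an assumption):
#   data = open("enwik9", "rb").read()
demo = b"the quick brown fox " * 5000 + os.urandom(100_000)
for i, ratio in enumerate(slice_ratios(demo, 50_000)):
    print(f"slice {i}: {ratio:.3f}")
```

If the ratios differ a lot between slices, results on one 100 MB slice are not directly comparable to results on another, or to numbers reported for the whole file.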