Three questions:

1. How much was AI used to generate documentation for this project?

2. The 100MB CSV data sources are not provided in the repo so it doesn't seem possible to reproduce your results. The enwik9 dataset says it is a "slice" of the larger data set, and there are many NYC taxi trip record datasets that exist. Can you provide the datasets used to generate your results?

3. I am surprised to see performance comparisons only between your transformer and WinZIP. What were your results when comparing your transformer to more modern approaches like LZMA2 (level 9), BZIP2 and ZPAQ (max effort)?

1. I wrote the content myself, covering what I wanted the documentation to say, and only used AI to polish it so that it's easy to understand. Is the documentation hard to understand right now?

2. I have added the link for downloading both the enwik9 slice and the NYC dataset. Apologies, I forgot to add it.

You can get it from here - https://github.com/samyak112/pym-particles/blob/main/README....

3. Other than zip, I tested it against zstd at level 19, and now that you mentioned them, LZMA2 and BZIP2 as well.

On the enwik9 100MB slice I got:

zstd - 28MB

bzip2 - 30MB

lzma2 - 26MB

I will mention these and results from ZPAQ in the readme for both files, thanks for pointing them out!!!

But the thing is, this neural compression approach can't be used in practice right now: it takes hours to compress and decompress a 100MB file, so it's not really usable and is more of a fun project.

Appreciate the followup!!!

There were a few tells of AI, based on my own use of AI for personal projects, especially the end section titled "what I tried that didn't work". I've seen Claude put sections like that in documentation.

Big thanks for linking the datasets. Here are my results from ZPAQ (max effort):

./zpaq.exe add archive.zpaq nyc_taxi_dataset_100mb_slice.txt -m5

Time: 218 seconds

Final size: 9.57MB

./zpaq.exe add archive.zpaq enwik_9_slice_100mb.txt -m5

Time: 199 seconds

Final size: 20.46MB

So, your approach is comparable to ZPAQ for the wiki dataset and achieves a better compression ratio than ZPAQ with the taxi dataset. Cool!
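For anyone who wants the ZPAQ numbers above as ratios, here is a rough sketch. It assumes each input is exactly 100 MB (the files are only described as 100MB slices, so the true sizes may differ slightly), and the function names are just illustrative.

```python
def compression_ratio(original_mb: float, compressed_mb: float) -> float:
    """Original size divided by compressed size (higher is better)."""
    return original_mb / compressed_mb


def bits_per_byte(original_mb: float, compressed_mb: float) -> float:
    """Average number of output bits spent per input byte (lower is better)."""
    return 8 * compressed_mb / original_mb


# Assumption: both slices are exactly 100 MB.
ORIGINAL_MB = 100.0

# Final sizes reported for zpaq -m5 in this comment.
zpaq_sizes_mb = {
    "nyc_taxi_dataset_100mb_slice.txt": 9.57,
    "enwik_9_slice_100mb.txt": 20.46,
}

for name, size_mb in zpaq_sizes_mb.items():
    ratio = compression_ratio(ORIGINAL_MB, size_mb)
    bpb = bits_per_byte(ORIGINAL_MB, size_mb)
    print(f"{name}: ratio {ratio:.2f}, {bpb:.3f} bits per byte")
```

Reporting bits per byte alongside the ratio makes results easier to compare with published text compression benchmarks, which often use that unit.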

Bit of a tangent, but if you're curious, there's an interesting writeup from a few years back that compares lossless text compression algorithms at various effort levels (speed vs compression ratio). I read it recently, which prompted all of my questions:

https://giannirosato.com/blog/post/lossless-data-comp/

Ohh, that "what I tried that didn't work" section is one I specifically wrote myself (and then polished with AI) because I wanted to document the different approaches I tried to improve compression that failed.

Also, thanks for the reference, it looks like an interesting read.

These algorithms let you specify a compression level; please note in the docs which one you used. The window size can also be adjusted. Zstd defaults to level 3, which is "goodish compression but fast".
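As a rough illustration of recording the exact setting next to each result, here is a minimal sketch using only Python's standard library codecs (zlib, bz2, and lzma, which writes .xz with the LZMA2 filter by default). It is not the project's benchmark code; `compressed_sizes` and its parameters are hypothetical names, and zstd is left out because it is not available in the standard library on older Python versions.

```python
import bz2
import lzma
import zlib


def compressed_sizes(data: bytes, level: int = 9, xz_preset: int = 9) -> dict[str, int]:
    """Compress `data` with three stdlib codecs.

    Returns compressed sizes in bytes, keyed by a label that spells out
    the exact level or preset used, so the setting travels with the number.
    """
    return {
        # zlib levels run from 0 to 9.
        f"zlib level={level}": len(zlib.compress(data, level)),
        # bz2 levels run from 1 to 9 and set the block size.
        f"bz2 compresslevel={level}": len(bz2.compress(data, compresslevel=level)),
        # xz presets run from 0 to 9; higher presets use a larger dictionary.
        f"xz/LZMA2 preset={xz_preset}": len(lzma.compress(data, preset=xz_preset)),
    }
```

Calling it on the bytes of a test file at two different settings, for example `level=1, xz_preset=1` and then the defaults, shows how much the reported size depends on the level, which is why the level belongs in the results table.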

I tried zstd at level 19.