Appreciate the follow-up!

There were a few tells of AI, based on my own use of AI for personal projects, especially the end section titled "what I tried that didn't work". I've seen Claude put sections like that in documentation.

Big thanks for linking the datasets. Here are my results from ZPAQ (max effort):

./zpaq.exe add archive.zpaq nyc_taxi_dataset_100mb_slice.txt -m5

Time: 218 seconds

Final size: 9.57MB

./zpaq.exe add archive.zpaq enwik_9_slice_100mb.txt -m5

Time: 199 seconds

Final size: 20.46MB

So, your approach is comparable to ZPAQ for the wiki dataset and achieves a better compression ratio than ZPAQ with the taxi dataset. Cool!
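For anyone who wants to put those numbers on the same footing, here's a rough sketch that turns the sizes and times above into a compression ratio and throughput. It assumes each input slice is exactly 100 MB, which is only what the file names suggest (I didn't record the exact byte counts), and the function names are just made up for illustration.

```python
# Assumption: each input slice is exactly 100 MB, as the file names suggest.
INPUT_MB = 100.0


def compression_ratio(original_mb: float, compressed_mb: float) -> float:
    """Original size divided by compressed size (higher is better)."""
    return original_mb / compressed_mb


def throughput_mb_per_s(original_mb: float, seconds: float) -> float:
    """Input megabytes processed per second of wall-clock time."""
    return original_mb / seconds


# ZPAQ -m5 results reported above.
results = {
    "nyc_taxi": {"compressed_mb": 9.57, "seconds": 218},
    "enwik9": {"compressed_mb": 20.46, "seconds": 199},
}

for name, r in results.items():
    ratio = compression_ratio(INPUT_MB, r["compressed_mb"])
    speed = throughput_mb_per_s(INPUT_MB, r["seconds"])
    print(f"{name}: ratio {ratio:.2f}x, {speed:.2f} MB/s")
```

Under that 100 MB assumption, ZPAQ comes out at roughly 10.4x on the taxi slice and 4.9x on the enwik9 slice, at around half a megabyte per second for both.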

Bit of a tangent, but if you're curious, there's an interesting writeup from a few years back that compares lossless text compression algorithms at various effort levels (speed vs. compression ratio). I read it recently, which is what prompted all of my questions:

https://giannirosato.com/blog/post/lossless-data-comp/

Ohh, that "what I tried that didn't work" section was one I specifically wrote myself (and then polished with AI), because I wanted to document the different approaches I tried to get better compression that failed.

Also, thanks for the reference, looks like an interesting read.