Appreciate the follow-up!
A few things read as AI tells to me, based on my own use of AI for personal projects, especially the end section titled "what I tried that didn't work". I've seen Claude put sections like that in documentation.
Big thanks for linking the datasets. Here are my results from ZPAQ (max effort):
./zpaq.exe add archive.zpaq nyc_taxi_dataset_100mb_slice.txt -m5
Time: 218 seconds
Final size: 9.57MB
./zpaq.exe add archive.zpaq enwik_9_slice_100mb.txt -m5
Time: 199 seconds
Final size: 20.46MB
So, your approach is comparable to ZPAQ for the wiki dataset and achieves a better compression ratio than ZPAQ with the taxi dataset. Cool!
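For anyone who wants those ZPAQ numbers as ratios rather than raw sizes, here's a rough sketch of the arithmetic. It assumes each input slice is exactly 100 MB (going by the file names; I haven't checked the exact byte counts), and the `summarize` helper and run labels are just names made up for illustration.

```python
def summarize(original_mb: float, compressed_mb: float, seconds: float) -> dict:
    """Derive compression ratio, space savings, and throughput for one run."""
    return {
        "ratio": original_mb / compressed_mb,
        "savings_pct": 100.0 * (1.0 - compressed_mb / original_mb),
        "mb_per_s": original_mb / seconds,
    }


# Assumption: each input slice is exactly 100 MB, inferred from the file names.
ASSUMED_INPUT_MB = 100.0

# (compressed size in MB, wall time in seconds) as reported for zpaq -m5.
runs = {
    "nyc_taxi": (9.57, 218),
    "enwik9": (20.46, 199),
}

for name, (compressed_mb, seconds) in runs.items():
    s = summarize(ASSUMED_INPUT_MB, compressed_mb, seconds)
    print(
        f"{name}: {s['ratio']:.2f}x, "
        f"{s['savings_pct']:.1f}% saved, "
        f"{s['mb_per_s']:.2f} MB/s"
    )
```

Under that 100 MB assumption, the taxi slice comes out at a bit over 10x and the wiki slice at a bit under 5x, both at roughly half a megabyte per second.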
Bit of a tangent, but if you're curious, there's an interesting writeup from a few years back that compares lossless text compression algorithms at various effort levels (speed vs. compression ratio). I read it recently, which is what prompted all of my questions.
Ohh, that "what I tried that didn't work" section was one I specifically wrote myself (and then polished with AI), because I wanted to document the different approaches I tried for getting more compression that didn't pan out.
Also, thanks for the reference. It looks like an interesting read.