Also, just to clarify, I scanned all 7488 pages in personally (Fujitsu ScanSnap ix500). With Claude's help, I found some undocumented SANE features to auto-crop and clean up the scans, then had a Python script on Linux auto-scan them and load them into a Postgres database as I went. Other scripts added transcription and summaries, and auto-indexed everything.
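For anyone curious what that kind of scan-and-ingest loop looks like, here's a minimal sketch. The device alias, `scanimage` options, table name, and columns are all my assumptions, not the actual setup (and the undocumented auto-crop options aren't reproduced here):

```python
import subprocess
from pathlib import Path

def scan_command(page_num: int, out_dir: Path, dpi: int = 300) -> list[str]:
    """Build a scanimage invocation for one page.
    The device alias is a placeholder; real hardware was a ScanSnap ix500."""
    out = out_dir / f"page_{page_num:05d}.png"
    return [
        "scanimage",
        "--device-name=fujitsu",   # assumed device alias
        f"--resolution={dpi}",
        "--format=png",
        f"--output-file={out}",
    ]

def record_page(conn, page_num: int, image_path: Path) -> None:
    """Insert one scanned page into Postgres.
    `conn` is a psycopg2-style connection; schema is a guess."""
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO pages (page_num, image_path) VALUES (%s, %s)",
            (page_num, str(image_path)),
        )
    conn.commit()
```

In practice you'd run `subprocess.run(scan_command(...), check=True)` in a loop and call `record_page` after each successful scan.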
"mistral-ocr-latest" did really good handwriting transcription, considering how tight and small some of the handwriting is. Then back to Claude API calls to summarize by month and collect people and places from all of the entires.
Claude then created static html pages from what started as a Flask app. Published on Dreamhost.
Oh boy. #3 on front page, 19k page hits in the first hour. 8243 static html pages, 15728 webp images (10k-50k each).
I've never had one of my sites get this much traffic. With everything served as static files, the site is still holding up. Thank you all.
Before your server catches fire and burns down the originals: please also send them to archive.org
A fresh training dataset ;-)
Yeah. If there are groups that want the high resolution images, talk to me.
Could consider putting it up as a dataset on Kaggle, perhaps? I would think they'd provide hosting for such things?
Archive.org would be another option as a repository for the high-res scans in an accessible / discoverable location.
That's amazing!
I'm working on a somewhat similar project (documenting bank runs from historical newspapers) and also opted for Claude to build a static website. Crazy that the two sites have a very similar look and feel: https://www.finhist.com/bank-runs/index.html. The only big difference is that mine lacks a map, which I hope to fix soon (I already have lat and lon and am linking to Google Maps).
PS: Do you know if Mistral works better at OCRing handwritten text than Gemini 3? I was planning on going with Gemini 3 for another project.
That's cool! I've noticed that when asking Claude for a website, it does have a certain look, like our two sites, if you don't give it any more guidance. I'm not sure if that's a good thing or not.
Digitizing history in different ways, with resources that are unique or only known to small groups, might be a new development area, and that's exciting. As I've shown, and as other people have shared, using AI tools to digitize things that weren't practical before is now possible. Are there ways to make this easier for everybody? New techniques to discuss? I don't know, and I'd love to talk about it.
Concerning OCR: I used Mistral because of a posting here a month or so ago describing advancements in handwriting recognition. I didn't actually compare them. And my setup lets me rerun everything later if there are advancements in the area. Again, another area to keep track of and discuss.
Thanks for the insights! I'll try Mistral as well. Gemini has worked well for me so far, but which model is SOTA changes quite frequently these days.
This is great! I love it when people take bits of history that would be forgotten and put them out in the world (to be further vacuumed up by the Internet Archive). Thank you for doing it.
Beej! Thank you very much! Your networking guides have long been a great contribution to everybody, and collectively improve what we know.
These diary pages come largely from Stirling City, just north of Chico, and later from the Hat Creek district, on Hwy 89 north of Mt. Lassen. Many nearby historical records were lost in the Camp Fire in Paradise, and this project is a test run for digitizing records held in some of the local museums.
Very cool! I'm struggling to recall if I ran into you at CSUC, though--close timing. My memory isn't what it used to be. :)