Hacker News

Yes! We used our friends at Reducto (https://reducto.ai/) for all document extraction and parsing (one of the best companies I've ever referred to YC ;) )

We did an initial parsing pass of all four DOJ document batches on Friday. This takes a raw PDF and returns chunks containing typed blocks—each with a type (Title, Text, Figure, etc.), bounding boxes, content, and confidence scores. For PDFs that were just scans of photographs (which was like 90% of new content in Friday's release), it gave in depth descriptions of those! You can type search terms like "door" at https://www.jmail.world/photos to see what I mean.

For apps like Jmail and JFlights we use their structured extraction endpoint instead—you define a schema (e.g. {from, to, subject, date, body} for emails or {departure_airport, arrival_airport, passengers[], date} for flights) and it pulls those fields directly into JSON.

The JFlights example served as the best ad for Reducto and how doc parsing technology can speed up hours of journalistic investigations like this.

See for yourself. Given this document

https://www.jmail.world/drive/HOUSE_OVERSIGHT_002031

It inferred and enriched multiple flight cards on JFlights (https://www.jmail.world/flights). I was really shook when I first saw this.