Thanks! And it's a lot of info, yeah. ~90% of new data in yesterday's drop was photographs, which they redacted for us.
The House Oversight Committee's giant drop in November had tons of data we still didn't take advantage of even after doing the original Jmail, like flight logs.
For the Yahoo release, which is still ongoing, the folks at Drop Site News (see https://www.jmail.world/about) are handling the manual redaction which has been very time consuming, even with tons of AI to help in the background.
Would be nice to explain at some point how we did the structuring of the destructured data.
For now we’re focusing on fixing the bugs because we’re already seeing an insane wave of traffic so most of us are focused on keeping the site alive.
Hey, I’d be interested in your thoughts on this, or the key ideas/research results you relied on:
Yes! We used our friends at Reducto (https://reducto.ai/) for all document extraction and parsing (one of the best companies I've ever referred to YC ;) )
We did an initial parsing pass of all four DOJ document batches on Friday. This takes a raw PDF and returns chunks containing typed blocks, each with a type (Title, Text, Figure, etc.), bounding boxes, content, and confidence scores. For PDFs that were just scans of photographs (which was like 90% of new content in Friday's release), it gave in-depth descriptions of those! You can type search terms like "door" at https://www.jmail.world/photos to see what I mean.
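To make the shape of that parse output concrete, here is a rough sketch in plain Python. It is not Reducto's actual response format or client API: the class names, the bounding-box layout, the numeric confidence, and the sample data are all assumptions made up for illustration. It only shows how typed blocks with descriptions could back a keyword search like the one on the photos page.

```python
from dataclasses import dataclass


@dataclass
class Block:
    type: str          # e.g. "Title", "Text", "Figure"
    bbox: tuple        # assumed layout: (left, top, width, height)
    content: str       # text, or a generated description for figures
    confidence: float  # assumed to be a number between 0 and 1


@dataclass
class Chunk:
    blocks: list


def search_figures(chunks, term, min_confidence=0.5):
    """Return Figure blocks whose description mentions `term`."""
    term = term.lower()
    return [
        block
        for chunk in chunks
        for block in chunk.blocks
        if block.type == "Figure"
        and block.confidence >= min_confidence
        and term in block.content.lower()
    ]


# Made-up sample data, not taken from any real document.
chunks = [
    Chunk(blocks=[
        Block("Title", (0.1, 0.05, 0.8, 0.05), "Exhibit 12", 0.98),
        Block("Figure", (0.1, 0.15, 0.8, 0.6),
              "A wooden door at the end of a hallway", 0.91),
    ]),
    Chunk(blocks=[
        Block("Figure", (0.1, 0.1, 0.8, 0.7), "A desk with a lamp", 0.88),
    ]),
]

hits = search_figures(chunks, "door")
```

Filtering on block type and confidence before matching keeps low-quality descriptions out of the results; the 0.5 threshold is arbitrary.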
For apps like Jmail and JFlights we use their structured extraction endpoint instead—you define a schema (e.g. {from, to, subject, date, body} for emails or {departure_airport, arrival_airport, passengers[], date} for flights) and it pulls those fields directly into JSON.
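A minimal sketch of what "define a schema and get JSON back" can look like on the consuming side. The field names come from the comment above; everything else (the Python types chosen for each field, the helper name, the placeholder record) is an assumption for illustration and says nothing about how the real extraction endpoint is called.

```python
import json

# Field names from the comment above; the Python types are assumptions.
EMAIL_SCHEMA = {"from": str, "to": str, "subject": str, "date": str, "body": str}
FLIGHT_SCHEMA = {
    "departure_airport": str,
    "arrival_airport": str,
    "passengers": list,
    "date": str,
}


def missing_or_mistyped(record, schema):
    """Return the sorted schema fields that are absent or have the wrong type."""
    return sorted(
        key for key, expected in schema.items()
        if not isinstance(record.get(key), expected)
    )


# Placeholder JSON standing in for one extracted flight record.
raw = (
    '{"departure_airport": "AAA", "arrival_airport": "BBB", '
    '"passengers": ["Passenger One"], "date": "2000-01-01"}'
)
flight = json.loads(raw)
problems = missing_or_mistyped(flight, FLIGHT_SCHEMA)
```

A check like this is a cheap way to flag records that need a human look before they become a card on the site.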
The JFlights example served as the best ad for Reducto and how doc parsing technology can speed up hours of journalistic investigations like this.
See for yourself. Given this document
https://www.jmail.world/drive/HOUSE_OVERSIGHT_002031
It inferred and enriched multiple flight cards on JFlights (https://www.jmail.world/flights). I was really shook when I first saw this.
This might be our coolest case study yet. Thanks for the mention!
One interesting thread to pull is "Stuff released and then Yanked back" ...
Images removed from Epstein files less than a day after being posted - https://www.abc.net.au/news/2025-12-21/images-removed-from-e...
promises all the sleuthing excitement of chasing the significance of Donald in a Drawer.
Images were also planted to falsely suggest incriminating evidence.
while true, it would probably be useful to provide examples. The one that I am aware of seems to be a picture showing Clinton, Michael Jackson, and Diana Ross with "redacted" victims
https://www.imdb.com/news/ni65628031/
https://bsky.app/profile/meidastouch.com/post/3mag7myutmc2d
however it seems that this photo is actually taken from a 2003 Democratic fundraiser, and the redacted images of victims were of Diana Ross' son Evan, and Michael Jackson's kids, Paris and Prince Jackson. This may or may not be accurate either, since I have not been able to dig down into the photo and determine if it has any connections to a supposed 2003 fundraiser.
But it seems more likely to be true than not that this was sloppily planted evidence that was especially insultingly fake.
on edit: looking closer, these do not seem to be the exact same photo, but two different photos taken at the same time and place: both from the 2003 Dem fundraiser, just different shots. So it could be that Epstein had it and DOJ thought hey, look at these pervs! Let's release!!
Is it possible that one is an input photo and the other is generative AI output?
As you say, it's not the same photo. If the one in the dump was in Epstein's possession, the reason for the redactions are either that some drone in the DOJ just redacted all children out of habit, or that it was deliberately done in such a way as to frame Clinton. I can't decide which I find more credible.
I think if it hadn't been those adults with the kids, an alert staffer might have thought "whose kids are these, these aren't young teenage girls, I better double check." But Michael Jackson, kids, Clinton arms around him, Diana Ross with young male, they're thinking they walked into an armory filled with nothing but smoking guns!
>the reason for the redactions are either that some drone in the DOJ just redacted all children out of habit, or that it was deliberately done in such a way as to frame Clinton
They were supposed to redact all minors, not just "victims".
There’s no need to frame Clinton, there is plenty of evidence he was friends with and spent a lot of time with Epstein.
Similar situation with Trump, for that matter.
It is perfectly possible, even common, to frame the guilty. It’s easier than finding real evidence.
Sure, but in this case there already is plenty of real evidence.
I see people are not clued into this and incredulously downvote, because the file release appears to them to be in good faith, such that illegal evidence tampering is out of the question.
See https://news.ycombinator.com/item?id=46341688
The post you link to is deleted.
[flagged]
But, whoever’s doing the redacting sees the original right? What prevents the redactor from saying, “here’s what the document really said.” Or “here’s who’s in the image, I saw it before I redacted it?”
The idea of spending the rest of their life in prison is what stops them
Yeah but a few words from somebody like Ghislaine could completely fuck shit up for a lot of people.
Of course, she'll have hanged herself shortly afterward while the security cameras were malfunctioning.
Part of the law mandates that all redactions will be listed for Congress within 15 days.
I’d guess a first pass is done automatically? E.g., if a page mentions, say, Trump, just redact that whole page/paragraph/etc. So the people who do the closer reading to redact further probably don’t actually know the scale of what was already redacted. Just a guess though.
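Purely to illustrate that guess (nothing here reflects how the actual redaction was done), an automatic first pass at paragraph granularity could be as simple as the sketch below. The function name, the mask string, and the placeholder names are all hypothetical.

```python
import re


def redact_paragraphs(text, names, mask="[REDACTED]"):
    """Replace every paragraph that mentions any of `names` with `mask`."""
    if not names:
        return text  # an empty pattern would otherwise match everything
    pattern = re.compile(
        "|".join(r"\b" + re.escape(name) + r"\b" for name in names),
        re.IGNORECASE,
    )
    paragraphs = text.split("\n\n")
    return "\n\n".join(mask if pattern.search(p) else p for p in paragraphs)


# Placeholder text and names, not taken from any real document.
sample = "Jane Doe called on Monday.\n\nThe meeting was moved to Tuesday."
redacted = redact_paragraphs(sample, ["Jane Doe"])
```

Because whole paragraphs disappear before any human reads them, a later reviewer working from this output would indeed have no way to tell how much was removed in the first pass.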
People who they think will do this don't get to be redactors. It's all about power and relationships, not technology.
Given how MTG went completely silent despite her high profile platform, I'm guessing the civil (or at this point, royal) servants don't want their families harmed.
That’s a good point. I would imagine they break it up into pieces - in a reCAPTCHA sorta way - and any given person sees a sentence or a piece of a sentence.
An alternative would be to strip out all obvious known words and only leave unknowns (i.e., names) and then have those fragments reviewed (in a reCAPTCHA sorta way).
Finally, for images, cover all faces and then one by one decide which should remain covered and which should not.
LOTS of work but there are workflows to mitigate the ability for reviewers to connect more than they should.
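A toy sketch of the "strip known words, review the leftovers in pieces" idea from above. This is speculation about a possible workflow, not a description of any real one; the tiny word list stands in for a real dictionary, and the function names, reviewers, and sample sentence are invented for illustration.

```python
import random

# Stand-in for a real dictionary of known, harmless words.
COMMON_WORDS = {"the", "a", "met", "with", "at", "on", "and", "in"}


def unknown_tokens(sentence):
    """Keep only tokens that are not in the known-word list (likely names)."""
    tokens = [t.strip(".,;:!?\"'()") for t in sentence.split()]
    return [t for t in tokens if t and t.lower() not in COMMON_WORDS]


def assign_fragments(fragments, reviewers, seed=0):
    """Shuffle fragments and deal them round-robin so no reviewer sees a full run."""
    rng = random.Random(seed)
    shuffled = list(fragments)
    rng.shuffle(shuffled)
    queues = {reviewer: [] for reviewer in reviewers}
    for i, fragment in enumerate(shuffled):
        queues[reviewers[i % len(reviewers)]].append(fragment)
    return queues


fragments = unknown_tokens("Jane Doe met with John Roe at the office.")
queues = assign_fragments(fragments, ["reviewer_1", "reviewer_2"])
```

Shuffling before dealing is what breaks up adjacent tokens; with a very small reviewer pool, neighbors can still land in the same queue, so this limits what a reviewer can reconstruct rather than eliminating it.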
I'm being snarky and this isn't such a serious comment and I don't really mean this for Gemini but can you imagine using something like Gemini ("Hi, please comb through this") and it just refuses on ethical grounds
We found that Codex indeed refuses but Claude + Gemini are willing to RAG it
also, shoutout to Jason Liu (https://news.ycombinator.com/user?id=jxnlco) for discovering that one. His turbopuffer-based version of Jemini is coming soon!
Usually Claude is the prude. Personally I haven't even tried for fear what I'd find. I can stomach homicide and war pictures, but Epstein is too much.
I just have real institutional problems with Google, they have all the best tech minds but some things are just off limits to them being politically correct
And no, not Epstein. It's a general statement; but it's disappointing that they're like this (and of course Gemini was famously the one that gave black Nazis and things like that)
Google has never fixed their black people/gorilla issue. The foundational tech that all of their products run on going back a decade is fundamentally flawed (and produces outputs that many would say align with racist ideologies, among others).