Where did you get all the data? The justice.gov site didn’t have a mass download option that I could find.
https://www.jmail.world/about
"We compiled these Epstein estate emails from the House Oversight Committee release by converting the PDFs to structured text with an LLM"
and:
"Data Sources
Gmail emails: House Oversight Committee
Yahoo emails: DDoSecrets (brought to us by Drop Site News)
Document parsing and extraction powered by reducto"
Yes. Many of the files were also PPM images (or encoded as such) inside PDFs, so I used cheap, lightweight multimodal LLMs to classify the documents from the images. It was surprisingly cheap: under $1 for a few thousand PDFs/images.
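For anyone wanting to reproduce the "PPM images inside PDFs" step, here is a minimal sketch. It assumes poppler's `pdfimages` CLI is installed (the comment above does not say which tool was actually used). The helper names `pdfimages_command`, `extract_images`, and `read_ppm_header` are made up for illustration. The classification call is left out because the model and API are not named.

```python
import re
import subprocess
from pathlib import Path

_SKIP = re.compile(rb"(?:\s+|#[^\n]*\n)*")  # whitespace and '#' comment lines
_NUM = re.compile(rb"\d+")


def pdfimages_command(pdf_path, out_prefix):
    """Build a poppler `pdfimages` call (assumed tool, not confirmed by the thread).

    With no format flag, pdfimages writes embedded images as PBM/PGM/PPM files
    named <out_prefix>-NNN.<ext>.
    """
    return ["pdfimages", str(pdf_path), str(out_prefix)]


def extract_images(pdf_path, out_dir):
    """Run pdfimages and return the PPM files it produced for this PDF."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(pdf_path).stem
    subprocess.run(pdfimages_command(pdf_path, out_dir / stem), check=True)
    return sorted(
        p for p in out_dir.iterdir()
        if p.name.startswith(stem + "-") and p.suffix == ".ppm"
    )


def read_ppm_header(data: bytes):
    """Parse a binary PPM (P6) header.

    Returns (width, height, maxval, pixel_offset), where pixel_offset is the
    index of the first pixel byte in `data`.
    """
    if not data.startswith(b"P6"):
        raise ValueError("not a binary PPM (P6) file")
    pos = 2
    values = []
    while len(values) < 3:
        pos = _SKIP.match(data, pos).end()  # skip separators and comments
        m = _NUM.match(data, pos)
        if m is None:
            raise ValueError("malformed PPM header")
        values.append(int(m.group()))
        pos = m.end()
    # Exactly one whitespace byte separates maxval from the pixel data.
    return values[0], values[1], values[2], pos + 1
```

Checking the header before sending anything to a model is a cheap way to skip tiny or empty images. Each surviving image would then be converted to PNG or JPEG and passed to whichever multimodal model you pick.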