Hacker News

smallstepforman a day ago [ - ]

Today I asked Gemini to extract a table from a PDF appendix and create a C++ data table with its contents. After 15 or so iterations of corrections and new mistakes, it eventually gave up. I was floored when it said “I’m sorry, I cannot do this simple task, I’ve exceeded my error threshold and cannot do this task for you. My LLM prediction engine invents data instead of doing a simple data copy/reformat”.

Stunned to see that Gemini threw its digital arms in the air and gave up.

hashta a day ago [ - ]

That's interesting because my experience has been almost the opposite. A few months ago I tested Gemini on converting screenshots of tables from PDF files into CSV. I tried it on several different tables and it got every one right. It consistently outperformed ChatGPT.

lxgr 9 hours ago [ - ]

The key here is that you used screenshots. This forces Gemini into "OCR mode" (i.e. actually looking at vision tokens) rather than trying to be clever with its tool calls.

The latter strategy almost entirely depends on the quality of the skills and tool calls exposed to a given agent.

blensor 13 hours ago [ - ]

Tangentially related question: has anyone analyzed whether the content being converted could break the model?

So let's say you have a super dull PDF (or even a scan) that has the same line over and over again. Could this get the model into one of those loops that just keeps spewing nonsense?

And taking that further, could someone prompt inject a model with a handwritten note that only gets "activated" once it's in the context?

jatora 21 hours ago [ - ]

anyone who has used both knows this is inaccurate or dishonestly stated (ie. you were using gpt nano or some nonsense)

hashta 3 hours ago [ - ]

I used ChatGPT's web app and I have a Pro subscription.

anigbrowl 20 hours ago [ - ]

It's extremely hit or miss. I've had it one-shot a pretty decent analytic prototype from a brief description, but also had it get trapped in hour-long back and forth regression hell over incredibly simple things like adding a static favicon (ie it would add it, then keep taking it away with every subsequent iteration, breaking something else every time it was asked to put the favicon back etc.).

frankacter 17 hours ago [ - ]

I just tried this and it worked without issue.

Some considerations:

1) tell it to extract the data (in a new session); does that work?

2) if it doesn't, could there be something up with the PDF?

As many commenters suggested, this works well with Gemini, so there is likely a missing variable in play.

Share your prompt and the PDF and let's see if we can determine what.

TightFibre 11 hours ago [ - ]

Long shot, but I wonder if an image of the PDF would do better, if it is getting stuck on internal formats.

lxgr 9 hours ago [ - ]

It definitely does. PDF is a vector-based image format historically, and all add-ons that make it behave a bit more sanely as a text-oriented document format are optional, so your mileage using tools like pdftotext will vary greatly depending on who created a given PDF.

base698 a day ago [ - ]

That's better than the loop Grok got stuck in trying to use git and push the work it did, leading to a $15 API credit deduction.

whh a day ago [ - ]

Getting AI/ML to acknowledge "I don't know" is such a challenge.

jgalt212 a day ago [ - ]

Not true regarding ML: most ML methods support RMSE, even the non-parametric ones.

janalsncm 20 hours ago [ - ]

RMSE is just an extrapolation from the training data. If the data is wrong because the world changed, any model (parametric or not) can be confidently incorrect.

taneq 19 hours ago [ - ]

This is why the world model approach is so important. It allows you to feed back the prediction accuracy of the model to itself at training time, enabling it to predict (to some degree) its own uncertainty. If you jump through a couple of hoops you can also do this at run time to give it “spidey sense” that something’s not right with current inference.

chorizo 17 hours ago [ - ]

I built a little research dashboard that monitors new papers from specific research labs. There is a paper ingest skill that writes Python scripts using pdfplumber to dismantle PDFs. I have also used it to fetch supplementary information to replicate/augment the published tables. It can also use plotdigitizer to extract raw data from plots.

hodgehog11 17 hours ago [ - ]

The PDF reader for Gemini is extraordinarily poor in my experience. I like the writing style of this model a little better, but for most tasks people would use AI for, Gemini is probably not what you want to be using.

trees101 17 hours ago [ - ]

what is a good way to read PDFs using AI?

seanhunter 15 hours ago [ - ]

In my experience it really depends on what sort of pdfs you are trying to extract (ie what the content is).

Regular pdfs that have been produced in a “normal way” (ie using latex or a modern application with a “save to pdf” function) will contain the text, and for those I’ve had a lot of success using pypdf.

“Image” pdfs that have been produced via a scan, and so don’t actually contain a text transcript, require actual OCR. At the moment my personal RAG pipeline is doing this using a local Gemma4 model (you could use something else).
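
For illustration, the split between text pdfs and image pdfs could be sketched roughly like this in Python. `PdfReader` and `extract_text()` are pypdf calls; the helper `pages_needing_ocr` and its 20-character threshold are assumptions made up for this sketch, not something from the comment above.

```python
from typing import List


def extract_page_texts(pdf_path: str) -> List[str]:
    """Return the embedded text of each page using pypdf (third-party)."""
    # Imported lazily so the helper below can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # Image-only pages typically yield an empty string here.
    return [page.extract_text() or "" for page in reader.pages]


def pages_needing_ocr(page_texts: List[str], min_chars: int = 20) -> List[int]:
    """Return zero-based indices of pages with too little text to trust.

    min_chars is an assumed threshold; tune it for your own documents.
    """
    return [
        index
        for index, text in enumerate(page_texts)
        # Count non-whitespace characters only.
        if len("".join(text.split())) < min_chars
    ]
```

Pages flagged by `pages_needing_ocr(extract_page_texts("paper.pdf"))` would then go to whatever OCR step you prefer, while the rest keep their extracted text.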

Either way I do an audit post-ingest where I select a random set of pages and also get the local Gemma model to try that same set and compare. The symptoms to look out for here will depend a lot on what you’re trying to extract, but I’m extracting maths mostly so I get the model to check extraction of symbols, equations etc. One thing I have consistently found useful is to look for “mojibake” (scrambled text caused by decoding in an unintended character encoding), as this almost always catches pdfs that have just extracted as pure garbage. I added this step because I was ingesting a lot of old maths pdfs which have specialist notation that wasn’t always getting correctly ingested, and as they were image pdfs it was coming in as pure garbage. So the fix here is to use a specialist OCR service (I have been using “mathpix”, which has been great and isn’t too expensive if you don’t want to do too much).
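
A rough sketch of what such a post-ingest check might look like, using only the Python standard library. Everything here is an assumption for illustration: the marker strings, the 5% threshold, and the idea of counting private-use code points (which some extractors emit when they cannot map a glyph) are heuristics, not the commenter's actual audit.

```python
import random
import unicodedata
from typing import List

# Sequences that often appear when UTF-8 bytes are decoded as Latin-1 or
# Windows-1252. Caveat: "Ã" and "Â" are legitimate letters in some languages.
MOJIBAKE_MARKERS = ("Ã", "Â", "â€")


def suspicious_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that look like extraction garbage."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    bad = 0
    for c in chars:
        # U+FFFD, control characters, private-use and unassigned code points.
        if c == "\ufffd" or unicodedata.category(c) in ("Cc", "Co", "Cn"):
            bad += 1
    # Each occurrence of a marker sequence counts once.
    bad += sum(text.count(marker) for marker in MOJIBAKE_MARKERS)
    return bad / len(chars)


def looks_garbled(text: str, threshold: float = 0.05) -> bool:
    """True if the share of suspicious characters exceeds the assumed threshold."""
    return suspicious_ratio(text) > threshold


def pick_audit_pages(page_count: int, sample_size: int, seed: int = 0) -> List[int]:
    """Choose a reproducible random sample of zero-based page indices to audit."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(page_count), min(sample_size, page_count)))
```

Pages picked by `pick_audit_pages` whose text makes `looks_garbled` return `True` are candidates for re-extraction with OCR. A check like this only catches gross failures; subtle symbol errors in equations still need a comparison against a second extraction.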

The other thing that can cause problems is tables (eg if you were trying to ingest a lot of pdfs like financials of companies etc). Those can cause problems for both the OCR and the pure text extraction methods. I don’t have a current recommendation for that because I haven’t done it recently enough and the state of the art has moved a lot. It’s something to be aware of that will require special treatment though.

lxgr 10 hours ago [ - ]

> regular pdfs that have been produced in a “normal way” (ie using latex or a modern application with a “save to pdf” function) will contain the text

Producing "normal PDFs" that way actually requires specific LaTeX options to be enabled in my experience. Without that, PDF viewers have to perform all kinds of ugly hacks to even figure out what Unicode codepoint a given glyph is supposed to represent! PDF is much more of a vector format, and much less of a document layout format, than most people seem to realize.

> One thing I have consistently found useful is to look for “mojibake” (scrambled text caused by decoding in an unintended character encoding)

This is exactly the problem with PDFs: It's not regular mojibake (i.e. interpreting a string of text in the wrong charset), but rather some PDF processor's failed attempt at mapping glyphs back to codepoints without an explicit mapping table being present in the PDF, which is something that the creator actively has to do.

> “Image” Pdfs that have been produced via a scan so don’t actually contain a text transcript require actual OCR.

For the reason above and others, in my experience, OCR actually works significantly better than trying to "semantically parse" the PDF.

seanhunter 6 hours ago [ - ]

Hmm. Not sure what I'm doing that's special but both latex pdfs I produce and others that I read generally work just fine with pypdf, and I really am not adding any flags at all (my makefile says I just go

   latexmk --lualatex -aux-directory=output -output-directory=output $<

). Maybe latexmk is adding some magic?

lxgr 6 hours ago [ - ]

\usepackage{cmap} is usually what does that:

> The cmap package provides character map tables, which make PDF files generated by pdfLATEX both searchable and copy-able in acrobat reader and other compliant PDF viewers.

(from https://ctan.org/pkg/cmap)

seanhunter 5 hours ago [ - ]

I don't use cmap or pdflatex. Weird.

lostsock 16 hours ago [ - ]

I have a standing instruction for any documents that can't natively be read by a given AI to first be converted into .md using https://github.com/microsoft/markitdown which I've found to work really well

wwn_se 13 hours ago [ - ]

Doing a preprocess using some pdf extraction and ocr tool and then feeding that to the big model is usually way more stable.

chrsw 9 hours ago [ - ]

In the broadest sense, I don't think we're there yet. I asked an SoC vendor to provide their chip documentation in Markdown. They refused. So, I went ahead and tried to do it myself with AI.

I tried various AI tools and the results ranged from absolute garbage to something-but-not-quite.

I went ahead and did a section of a huge PDF by hand, just to see if what I was asking for was even feasible. After more than several hours of painstaking work spread across multiple days, I got several chapters to look identical to the source PDF in some Markdown renderers. I had to use some HTML for the more complex tables. I converted some diagrams to Markdown and some to images linked to from the Markdown.

rawoke083600 12 hours ago [ - ]

MinerU works well for getting it into Markdown.

fsmv a day ago [ - ]

You should just have it OCR a screenshot of the PDF; that would probably work better.

0xbadcafebee 16 hours ago [ - ]

It does this pretty often. Gemini is an "intelligent" model, but it's massively nerfed and so isn't useful for real work. If you use it with an agent harness, you need to design the harness to detect this and start a new session. Once it nerfs itself it won't try again.
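
One way a harness could detect this and retry in a fresh session, sketched in Python under loose assumptions. The marker phrases are guesses based on the message quoted at the top of the thread, and `new_session` is a hypothetical factory standing in for whatever client you use; no real API is implied.

```python
from typing import Callable, Optional

# Hypothetical phrases; tune these against real transcripts.
GIVE_UP_MARKERS = (
    "i cannot do this",
    "i give up",
    "exceeded my error threshold",
)


def gave_up(reply: str) -> bool:
    """Heuristic: does the reply contain one of the known give-up phrases?"""
    lowered = reply.lower()
    return any(marker in lowered for marker in GIVE_UP_MARKERS)


def run_with_fresh_sessions(
    new_session: Callable[[], Callable[[str], str]],
    prompt: str,
    max_sessions: int = 3,
) -> Optional[str]:
    """Send the prompt to up to max_sessions independent sessions.

    new_session() must return a send(prompt) -> reply callable that carries
    no history from earlier attempts.
    """
    for _ in range(max_sessions):
        send = new_session()  # fresh session, so the earlier failure is not in context
        reply = send(prompt)
        if not gave_up(reply):
            return reply
    return None  # every session gave up; the caller decides what to do next
```

Substring matching is crude and will miss paraphrased refusals, so a real harness would probably also validate the output itself (for example, by checking that the generated table has the expected number of rows).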

BobbyTables2 20 hours ago [ - ]

Tabula + Excel could probably do it quicker.

jwrallie 15 hours ago [ - ]

The table select option in Okular is also great, as you can manually rearrange the divisions. For low volume, of course. Tabula will work better otherwise. I also suggest Libreoffice Calc, the .csv support is leagues ahead of Excel.
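
Once the table exists as CSV, turning it into a C++ data table is a deterministic reformat that needs no model at all. A minimal Python sketch, assuming a rectangular CSV with a header row whose names are usable as C++ identifiers; the names `Row` and `kTable` and the type-inference rules are invented for this example.

```python
import csv
import io
import re

INT_RE = re.compile(r"-?\d+")
FLOAT_RE = re.compile(r"-?\d+\.\d+")


def column_type(cells):
    """Pick one C++ type for a whole column based on its cell contents."""
    if all(INT_RE.fullmatch(c) for c in cells):
        return "int"
    if all(INT_RE.fullmatch(c) or FLOAT_RE.fullmatch(c) for c in cells):
        return "double"
    return "const char*"


def cpp_literal(cell, ctype):
    """Render one cell as a C++ literal of the column's type."""
    if ctype == "const char*":
        escaped = cell.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    return cell  # numeric cells are copied through unchanged


def csv_to_cpp_table(csv_text, struct_name="Row", table_name="kTable"):
    """Render CSV text (first row is the header) as a C++ array of structs."""
    rows = [r for r in csv.reader(io.StringIO(csv_text)) if r]
    header, data = rows[0], rows[1:]
    # Replace characters that cannot appear in a C++ identifier.
    fields = [re.sub(r"\W", "_", name.strip()) for name in header]
    types = [column_type([row[i] for row in data]) for i in range(len(fields))]

    lines = [f"struct {struct_name} {{"]
    lines += [f"    {t} {f};" for t, f in zip(types, fields)]
    lines += ["};", "", f"constexpr {struct_name} {table_name}[] = {{"]
    for row in data:
        cells = ", ".join(cpp_literal(c, t) for c, t in zip(row, types))
        lines.append(f"    {{{cells}}},")
    lines.append("};")
    return "\n".join(lines)
```

Because every value is copied through rather than regenerated, the script cannot invent data. It does not handle values that overflow `int`, numbers with leading zeros, or an empty table.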

staticman2 a day ago [ - ]

You didn't say whether you were using the App, but the App's performance seems to be severely throttled compared to the API.

anitil 20 hours ago [ - ]

My go-to for this is to screenshot and use the built-in text extraction in the screenshot tool (I'm on a mac), then pass on that text data to whatever processing. It's a pretty good tool so long as the PDF is in OK shape (I've had errors in scanned images).

nradov 20 hours ago [ - ]

It's so horrible that in 2026 people are still publishing important data and specifications in a format like PDF that's difficult for LLMs to consume. We need to drag them kicking and screaming to HTML or Markdown. Heck, even Microsoft Word DOCX is superior for reliable parsing and content extraction.

dannyw 12 hours ago [ - ]

Good luck, getting rid of PDFs is going to be as hard as migrating from JPEG everywhere.

jjice a day ago [ - ]

I haven't heard any accounts of it doing that since Gemini 2.5, but it was pretty easy to get it to do it with a programming task back then after a few failed attempts. Very interesting to hear it'll still do it.

staindk a day ago [ - ]

We've been quite impressed with GCP Document AI. Not sure if it has a free tier but perhaps that's where Google is putting all the good OCR.

mjcohen 20 hours ago [ - ]

Years ago, I used Acrobat to extract tables from a PDF. Had to do it manually, but it pasted nicely into Excel.

citizenpaul 5 hours ago [ - ]

This comment appears to

1. be made up.

2. have successfully nerd sniped HN

To what purpose I'm not sure. Testing an LLM bot?

suuuuuuuu 21 hours ago [ - ]

I envy you that it admitted that rather than simply making up data and lying about it.
