> Sounds like "I don't know programming, so I will just use AI".
If you were leading Tensorlake, running on early stage VC with only 10 employees (https://pitchbook.com/profiles/company/594250-75), you'd focus all your resources on shipping products quickly, iterating over unseen customer needs that could make the business skyrocket, and making your customers so happy that they tell everyone and buy lots more licenses.
Because you're a stellar tech leader and strategist, you wouldn't waste a penny reinventing low-level plumbing that's available off-the-shelf, either cheaply or as free OSS. You'd be thinking about the inevitable opportunity costs: if I build X then I can't build Y, simply because a tiny startup doesn't have enough resources to build both X and Y. You'd quickly conclude that building a homegrown, robust PDF parser would be an open-ended tar pit that keeps you from focusing on making your customers happy and growing the business.
And the rest of us would watch in awe, seeing truly great tech leadership at work, making it all look easy.
I would hire someone who understands PDFs instead of doing the equivalent of printing a digital document and scanning it for "digital record keeping". Stop everything and hire someone who understands the basics of data processing and some PDF.
What's the economic justification?
Let's assume we have a staff of 10 and they're fully allocated to committed features and deadlines, so they can't be shifted elsewhere. You're the CTO and you ask the BOD for another $150k/y (fully burdened) + equity to hire a new developer with PDF skills.
The COB asks you directly: "You can get a battle-tested PDF parser off-the-shelf for little or no cost. We're not in the PDF parser business, and we know that building a robust PDF parser is an open-ended project, because real-world PDFs are so gross inside. Why are you asking for new money to build our own PDF parser? What's your economic argument?"
And the killer question comes next: "Why aren't you spending that $150k/y on building functionality that our customers need?" If you don't give a convincing business justification, you're shoved out the door because, as a CTO, your job is building technology that satisfies the business objectives.
So CTO, what's your economic answer?
The mistake all of you are making is the assumption that PDF rendering means rasterization. Everything else crumbles from that misconception.
So if you receive a PDF full of sections containing prerasterized text (e.g. adverts, 3D-rendered text with image effects, scanned documents, handwritten errata), what do you do? You cannot use OCR because apparently only PDF-illiterate idiots would try such a thing?
I wouldn't start by rasterizing the rest of the PDF. In the business world, unlike academia, bootleg books, and file sharing, the majority of PDFs are computer generated. I know because I do this for a living.
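To make the ordering concrete, here is a minimal sketch of a per-page decision: use the embedded text when it exists, and reach for OCR only on the parts that are already raster. The `Page` record, its fields, and the `min_chars` threshold are all hypothetical stand-ins for whatever a real PDF library would report, not any particular library's API.

```python
from dataclasses import dataclass


@dataclass
class Page:
    # Hypothetical per-page summary; a real PDF library would supply these.
    embedded_text: str  # text recovered from the page's text operators
    image_count: int    # number of raster images placed on the page


def choose_strategy(page: Page, min_chars: int = 20) -> str:
    """Prefer the embedded text layer; fall back to OCR only where needed.

    min_chars is an assumed threshold for "this page really has a text layer".
    """
    has_text = len(page.embedded_text.strip()) >= min_chars
    has_images = page.image_count > 0
    if has_text and not has_images:
        return "text"             # computer-generated page: no rasterizing at all
    if has_text and has_images:
        return "text+ocr-images"  # keep the text layer, OCR only the images
    if has_images:
        return "ocr"              # scanned page: nothing but pixels to work with
    return "empty"                # no text and no images


pages = [
    Page(embedded_text="Invoice 2024-001, total due 1,250.00 EUR", image_count=0),
    Page(embedded_text="Quarterly report with an embedded scanned chart", image_count=2),
    Page(embedded_text="", image_count=1),
    Page(embedded_text="", image_count=0),
]
plan = [choose_strategy(p) for p in pages]
print(plan)
```

The point of the split is that OCR becomes a fallback for the prerasterized parts rather than the first step for every page.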