Does this also extract semantic relationships and data dependencies between fields?
In the past I built an internal tool that transformed insurance PDFs into structured data. I wanted to extract explicit data dependencies between fields to perform validation.
Insurance forms can run to 30-40 pages, and fields on page 40 can depend on fields on page 4 through a few nested if conditions. Would Parsewise be able to extract those relationships?
If yes, how do you do it for large documents?
Yes, we do it with a multi-stage pipeline. First we extract the independent data points (from, say, both page 4 and page 40), and then a second pass establishes the relationships between them (we call this resolution).
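To make the two passes concrete, here is a minimal sketch of the general idea, not Parsewise's actual implementation. The names `DataPoint`, `Rule`, `extract`, and `resolve`, the field names, and the single "required when" rule shape are all assumptions made up for illustration, and the per-page fields are hard-coded where a real pipeline would call a model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataPoint:
    name: str
    page: int
    value: object


@dataclass(frozen=True)
class Rule:
    # Hypothetical dependency: `target` must be filled in when `source` equals `when`.
    source: str
    when: object
    target: str


def extract(pages: dict[int, dict[str, object]]) -> dict[str, DataPoint]:
    """Pass 1: collect every field independently, page by page."""
    points: dict[str, DataPoint] = {}
    for page, fields in sorted(pages.items()):
        for name, value in fields.items():
            points[name] = DataPoint(name, page, value)
    return points


def resolve(points: dict[str, DataPoint], rules: list[Rule]) -> list[str]:
    """Pass 2: check cross-page dependencies between already-extracted points."""
    problems: list[str] = []
    for rule in rules:
        src = points.get(rule.source)
        if src is None or src.value != rule.when:
            continue  # the condition does not apply, so there is nothing to check
        tgt = points.get(rule.target)
        if tgt is None or tgt.value in (None, ""):
            problems.append(
                f"{rule.target} is required because {rule.source} "
                f"(page {src.page}) is {rule.when!r}"
            )
    return problems


# Hard-coded stand-in for what a per-page extraction step might return.
pages = {4: {"has_prior_claims": True}, 40: {"prior_claim_details": ""}}
rules = [Rule("has_prior_claims", True, "prior_claim_details")]

points = extract(pages)
problems = resolve(points, rules)
```

The point of the split is that pass 1 never needs to see page 4 and page 40 at the same time; only the much smaller set of extracted points has to be held together in pass 2.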
On the scale aspect, because we go in multiple passes, we break the scope into small enough pieces and then build it back up in a later step. IIRC the largest document I've seen a customer use was over 1k pages.
There are more complex data dependency scenarios where the data that's extracted and combined (e.g. from page 4 and page 40) then needs to be further transformed in different ways (e.g. producing an evaluation and a clarification outcome at the end). To keep these derived values aligned with each other, we are soon releasing a feature we call derived agents.
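As a rough illustration of "break it into small pieces, then build it back up", here is a sketch under my own assumptions: the chunk size of 25 pages and the helpers `chunk_pages`, `process_chunk`, and `build_up` are hypothetical, and `process_chunk` is a placeholder for whatever per-chunk extraction is actually run.

```python
from collections.abc import Iterable


def chunk_pages(n_pages: int, chunk_size: int) -> list[range]:
    """Split pages 1..n_pages into consecutive ranges of at most chunk_size pages."""
    return [
        range(start, min(start + chunk_size, n_pages + 1))
        for start in range(1, n_pages + 1, chunk_size)
    ]


def process_chunk(pages: range) -> dict[str, int]:
    """Placeholder for per-chunk extraction; here it just records one field per page."""
    return {f"field_on_page_{p}": p for p in pages}


def build_up(partials: Iterable[dict[str, int]]) -> dict[str, int]:
    """Merge the per-chunk results back into one document-level result."""
    merged: dict[str, int] = {}
    for partial in partials:
        merged.update(partial)
    return merged


chunks = chunk_pages(1000, 25)  # assumed chunk size, purely illustrative
result = build_up(process_chunk(c) for c in chunks)
```

Each chunk can be processed independently (and in parallel), so the document length mostly affects how many chunks there are, not how big any single step is.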
1. Incredible! Can I make an unsolicited ask? If you had industry-specific templates for standardized PDFs, it would be easier for me to send Parsewise to the insurance companies I've worked for. Something similar to https://www.useanvil.com/forms/?type=pdf-templates but with your clean, semantic data model.
2. Can I ask how? When I was building something like this, I realized there's an element of burning tokens for correctness. Meaning, splitting things into small units and small processes, each producing a separate LLM output to be combined later. For a 1k page document, what kind of token usage do you see?
Re 1 - that is a very kind offer! Our current public template library is very limited, so let me come back to you on this.
Re 2 - we see exactly the same thing. There is a trade-off between correctness and token usage. However, some models' tokens are cheaper and faster than others, so the small pieces can benefit from that. Token usage is also surprisingly variable, because it depends on the information density of both the document and the question (e.g. is it a single needle in a haystack, or are we analyzing the entire haystack from 10 perspectives?). So parsing 1k pages may be on the order of millions of tokens, while a series of queries (extractions) on top of it could be 1-2 orders of magnitude more.
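For a back-of-the-envelope feel for those orders of magnitude, here is a small sketch. The 1,500 tokens per page is purely an assumed figure for illustration (real density varies a lot, as noted above), and `estimate_tokens` is a made-up helper, not a measurement of any real workload.

```python
def estimate_tokens(pages: int, tokens_per_page: int, query_multiplier: int) -> tuple[int, int]:
    """Return (parsing_tokens, querying_tokens) for a rough estimate."""
    parsing = pages * tokens_per_page
    querying = parsing * query_multiplier
    return parsing, querying


# Assumed: 1,000 pages at 1,500 tokens per page, queries costing 10x to 100x the parse.
parsing, querying_low = estimate_tokens(1000, 1500, 10)
_, querying_high = estimate_tokens(1000, 1500, 100)

print(f"parsing:  {parsing:,} tokens")
print(f"querying: {querying_low:,} to {querying_high:,} tokens")
```

Under those assumptions, parsing lands at 1.5 million tokens and the queries at 15 to 150 million, which is why routing the small pieces to cheaper models matters.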