Hacker News

Document parsing has been top of mind for me lately, because in some of the areas we work on, the bottleneck is becoming the ability to query documents the same way one queries an API.

I keep thinking the most obvious analogue is that we need some way to represent documents the same way we represent structured data in Parquet. Parquet allows easy range-based queries, and there is so much tooling built around Arrow.

But for documents I keep hitting a wall trying to figure out what the right abstractions are. Parquet allows filterable metadata, but what metadata of that kind exists for documents? Then there is the arbitrariness of chunking and vectorization.

If we could do this as a two-step process, where every document to be processed is first represented in a Parquet-like data format, then I think we would at least have the semblance of a solution.
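
To make the two-step idea concrete, here is a rough, stdlib-only sketch. It is not real Parquet, just a column-oriented table of document chunks with filterable metadata. The names (`ChunkTable`, `ingest`, `filter_pages`) and the choice of metadata columns (`doc_id`, `page`, `section`) are all assumptions for illustration, and "one chunk per page" is exactly the kind of arbitrary chunking decision mentioned above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ChunkTable:
    """Hypothetical column-oriented store: one list per column, one index per chunk."""
    doc_id: List[str] = field(default_factory=list)
    page: List[int] = field(default_factory=list)
    section: List[str] = field(default_factory=list)
    text: List[str] = field(default_factory=list)

    def append(self, doc_id: str, page: int, section: str, text: str) -> None:
        self.doc_id.append(doc_id)
        self.page.append(page)
        self.section.append(section)
        self.text.append(text)

    def filter_pages(self, doc_id: str, first_page: int, last_page: int) -> List[str]:
        """Range query over metadata columns; text is only read for matching rows."""
        return [
            self.text[i]
            for i in range(len(self.page))
            if self.doc_id[i] == doc_id and first_page <= self.page[i] <= last_page
        ]


def ingest(table: ChunkTable, doc_id: str, pages: List[Tuple[str, str]]) -> None:
    """Step 1: turn an already-parsed document into rows (one chunk per page here)."""
    for page_number, (section, text) in enumerate(pages, start=1):
        table.append(doc_id, page_number, section, text)


# Step 2: query the intermediate representation instead of the raw document.
table = ChunkTable()
ingest(
    table,
    "contract-a",
    [
        ("Definitions", "Terms used in this agreement."),
        ("Payment terms", "Net 30 days."),
        ("Termination", "Either party may terminate."),
    ],
)
hits = table.filter_pages("contract-a", 2, 3)
```

The open question is still which columns deserve to exist. Page and section are easy, but anything semantic depends on what you want to ask.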

gergelycsegzi a day ago

100%. The really hard challenge is that the intermediate representation (i.e. the Parquet equivalent) depends on the given use case. So what we do with the platform is have users configure the intermediate layer that serves most of their queries, and if they need to extend it, we will suggest it for them. For example, for the demo on the grounded reasoning benchmark I referred to, here is what the intermediate layer looks like, on top of which the agents can query more efficiently: https://demo.parsewise.ai/projects/39bee9d8-d722-4b23-8894-e...