That looks interesting but it seems inefficient to put an LLM directly into the compilation pipeline, not to mention that it introduces nondeterministic behavior.

It has different limitations but inefficiency doesn't seem likely to be one of them. Did you read the Experimental Results section?

> Figure 2 shows the experimental results, and GenDB outperforms all baselines on every query in both benchmarks. On TPC-H, GenDB achieves a total execution time of 214 ms across five representative queries.

> This result is 2.8× faster than DuckDB (594 ms) and Umbra (590 ms), which are the two fastest baselines, and 11.2× faster than ClickHouse.

> On SEC-EDGAR, GenDB achieves 328 ms, which is 5.0× faster than DuckDB and 3.9× faster than Umbra.

> The performance gap increases with query complexity. For example, on TPC-H Q9, which is a five-way join with a LIKE filter, GenDB completes in 38 ms, which is 6.1× faster than DuckDB. GenDB uses iterative optimization with early stopping criteria.

> On TPC-H, Q6 reaches a near-optimal time of 17 ms at iteration 0 with zone-map pruning and a branchless scan, and does not require further optimization. In contrast, Q18 starts at 12,147 ms and decreases to 74 ms by iteration 1, which is a 163× improvement. This gain comes from replacing a cache-thrashing hash aggregation with an index-aware sequential scan.

> On SEC-EDGAR, Q4 decreases from 1,410 ms to 106 ms over three iterations, which is a 13.3× improvement, and Q6 decreases from 1,121 ms to 88 ms over four iterations, which is a 12.7× improvement. In Q6, the optimizer gradually fuses scan, compact, and merge operations into a single OpenMP parallel region, which removes three thread-spawn overheads. By iteration 1, GenDB already outperforms all baselines.

That's all great, but sadly impractical. I looked at one of the first statements:

> GenDB is an LLM-powered agentic system that decomposes the complex end-to-end query processing and optimization task into a sequence of smaller and well-defined steps, where each step is handled by a dedicated LLM agent.

Given typical LLM latency, that puts it outside the realm of OLTP and probably even OLAP. You can't wait tens of seconds to minutes for an LLM to generate some optimal code that you then compile and execute.

No, that's not how I believe they intended it to work. They generate the workload-specific engine up front, not when the query arrives.

Considering it's just a single PhD student doing this work, I don't believe such a task can realistically be accomplished, even as a PoC / research prototype.

Why not? Even without LLMs it is technically feasible to build a custom database engine that performs much better than general-purpose database kernels. And we see this happening all the time, with time series, BLOBs, documents, OLTP, OLAP, logging, etc.

The catch, obviously, is that such development is far too expensive and requires a level of technical capability that isn't all that common. The novelty this paper presents is that those two barriers may have come to an end: we can use LLMs and agents to build custom database engines for ourselves™ and our™ specific workloads, very quickly and at a tiny fraction of the development cost.

Then why do they write the opposite?

If you look at the results, you will see that they are able to execute five TPC-H queries in ~200 ms total. The dataset is not large, rather small in fact (10 GB), but nonetheless, you wouldn't be able to run five queries in such a small amount of time if you also had to analyze the workload, generate the code, build indices, start the agents/engine, and retrieve the results. I didn't read the whole paper, but this is why I think your understanding is wrong.

If they count only query execution time and nothing else, it would make sense, though. It could also be practical if your system runs just a few predefined, heavily optimized queries.

To my understanding, this is akin to what profile-guided optimization (PGO) does in C or C++.