How do you evaluate the synthesis quality? Like if I ask about chunking strategies and it recommends method A over method B for my use case — how do you know that recommendation is actually good? Have you run any benchmarks on recommendation accuracy, or is this more of a "trust the retrieval + LLM reasoning" setup?

We've run a few tests for this. The first was a mapping-based check: we built a method-to-target-problem mapping from the papers themselves, since each paper states which problems its methods target. Then we checked whether our recommendations followed that mapping or not.
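To make the mapping-based check concrete, here's a minimal sketch of how such an evaluation could work. The method names, problem labels, and example cases are all hypothetical placeholders, not the actual mapping from the papers:

```python
# Hypothetical sketch of the mapping-based check described above.
# Assumes we've already extracted, per paper, which problems each method targets.
method_to_problems = {
    "late_chunking": {"long_context_retrieval", "context_fragmentation"},
    "semantic_chunking": {"topic_drift", "boundary_quality"},
    "fixed_window": {"throughput", "simplicity"},
}

def recommendation_agrees(recommended_method: str, user_problem: str) -> bool:
    """True if the paper-derived mapping says this method targets this problem."""
    return user_problem in method_to_problems.get(recommended_method, set())

# Score a batch of (recommended method, user's stated problem) pairs.
cases = [
    ("late_chunking", "long_context_retrieval"),  # agrees with the mapping
    ("fixed_window", "topic_drift"),              # disagrees
]
accuracy = sum(recommendation_agrees(m, p) for m, p in cases) / len(cases)
print(accuracy)  # 0.5
```

The mapping acts as a cheap ground truth: it won't catch every nuance, but it flags recommendations that contradict what the papers themselves claim.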

The most convincing test was, of course, to actually run coding agents with and without our MCP server. Across 100+ real scenarios, we saw improvements on the metrics each user asked for, which was convincing enough for us.
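A paired with/without comparison like that can be scored as a simple win rate per scenario. This is only a sketch of the scoring logic; the scenario records, metric names, and numbers below are invented for illustration:

```python
# Hypothetical sketch of scoring the with/without-MCP comparison.
# Each scenario records the user's requested metric, measured in both conditions.
scenarios = [
    {"metric": "pass_rate", "baseline": 0.62, "with_mcp": 0.71},
    {"metric": "pass_rate", "baseline": 0.55, "with_mcp": 0.54},
    {"metric": "latency_s", "baseline": 40.0, "with_mcp": 35.0},
]

# Some metrics improve by going down, not up.
LOWER_IS_BETTER = {"latency_s"}

def improved(s: dict) -> bool:
    """Did the with-MCP run beat the baseline on this scenario's metric?"""
    if s["metric"] in LOWER_IS_BETTER:
        return s["with_mcp"] < s["baseline"]
    return s["with_mcp"] > s["baseline"]

win_rate = sum(improved(s) for s in scenarios) / len(scenarios)
print(round(win_rate, 2))  # 0.67
```

Scoring each scenario on the metric the user actually asked for, rather than one global benchmark metric, is what makes the 100+ scenarios comparable despite measuring different things.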