We've run a few tests for this. The first was fairly novel: we built a method-to-target-problem mapping from the papers themselves, since each paper states which problems its methods target. Then we checked whether our recommendations followed that mapping.
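To make the mapping check concrete, here is a rough sketch of how such an agreement score could be computed. The method names, problem labels, and the `agreement_rate` helper are all made up for illustration; this is not our actual test harness, just the idea.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical mapping: method -> problems its paper says it targets.
PAPER_MAPPING: Dict[str, Set[str]] = {
    "method_a": {"slow_convergence", "overfitting"},
    "method_b": {"high_memory_use"},
}


def agreement_rate(
    recommendations: List[Tuple[str, str]],
    mapping: Dict[str, Set[str]],
) -> float:
    """Fraction of (problem, recommended_method) pairs consistent with the mapping."""
    if not recommendations:
        return 0.0
    hits = sum(
        1
        for problem, method in recommendations
        if problem in mapping.get(method, set())
    )
    return hits / len(recommendations)


# Illustrative (problem, recommended method) pairs, not real data.
recs = [
    ("overfitting", "method_a"),      # consistent with the mapping
    ("high_memory_use", "method_b"),  # consistent with the mapping
    ("high_memory_use", "method_a"),  # not a problem method_a's paper targets
]
rate = agreement_rate(recs, PAPER_MAPPING)
```

A recommendation counts as a hit only when the paper for the recommended method lists the problem at hand, so unknown methods simply score as misses.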
The best test, of course, was to actually run coding agents with and without our MCP server. Across 100+ such real scenarios, we saw performance improvements on the metrics the user asked for, which was convincing enough for us at the time.
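A with/without comparison like this can be summarized as a per-scenario win rate. The sketch below assumes each scenario records the user-requested metric for both runs and whether higher is better; the `Scenario` type and the numbers are placeholders for illustration, not our results.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Scenario:
    metric: str             # metric the user asked for (hypothetical names)
    higher_is_better: bool  # direction of improvement for this metric
    baseline: float         # agent run without the MCP server
    with_mcp: float         # agent run with the MCP server


def improved(s: Scenario) -> bool:
    """True if the MCP run beats the baseline in the metric's own direction."""
    if s.higher_is_better:
        return s.with_mcp > s.baseline
    return s.with_mcp < s.baseline


def win_rate(scenarios: List[Scenario]) -> float:
    """Fraction of scenarios where the MCP run improved the requested metric."""
    if not scenarios:
        return 0.0
    return sum(improved(s) for s in scenarios) / len(scenarios)


# Placeholder values only, to show the shape of the comparison.
demo = [
    Scenario("accuracy", True, 0.80, 0.85),
    Scenario("latency_ms", False, 120.0, 100.0),
    Scenario("accuracy", True, 0.90, 0.88),
]
demo_win_rate = win_rate(demo)
```

Tracking the direction per metric matters because users ask for different things: a lower latency and a higher accuracy both count as wins.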