"Move beyond benchmarks"… then proceeds to list a bunch of benchmarks.

The problem for me is that it’s not worth running these myself. Sure, I may pay attention to which model is better at tool calling, but what matters is how well it does at my use case.