Spend an hour or an afternoon creating your own eval harness with problems or workloads from your private repos or personal projects.
Use frontier LLMs to help create the harness and identify problems, but put in the effort to ensure your verifier is actually good and robust.
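For example, a minimal harness can be a single Python file: loop over problems, call the model, run a verifier. A sketch below, assuming the official OpenAI Python client; the problem set, the check functions, and the model name are toy placeholders to swap for your own workloads.

```python
# A minimal sketch of a private eval harness. Assumes the official OpenAI
# Python client; problems, checks, and model name are placeholders to
# replace with workloads from your own repos.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Each problem pairs a prompt with a programmatic verifier.
# A real verifier should be much more robust than this keyword check.
PROBLEMS = [
    {
        "prompt": "In one sentence: what does `git rebase --onto main feature` do?",
        "check": lambda ans: "replay" in ans.lower() or "reappl" in ans.lower(),
    },
]

def ask(prompt: str, model: str = "gpt-4o") -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def run_eval(model: str) -> float:
    passed = sum(1 for p in PROBLEMS if p["check"](ask(p["prompt"], model)))
    return passed / len(PROBLEMS)

if __name__ == "__main__":
    print(f"pass rate: {run_eval('gpt-4o'):.0%}")
```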
Then you have your own private benchmark, which makes evaluating new model releases a breeze instead of an exercise in pure vibes or contaminated public benchmarks.
For extra props, add checks for things you care about, such as reliability (e.g. deliberate noise injection, introducing simple typos into problems, testing variants, running each test multiple times).
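One hedged way to do the reliability part: generate typo'd variants of each prompt, run everything several times, and score the aggregate. This reuses the hypothetical `ask` and problem format from the sketch above.

```python
# Reliability probing sketch: perturb each problem with simple typos and
# run each variant several times. Reuses the hypothetical `ask` and
# problem dict format from the harness sketch above.
import random

def inject_typo(text: str, rng: random.Random) -> str:
    """Swap two adjacent characters at a random position."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

def reliability_score(problem, model: str, n_variants: int = 3, n_runs: int = 3) -> float:
    rng = random.Random(0)  # fixed seed so the benchmark stays reproducible
    prompts = [problem["prompt"]] + [
        inject_typo(problem["prompt"], rng) for _ in range(n_variants)
    ]
    # Score across all variants and repeats; a robust model should barely
    # drop relative to its clean single-run score.
    results = [problem["check"](ask(p, model)) for p in prompts for _ in range(n_runs)]
    return sum(results) / len(results)
```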
At the end of the day, however, the best LLM is the one you're most productive with. Frontier intelligence might be the main factor, but it is far from the only one:
• How fast is it in the real world? How well does it understand your general style of prompting / guidance?
• How consistent and reliable is it? Does it get lazy, or hallucinate actions it never performed while claiming it did? (one rough way to measure this is sketched after this list)
• etc.
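For the speed and consistency bullets, a rough sketch (again reusing the hypothetical `ask` from the harness above) that times each call and flags prompts where the model is flaky across identical runs:

```python
# Rough speed + consistency probe: time each call and check whether
# repeated runs of the identical prompt agree on pass/fail.
import time

def latency_and_consistency(prompt: str, check, model: str, n_runs: int = 5):
    times, outcomes = [], []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        answer = ask(prompt, model)
        times.append(time.perf_counter() - t0)
        outcomes.append(check(answer))
    pass_rate = sum(outcomes) / n_runs
    return {
        "median_latency_s": sorted(times)[n_runs // 2],
        "pass_rate": pass_rate,
        # flaky = sometimes right, sometimes wrong on the identical prompt
        "flaky": 0 < pass_rate < 1,
    }
```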