These can be useful for labs training models. I don't see them as particularly valuable for building AI systems. Real performance depends on how the system is built, much more so than the underlying LLM.
Evaluating the system you build on relevant inputs is what matters most. Beyond that, it would be nice to see benchmarks that give guidance on how an LLM should be used as a system component, not just which one is "better" at something.
My thinking is this: the moment a benchmark is public, it’s probably in the training data. The real evals are the ones you have to build yourself, on the problem you’re trying to solve and the data you’re using.
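For illustration, a private eval can be very small to start with: a handful of your own inputs and whatever pass criterion fits your task. Everything named below (`call_model`, the cases, the check) is a placeholder to swap for your real client and data, not any particular library:

```python
# A minimal sketch of a private eval: your own inputs, your own pass criteria.

def call_model(prompt: str) -> str:
    # Placeholder: swap in your actual API client here.
    return "stub response"

def passes(expected: str, actual: str) -> bool:
    # Simplest possible check; replace with whatever "correct" means for your task.
    return expected.lower() in actual.lower()

def run_eval(cases: list[dict]) -> float:
    # Returns the fraction of cases the model gets right.
    results = [passes(c["expected"], call_model(c["prompt"])) for c in cases]
    return sum(results) / len(results)

if __name__ == "__main__":
    # Cases come from your own data, not a public benchmark.
    cases = [
        {"prompt": "Summarise this ticket: ...", "expected": "refund"},
        {"prompt": "Classify this email: ...", "expected": "spam"},
    ]
    print(f"pass rate: {run_eval(cases):.0%}")
```

Even a dozen hand-written cases like this tends to tell you more about your use case than any leaderboard.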
I see there are lots of courses on evals being sold on Maven. Some cost as much as USD 3,500. Are they worth it? https://maven.com/parlance-labs/evals
"Move beyond benchmarks"… then proceeds to list a bunch of benchmarks.
The problem for me is that it’s not worth running these myself. Sure, I might pay attention to which model is better at tool calling, but what matters is how well it does on my use case.
I think cost should also be a direct consideration. Model performance varies wildly on benchmarks when given a budget. https://substack.com/@andrewplassard/note/p-173487568?r=2fqo...
I’ve been building a tool to help with this - Safety Evals In-a-Box [https://github.com/elemeno/seibox]. It’s a work in progress and not quite ready for public release, but it’s a multi-model eval runner (primarily for safety-oriented evals, though there’s no reason it can’t run other types as well) and it includes cost and latency in its reporting.
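To give a sense of the shape (this is not seibox's API, just a generic sketch of a multi-model run that reports pass rate, cost, and latency; prices and the `call_model` stub are hypothetical):

```python
# Generic multi-model eval loop tracking pass rate, cost, and latency per model.
import time

# Hypothetical per-1K-token prices; substitute your providers' real pricing.
PRICE_PER_1K_TOKENS = {"model-a": 0.01, "model-b": 0.002}

def call_model(model: str, prompt: str) -> tuple[str, int]:
    # Placeholder: return (response_text, tokens_used) from your real client.
    return "stub response", 150

def run(models, cases, check):
    report = {}
    for model in models:
        passed, tokens, started = 0, 0, time.perf_counter()
        for case in cases:
            answer, used = call_model(model, case["prompt"])
            passed += check(case, answer)   # check() returns True/False
            tokens += used
        report[model] = {
            "pass_rate": passed / len(cases),
            "cost_usd": tokens / 1000 * PRICE_PER_1K_TOKENS[model],
            "latency_s": (time.perf_counter() - started) / len(cases),
        }
    return report
```

Reporting cost and latency next to accuracy is the point: a model that scores a few points lower but costs a tenth as much is often the right pick for a system component.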
How can the community tell if models overfit to these benchmarks?
By the composition of the evals, plus secondary metrics like parameter size and token cost.
Not perfect, but useful.
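One practical signal, assuming you have a private eval of roughly comparable difficulty: compare the public benchmark score against your private score, and treat a large gap as a hint that the public set leaked into training. The numbers and threshold below are illustrative only:

```python
# Rough heuristic: a model that looks much better on a public benchmark than
# on a private eval of similar difficulty is a contamination/overfitting flag.
def overfit_gap(public_score: float, private_score: float) -> float:
    """Positive gap means the model does better on the public set than on yours."""
    return public_score - private_score

# Illustrative numbers, not real measurements.
gaps = {"model-a": overfit_gap(0.92, 0.71), "model-b": overfit_gap(0.88, 0.84)}
flagged = [m for m, gap in gaps.items() if gap > 0.10]  # threshold is a judgement call
print(flagged)  # ['model-a']
```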