How is everyone monitoring the skill/utility of all these different models? I am overwhelmed by how many there are, and by the challenge of tracking their capability across so many different modalities.
https://www.swebench.com
https://swe-rebench.com
https://livebench.ai/#/
https://eqbench.com/#
https://contextarena.ai/?needles=8
https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com...
https://artificialanalysis.ai/leaderboards/models
https://gorilla.cs.berkeley.edu/leaderboard.html
https://github.com/lechmazur/confabulations
https://dubesor.de/benchtable
https://help.kagi.com/kagi/ai/llm-benchmark.html
https://huggingface.co/spaces/DontPlanToEnd/UGI-Leaderboard
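One way to keep the leaderboards above manageable is to pull the handful of scores you care about into a single table yourself. A minimal sketch, assuming you've copied scores by hand (the benchmark names and numbers below are made-up placeholders, not real leaderboard data, and none of these sites is guaranteed to expose a common API):

```python
# Hypothetical sketch: combine per-model scores from several leaderboards
# into one comparable number via min-max normalization and averaging.

def normalize(scores):
    """Min-max normalize a dict of model -> raw score to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {m: 1.0 for m in scores}
    return {m: (s - lo) / (hi - lo) for m, s in scores.items()}

def aggregate(benchmarks):
    """Average each model's normalized score across the benchmarks it appears in."""
    totals, counts = {}, {}
    for scores in benchmarks.values():
        for model, s in normalize(scores).items():
            totals[model] = totals.get(model, 0.0) + s
            counts[model] = counts.get(model, 0) + 1
    return {m: totals[m] / counts[m] for m in totals}

# Placeholder data, entered manually from whichever leaderboards you follow.
benchmarks = {
    "swe-bench": {"model-a": 62.0, "model-b": 48.5},
    "livebench": {"model-a": 71.2, "model-b": 69.8},
}

for model, score in sorted(aggregate(benchmarks).items(), key=lambda kv: -kv[1]):
    print(f"{model}: {score:.2f}")
```

Min-max normalization is a crude but common choice here; it keeps benchmarks with different score ranges from dominating the average, though it is sensitive to which models you include.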
I’d stick to Artificial Analysis.
That has many of its own problems as well.
This is the best summary, in my opinion. You can also see the individual scores on the benchmarks they use to compute their overall scores.
It's nice and simple in the overview mode though. Breaks it down into an intelligence ranking, a coding ranking, and an agentic ranking.
https://artificialanalysis.ai/
Unfortunately it's completely unusable on mobile
Works fine for me, but you could also just turn on desktop view in your mobile browser if it isn't big enough on your screen.
I use Firefox Mobile, so perhaps there is a difference on Chromium-based browsers?