As usual with LLMs. In my experience, all those metrics are mainly useful for telling which models are definitely bad, but they don't tell you much about which ones are good, and especially not how the good ones stack up against each other in real-world use cases.
Andrej Karpathy famously quipped that he only trusts two LLM evals: Chatbot Arena (where humans compare responses blind and vote for the better one) and the r/LocalLLaMA comment section.
LMArena isn't that useful anymore lol
I actually agree with that, but it's generally better than other scores. Also, the quote is like a year old at this point.
In practice you have to evaluate the models yourself for any non-trivial task.
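For what it's worth, here is a minimal sketch of what "evaluate it yourself" can look like: a blind pairwise comparison over your own prompts, in the spirit of the arena setup. Everything named here is an assumption for illustration: `blind_pairwise_eval`, the `Model` and `Judge` callables, and the stub models are hypothetical, and in real use the judge would be you (or a teammate) reading two unlabeled answers.

```python
import random
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical types: a model maps a prompt to a response; a judge sees the
# prompt plus two unlabeled responses and returns 0 or 1 for the one it prefers.
Model = Callable[[str], str]
Judge = Callable[[str, str, str], int]


def blind_pairwise_eval(
    prompts: List[str],
    model_a: Model,
    model_b: Model,
    judge: Judge,
    seed: int = 0,
) -> Dict[str, float]:
    """Return the fraction of prompts on which each model was preferred."""
    if not prompts:
        raise ValueError("need at least one prompt")
    rng = random.Random(seed)
    wins: Counter = Counter()
    for prompt in prompts:
        answers = [("a", model_a(prompt)), ("b", model_b(prompt))]
        rng.shuffle(answers)  # hide which model produced which answer
        picked = judge(prompt, answers[0][1], answers[1][1])
        wins[answers[picked][0]] += 1
    return {name: wins[name] / len(prompts) for name in ("a", "b")}


if __name__ == "__main__":
    # Stub models and judge, purely to show the call shape.
    shouty = lambda p: p.upper()
    quiet = lambda p: p.lower()
    prefers_shouty = lambda prompt, first, second: 0 if first.isupper() else 1
    print(blind_pairwise_eval(["Fix this bug", "Summarize this"],
                              shouty, quiet, prefers_shouty))
```

The shuffle is the only part that really matters: if you know which model wrote which answer, your prior about the models leaks into the score. Swap the stubs for your actual model calls and a prompt set drawn from your real task.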