Hacker News

Tbf, most of the "real benchmarks" have issues that are just as bad. Assessing LLM performance is just hard.

And personal too. Different engineers are using them for different use cases.