I agree. I have rather constrained use cases for LLMs and the agentic harnesses that I use with them.

I try one or two of my use cases with new models or harnesses, make my own, often subjective, judgements, and largely ignore benchmarks.

Blogging and writing in general are a business, or feed other tech-adjacent businesses, and a lot of writing about evals is attention-getting. Nothing wrong with that, but there is a lot of noise.