Same with LLM benchmarks these days.

Well, the pelican benchmark is easily verifiable.

Kind of hard to judge, though. How good a pelican looks isn’t really objective.

Or a bicycle!