I mean, in other environments people say that.
If you asked "What's the best bicycle", most enthusiasts would say one you tried, works for your usecase, etc.
Benchmarks should be for pruning models you try at the absolute highest level, because at the end of the day it's way too easy to hack them without breaking any rules (post-train on the public, generate a ton of synthetic examples, train on those, repeat)