For the most part I think we get the benchmarks we deserve.

Many SWE-bench-passing PRs would not be merged: https://news.ycombinator.com/item?id=47341645

Top model SWE-bench scores may be skewed by git history leaks: https://news.ycombinator.com/item?id=45214670