For the most part I think we get the benchmarks we deserve.
Many SWE-bench-passing PRs would not be merged: https://news.ycombinator.com/item?id=47341645
Top model SWE-bench scores may be skewed by git history leaks: https://news.ycombinator.com/item?id=45214670