It’s honestly far better to just ignore SWEBench Verified in 2026. Multiple labs have noted issues with contamination, and achieving high scores require memorisation of what passes the prescriptive verifier; not what is a correct solution.
40% vs 90%? Sure.
70% vs 90%? _Absolutely meaningless_ as you are not measuring coding intelligence but “how well can the model cheat flaws in SWEBench Verified”, the former can certainly be better at coding even assuming no deliberate benchmaxxing / foul play.