This benchmark paints a very different picture, with GPT 5.5 at the very top with 70% and DeepSeek at 8%:
https://deepswe.datacurve.ai
DeepSWE has been heavily criticized, though: https://github.com/datacurve-ai/deep-swe/issues/21 Putting GPT 5.5 on top is the obviously correct part, but everything else about it makes very little sense.