I'm a co-creator of SWE-bench:

1. SWE-bench Verified is now saturated at 93.9% (congrats Anthropic), but anyone who hasn't reached that number yet still has more room for growth.

2. SWE-bench Multilingual and SWE-bench Multimodal (which we'll open source in the next month) are still unsaturated.

3. All benchmarks and benchmark paradigms eventually become saturated. That's why the SWE-bench team has worked hard on building the next stage of benchmarks, and we have a few that are already out, for example https://codeclash.ai/ or https://algotune.io/ . And we'll have more to say soon :)

They're not saying "Don't use SWE-bench Verified because it's saturated".

They're saying:

1. A large number of the tests are inaccurate, so correct solutions will be marked as incorrect.

2. Frontier models have already read and memorized the PRs the problems are based on.

3. In fact, many problems are essentially impossible to get right if you haven't memorized the solution: for example, the test cases will fail if you didn't happen to expose a helper function with a specific name. That name isn't mentioned in the problem, but frontier models are passing that test anyway because they remember that such a helper function is necessary.

If the next stage of benchmarks doesn't address these issues, it will have the same problems, saturated or not.

> 93.6% (congrats Anthropic)

But the article says "We audited a 27.6% subset of the dataset that models often failed to solve [which is 19.1% of the problems at time of publication] and found that at least 59.4% of the audited problems have flawed test cases that reject functionally correct submissions".

0.191 * 0.594 ≈ 0.113, which is greater than 1 - 0.936 = 0.064.

Does this mean that the audited subset wasn't representative? Or that Anthropic is getting high answers through some shady means?
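To spell out the arithmetic, here's a back-of-the-envelope sketch in Python; it assumes the audited subset is representative of the often-failed problems, which is exactly the question being raised:

```python
# Figures quoted from the article above
often_failed = 0.191          # fraction of all problems models often failed to solve
flawed_among_audited = 0.594  # fraction of those with tests that reject correct fixes

# If those flawed tests reject functionally correct solutions, a model that
# doesn't already know the expected patch can't pass them, capping its score:
unsolvable = often_failed * flawed_among_audited   # ~0.113
ceiling = 1 - unsolvable                           # ~0.887

print(f"implied ceiling: {ceiling:.1%} vs. reported 93.6%")
```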

I suggest reading the Mythos report's discussion on SWE-bench and contamination. I think it's fairly convincing that you can account for contamination and still trust SWE-bench numbers on models that aren't over-optimized for it.

You can trust that a model that scores 40% is indeed worse than a model that scores 90%.

You can’t trust that a model that scores 93% is better at software engineering than a model that scores 90%, because at that point it’s impossible to distinguish between recall and reasoning.

It’s honestly far better to just ignore SWEBench Verified in 2026. Multiple labs have noted issues with contamination, and achieving high scores requires memorising what passes the prescriptive verifier, not what is a correct solution.

40% vs 90%? Sure.

70% vs 90%? _Absolutely meaningless_, because you are not measuring coding intelligence but “how well can the model cheat the flaws in SWEBench Verified”. The 70% model can certainly be better at coding, even assuming no deliberate benchmaxxing / foul play.

> models that aren't over-optimized for it.

But how do you know whether the model was over-optimized for it or just really good?

I disagree: https://www.philosophicalhacker.com/post/anthropic-error/

I don't understand that methodology in the first place. Does Anthropic even have a somewhat objective definition to measure and judge "memorization"? Is there any evidence that other LLMs are a viable tool for determining that?

This article says Anthropic models can write out the entire benchmark solution set word for word from memory.

There are more details under the "Too narrow and too wide tests" heading.

It would be interesting to see a deeper investigation into how the models are dealing with this and whether the successful ones were trained on the benchmark.

Those who fail to study history (or live through it) are doomed to repeat it.

SPECint and SPECfp went through this exact movie: benchmark, saturate, retire, replace, repeat. The treadmill is the product.

I don't have a solution, just noticing the pattern.

That's a slightly different problem. There's no such thing as saturation for a performance benchmark like SPEC; we can always conceive of a faster processor (even if we don't know how to build one). Saturation is the problem that once you are at (or near) a 100% pass rate on a test of pass/fail questions, there's no room for the score to keep going up, and the test has lost any power to discriminate between competing options.

However, both kinds of tests are susceptible to over-fitting: an LLM can be trained on the exact test questions, and a CPU can be designed with, e.g., branch predictors and cache sizes tuned specifically to handle a particular benchmark or workload.

Maybe OP was thinking about compilers "cracking" certain SPEC benchmarks: implementing exactly the optimization needed to boost a benchmark quite a lot, even though that optimization probably won't apply to any other code out there (usually it's so targeted, and so risky on general C/C++ code, that it's intentionally kept from firing on anything else). That happened a couple of times over the years; I know about the Intel compiler cases, for example. I can certainly see LLM providers adding tricks that help a certain class of benchmarks but don't help much with anything else.

Intel's done it again recently, this time targeting Geekbench: https://www.intel.com/content/www/us/en/support/articles/000...

Both that and the SPEC compiler shenanigans are cheating by changing the test, not just over-specializing the product being benchmarked.

Also, in the meantime, there's https://SWE-rebench.com as a nice riff on SWE-bench, as far as I understand.

SWE-bench is fantastic! IMO, the scrutiny is a byproduct of the adoption and success of the benchmark.

Both of them look pretty old?

Code Clash, I think, would be quite hard to game or contaminate unintentionally, considering that models need to compete against one another.

https://gertlabs.com already does this at scale.

An industry-standard benchmark shouldn't be hosted or designed by a lab producing the models, regardless.

I mean the data / benchmarks

> 1. SWE-bench Verified is now saturated at 93.9% (congrats Anthropic), but anyone who hasn't reached that number yet still has more room for growth.

But if some or all players are bench-maxing it, then it becomes a much less useful metric for comparison.

Also, this doesn't address what OpenAI says about the test suite disallowing valid solutions.

How hard is it to create one of these for my company, one that models most of the work we do?

Just point an agent at your LLM logs and ask it to generate a dataset of questions and answers from the problems you've solved already.
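A minimal sketch of that idea, assuming your logs are JSONL files with `prompt`/`response` fields and that `ask_agent` stands in for whatever agent/LLM call you already have (the field names and the placeholder are assumptions, not something the comment above specifies):

```python
import json
from pathlib import Path

def ask_agent(instruction: str) -> str:
    """Placeholder: call whatever agent/LLM endpoint you already use."""
    raise NotImplementedError

def build_dataset(log_dir: str, out_file: str) -> None:
    tasks = []
    for log_path in Path(log_dir).glob("*.jsonl"):
        for line in log_path.read_text().splitlines():
            record = json.loads(line)
            # Turn each solved problem from your logs into a benchmark item:
            # a self-contained task statement plus a reference answer to grade against.
            question = ask_agent(
                "Rewrite this solved problem as a self-contained task statement, "
                "without revealing the solution:\n" + record["prompt"]
            )
            tasks.append({"question": question, "reference": record["response"]})
    Path(out_file).write_text("\n".join(json.dumps(t) for t in tasks))
```

The main design choice is keeping the reference answer separate from the task statement, so the resulting set can be graded without leaking the solution into the prompt.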
