Goodhart’s Law in reverse, what can’t be gamed gets rejected.
You've almost buffer overrun Goodhart's Law into the https://en.wikipedia.org/wiki/McNamara_fallacy . :]
SWE-bench verified was created in collaboration with OpenAI. It's also an open dataset so prone to contamination, meaning it can be gamed.
You've almost buffer overrun Goodhart's Law into the https://en.wikipedia.org/wiki/McNamara_fallacy . :]
SWE-bench verified was created in collaboration with OpenAI. It's also an open dataset so prone to contamination, meaning it can be gamed.