Hacker News

stingraycharles 2 days ago

“no harness at all” might be an issue, though, as these types of benchmarks are often gamed, and models then score well on them without actually being better models.