Hacker News

> As is so often the case with this kind of study, it's an evaluation of the prompt and harness used by the study in addition to being an evaluation of the underlying models.

That is also a problem with every real-world use of the models: the prompt and harness shape the results there too.