Hacker News

The report I saw kind of seemed to be pointing at a flaw and asking "do you see it?" which is not the same thing. I felt a pretty large difference between Opus 4.6's results and Mythos's, so I would be surprised if even weaker models did anywhere near as well. I'd like to see these results, if they are using a decent methodology.

Of course, even the reports with flawed methodology could be suggesting that a great harness + weak model might achieve a similar level of results as a mediocre harness + strong model. But I'd want to see solid evidence for that.