Results similar to mythos have been duplicated by weaker models.
Think it's more a care of mythos raising widespread awareness that tireless LLMs can be weaponized to dig through code and find that one tiny flaw nobody spotted
Results similar to mythos have been duplicated by weaker models.
Think it's more a care of mythos raising widespread awareness that tireless LLMs can be weaponized to dig through code and find that one tiny flaw nobody spotted
The report I saw kind of seemed to be pointing at a flaw and asking "do you see it?" which is not the same thing. I felt a pretty large difference between Opus 4.6's results and Mythos's, so I would be surprised if even weaker models did anywhere near as well. I'd like to see these results, if they are using a decent methodology.
Of course, even the reports with flawed methodology could be suggesting that a great harness + weak model might achieve a similar level of results as a mediocre harness + strong model. But I'd want to see solid evidence for that.