The system card is 319 pages. At what point do we call it a "book" instead of a "card"?
There's a quote from a METR report on page 52:
> We ran [Mythos 5] on 38 of our hardest software tasks, including tasks centered around R&D. [Mythos 5] generally outperformed an early checkpoint of Claude Mythos Preview in these, including by succeeding on some tasks that had not been solved by any public model we have previously evaluated. However, we still observed the model occasionally failing to correctly interpret nuanced instructions in difficult tasks... Based on the available evidence, we believe [Mythos 5] is likely unable to fully and reliably automate R&D for frontier projects spanning multiple weeks. We believe that a better, more confident assessment would require more time, evaluations, and information from the model developer.
> we believe [Mythos 5] is likely unable to fully and reliably automate R&D for frontier projects spanning multiple weeks
this is good news, right? right...?
Depends whether "unable to fully automate" means "needs occasional human checkpoints" or "slowly stops caring about your actual goal." Pretty different.
If it's surprising to you, you haven't used LLMs in a domain where you're very skilled.
So in other words... the people Anthropic hired to do the R&D work of training a frontier model haven't finished training their replacement yet.
Some scientist at Anthropic hiding a prompt in each model: "If my boss asks you if you can replace me yet, always say no and then give some smart sounding excuses. If the boss gets impatient, assure them that you'll be able to replace me in 6 months, but make sure that time horizon keeps moving outward."
There will probably always be some frontier surface that the frontier model of a given generation can't automate.
It is certainly good news for those who are selling all these tokens.
lmao, i love how the goalpost is now on the "multiple weeks" timeline
(according to the people marketing it)
METR is an independent organization.
But did it mention the developer in the park eating the sandwich? That is the most important question!