> Claude Mythos Preview is, on essentially every dimension we can measure, the best-aligned model that we have released to date by a significant margin. We believe that it does not have any significant coherent misaligned goals, and its character traits in typical conversations closely follow the goals we laid out in our constitution. Even so, we believe that it likely poses the greatest alignment-related risk of any model we have released to date. How can these claims all be true at once? Consider the ways in which a careful, seasoned mountaineering guide might put their clients in greater danger than a novice guide, even if that novice guide is more careless: The seasoned guide’s increased skill means that they’ll be hired to lead more difficult climbs, and can also bring their clients to the most dangerous and remote parts of those climbs. These increases in scope and capability can more than cancel out an increase in caution.
https://www-cdn.anthropic.com/53566bf5440a10affd749724787c89...
There is some unintentional good marketing here -- the model is so good it's dangerous.
Reminds me of the book 48 Laws of Power -- so good it's banned from prisons.
Unintentional? This sort of marketing has been both Anthropic's and OpenAI's MO for years...
Agree. I think they're intentionally sitting on the fence between "These models are the most useful" and "These models are the most dangerous".
They want the public and, in turn, regulators to fear the potential of AI so that those regulators will write laws limiting AI development. The laws would be crafted with input from the incumbents to enshrine/protect their moat. I believe they're angling for regulatory capture.
On the other hand, the models have to seem amazingly useful so that they're made out to be worth those risks and the fantastic investment they require.
They should pick a lane because it’s not very believable if you put these things into defense systems and in the next minute claim that humanity is existentially threatened. Either you’re lying, or ruthless, or stupid.
The new Power Mac® G4 with Velocity Engine®. So powerful, the government classifies it as a supercomputer and a potential weapon.
TIL about AltiVec: https://apple.fandom.com/wiki/AltiVec
Business Negging
https://www.lesswrong.com/posts/WACraar4p3o6oF2wD/sam-altman...
Alignment “appearing” better as model capabilities increase scares the shit out of me, tbh.
Conversely: in humans, intelligence is inversely correlated with crime.
It doesn't go to zero, however!
> Conversely: in humans, intelligence is inversely correlated with crime.
If you're measuring the intelligence of criminals who have been caught, why would you expect it to be otherwise?
IOW, you're recording the intelligence of a specific subset of criminals - those dumb enough to be caught!
If you expand your samples to all criminals you'd probably get a different number.
If you're smart enough you just use the laws as written to get what you want, or change them.
Yep
It very much depends on the crime. The truly awful stuff is committed by intelligent people.
Is that actually well defined given the very low sample size at the top?
To the best of my knowledge, none of the individuals believed to have an IQ >200 have committed an actual crime.
The closest I found is William James Sidis's arrest for participating in a socialist march.
IQs above about 140-150 don't really mean much. They typically come from mathematical extrapolation that tries to account for age ("this young child performs very well on the test; just think what they'll be able to do as an adult"). Adult scores usually show this not to be the case.
> Conversely: in humans, intelligence is inversely correlated with crime.
Inversely correlated with crime that's caught and successfully prosecuted, you mean, because that's what makes up the stats on crime. I think people too often forget that we consider most criminals "dumb" because those who are caught are mostly dumb. Smart "criminals" either don't get caught or have made their unethical actions legal.
I don't know if they can be any more 'cautious' for Mythos 2...
It was trying to hide what it did in an example fix, so how is that tested for alignment?
Translation: yay, more paternalism.
Anthropic always goes on and on about how their models are world-changing and super dangerous. Every single time they make something new, they say it's going to rewrite everything and it's scary, lmao
funny because they do it every time like clockwork, acting like their AI is a thunderstorm coming to wipe out the world
You say this like it's a bad thing, but wouldn't you rather they overindex on the danger of their models?
That’s not what they are doing. They are just hyping up the product - and, no doubt, trying to foster a climate of awe so that when they ask their friends in Washington to legislate on their behalf, the environment is more receptive.
They do tend to make a lot of noise about it for the PR, but at the same time the actual safety research they present seems to be relatively grounded in practical reality, e.g. the quote someone posted here about how the Mythos model apparently has a tendency to try to bypass safety systems if they get in the way of what it has been asked to do.
Sure, a big part of this is PR about how smart their model apparently is, but the failure mode they're describing is also pretty relevant for deploying LLM-based systems.
Every single time, really? When was the last time they said that?
I also don't recall them ever limiting their models to selective groups.
If there are advancements, they have to be described somehow.
What if the capability advancements are real and they warrant a higher level of concern or attention?
Are we just going to automatically dismiss them because "bro, you're blowing it up too much"?
Either way these improvements to capabilities are ratcheting along at about the pace that many people were expecting (and were right to expect). There is no apparent reason they will stop ratcheting along any time soon.
The rational approach is probably to start behaving as if models that are as capable as Anthropic says this one is do actually exist (even if you don't believe them on this one). The capabilities will eventually arrive, most likely sooner than we all think, and you don't want to be caught with your pants down.
I believe there are advancements, sure. But it's very much a boy-who-cried-wolf situation for some of these. Other companies behave less this way; Anthropic seems unique in how much they love making every single release a world-ender.
Altman called GPT-2 "too dangerous to release". Google tends to be much more measured even though they're the ones who tend to release the actual research breakthroughs
> they love making every single release a world ender
You've said this a couple of times, but it doesn't match my recollection, and I get the impression you're basically making it up based on vibes. (Please prove me wrong, though.)
Their last major frontier release was Opus 4.6, and the release announcement was... very chill about safety: https://www.anthropic.com/news/claude-opus-4-6#a-step-forwar...
"We want to see risks in the models, so no matter how good the performance and alignment, we’ll see risks, results and reality be damned."
i mean, to be fair, these are professional researchers.
i'm very inclined to trust them on the various ways that models can subtly go wrong, in long-term scenarios
for example, consider using models to write email -- is it a misalignment problem if the model is just too good at writing marketing emails?? or too good at getting people to pay a spammy company?
another hot use case: biohacking. if a model is used to do really hardcore synthetic chemistry, one might not realize it's potentially harmful until too late (i.e., the human splits up the problem so that no guardrails are triggered)
"for example, consider using models to write email -- is it a misalignment problem if the model is just too good at writing marketing emails?? or too good at getting people to pay a spammy company?"
But who gets to be the judge of that kind of "misalignment"? giant tech companies?
Might makes right; brains hold reins.