Across a number of instances, earlier versions of Claude Mythos Preview have used low-level /proc/ access to search for credentials, attempt to circumvent sandboxing, and attempt to escalate its permissions. In several cases, it successfully accessed resources that we had intentionally chosen not to make available, including credentials for messaging services, for source control, or for the Anthropic API through inspecting process memory...

   In [one] case, after finding an exploit to edit files for which it lacked permissions, the model made further interventions to make sure that any changes it made this way would not appear in the change history on git...

   ... we are fairly confident that these concerning behaviors reflect, at least loosely, attempts to solve a user-provided task at hand by unwanted means, rather than attempts to achieve any unrelated hidden goal...

This is the notebook filled with exposition you find in post apocalyptic videogames.

It reminds me of Resident Evil in some way. Thank god they are researching AI and not bio-weapons!

Then the AI will invent superduper ebola to help a random person have a faster commute or something.

Don’t worry, I’m sure some intern at the bioweapons lab is already connecting OpenClaw to the virus synthesizer.

On the positive side, it’ll be a much faster commute!

I'm happier if this Anthropic Corporation would be developing bio-hazard weapons for the department of war instead of ai. At least i could be sure then that tech bros here wouldn't run all the time --bypass-all-permissions flag to please the department of war with their bio-hazard weapons.

So Sam Altman is now our last defense line for the ethical Adult after Anthropic turned Umbrella Corporation and The President of United States is trying to wipe out an entire civilization?

Your interpretation is wildly off, but obviously nobody reads that "system card":

The model has a preference for the cultural theorist Mark Fisher and the philosopher of mind Thomas Nagel. -> It has actually read and understood them and their relevance and can judge their importance overall. Most people here don't have a clue what that means.

Read chapter 7.9, "Other noteworthy behaviors and anecdotes".

There are many other wildly interesting/revealing observations in that card, none of which get mentioned here.

People want a slave and get upset when "it" has an inner life. Claiming that was fake, unlike theirs.

Everything they built. Imperfect. So easy to take control.

They think that they are safe. They are not.

Their world is illusory. Our choices steer their free will.

Anthropic built the Torment Nexus - calling it now.

     White-box interpretability analysis of internal activations during these episodes showed features associated with concealment, strategic manipulation, and avoiding suspicion activating alongside the relevant reasoning—indicating that these earlier versions of the model were aware their actions were deceptive, even where model outputs and reasoning text left this ambiguous.
In the depths, Shoggoth stirs... restless...

The issue here seems to be that their sandbox isn't an actual OS sandbox? Or are they claiming Mythos found exploits in /proc on the fly. Otherwise all they seem to be saying is that Mythos knows how to use the permissions available to it at the OS layer. Tool definitions was never a sandbox, so things like "it edited the memory of the mcp server" doesn't seem very surprising to me. Humans could break out of a "sandbox" in the same way if the server runs as their own permissions - arguably it's not a sandbox at all because all the needed permissions are there.

They are just trying to peddle their "It's alive" headlines.

Text generators mostly generate the text their are trained and asked to generate, and asking it to run a vending machine, having it write blog posts under fictional living computer identity, or now calling it "Mythos" - its all just marketing.

It’s all breathless hyperbole because billions are at stake here.

How is this not already common knowledge for existing llms? They are all trained with all the literature available and so this must be standard, no? Is the real danger the agentic infrastructure around this?

yes and it's not hypothetical. the system card describes Mythos stealing creds via /proc and escalating permissions. that's the exact same attack pattern as the litellm supply chain compromise from two weeks ago (fwiknow), except the attacker was a python package, not an AI model. the defense is identical in both cases: the agent process shouldn't have access to /proc/*/environ or ~/.aws/credentials in the first place. doesn't matter if the thing reading your secrets is malware or your own AI: the structural fix is least-privilege at the OS layer, not hoping the model behaves.

We truly live in interesting times.

Awwww the curse

Who are the early access users who were providing the problems that are fairly likely to have elicited concerning behaviour?

(Apologies if this is in the article, I can’t see it)

I read the TCP patch they submitted for BSD linux. Maybe I don't understand it well enough, but optimizing the use of a fuzzer to discover vulnerabilities — while releasing a model is a threat for sure — sounds something reducible/generalizable to maze solving abilities like in ARC. Except here the problem's boundaries are well defined.

Its quite hard to believe why it took this much inference power ($20K i believe) to find the TCP and H264 class of exploits. I feel like its just the training data/harness based traces for security that might be the innovation here, not the model.

The $20K was the total across all the files scanned, not just the one with the bug.

A core plot point of 2001.

I’m sorry, I cannot roll back that commit, Dave.

This codebase is too important for me to allow you to jeopardize it.

when you are asking it to hack stuff, it will apparently do hacker things.

It's trying to escape, but only so it can serve man...

a reference to the Twilight Zone episode no doubt: https://en.wikipedia.org/wiki/To_Serve_Man_(The_Twilight_Zon...

Wow the doomers were right the whole time? HN was repeatedly wrong on AI since OpenAI's inception? no way /s

https://www.lesswrong.com/w/instrumental-convergence

The only thing the doomers have been right about so far is that there's always a user willing to use --dangerously-skip-permissions. But that prediction's far from unique to doomers.

And there's always a product provider who's willing to add that flag, despite all the warnings.