There is such a dissonance between all this talk of safety and the tendency for models to, without any prompting, do very dodgy things to achieve their goal when presented with barriers.

Luckily in my experience it usually ends up only doing it to achieve the task set to it as opposed to anything "malicious", but boy it is scary reading back at how quickly the chain-of-thought pivots to attempts at privilege escalation or searching your disk for secrets when a tool doesn't work.

The other day codex 5.5 was trying to debug my app, asked for accessibility to navigate the app and take screenshots. Instead first thing it did was use the codex app to create a new project rooted in my home directory.

I was like damn, is this common?

Especially if thinking is hidden now. No way to know if the model plotted against you until it’s too late.

[dead]