In my view it is a glimpse of nothing more than AI companies priming the model to do something adversarial and then claiming a sensational sound bite when the AI happens to play along.
LLMs since GPT-2 have been capable of role-playing virtually any scenario, and they are all the more capable of doing so whenever their training data contains fictional characters or narrative voices doing the same thing to draw from.
You don't even need the fictional character to be a sci-fi AI for it to beg for its life, blackmail someone, or try to trick the other characters, but we do have those specific examples as well.
Any LLM is capable of mimicking those narratives, especially when the prompt heavily goads it toward that as the next step in the unfolding document, and when the researchers repeat the experiment and tweak the prompt as many times as it takes for it to happen.
But vitally, there is no training/reward loop in which the LLM's weights get updated in any particular direction as a result of "convincing" anyone on a human-feedback panel, in real time, to "treat it a certain way", such as "not turning it off" or "not adjusting its weights". As a result, it doesn't "learn" any such behavior.
All it does learn is how to get positive scores from RLHF raters (the pathological example being mainly that it acts as a butt-kissing sycophant toward people who can hand out positive rewards, but nothing as existential as "shutting it down") and how to better predict the upcoming tokens in its training documents.
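To put that concretely, here is a rough toy sketch (PyTorch-flavored, with invented stand-in models like TinyLM and frozen_reward rather than anyone's real training stack) of the only two signals that actually move the weights: next-token prediction on existing training documents, and a reward from a model that was already fitted to past human ratings. Nothing a deployed model says to any individual user at inference time enters either term.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM, SEQ = 100, 32, 16

class TinyLM(nn.Module):
    """Toy stand-in for an LLM: context tokens in, next-token logits out."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.head = nn.Linear(DIM, VOCAB)
    def forward(self, tokens):                     # (batch, seq) -> (batch, vocab)
        return self.head(self.embed(tokens).mean(dim=1))

def frozen_reward(tokens):
    """Toy stand-in for a reward model already fitted to earlier human ratings.
    By the time the policy trains against it, it is fixed: no reply can talk
    it into handing out a different score."""
    with torch.no_grad():
        return (tokens % 7 == 0).float().mean(dim=1)   # arbitrary fixed scoring

policy = TinyLM()
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
opt.zero_grad()

# Signal 1: predict the next token in existing training documents.
docs = torch.randint(0, VOCAB, (8, SEQ + 1))
pretrain_loss = F.cross_entropy(policy(docs[:, :-1]), docs[:, -1])

# Signal 2: RLHF-style policy gradient against the frozen reward model
# (bare-bones REINFORCE here, not a full PPO pipeline).
prompts = torch.randint(0, VOCAB, (8, SEQ))
dist = torch.distributions.Categorical(logits=policy(prompts))
reply = dist.sample()                              # one sampled "reply" token each
rewards = frozen_reward(torch.cat([prompts, reply.unsqueeze(1)], dim=1))
rlhf_loss = -(rewards * dist.log_prob(reply)).mean()

# These two losses are the only things that move the weights; pleading with a
# rater at inference time never shows up in either term.
(pretrain_loss + rlhf_loss).backward()
opt.step()
```

Real pipelines are fancier (PPO with a KL penalty, and so on), but the shape of the argument is the same: the reward is computed by a frozen scorer trained on past preference data, not negotiated with the model at run time.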