Hacker News

You're also wrong, but in a much more fundamental/hazardous. RLHF rewards driving the evaluator to have certain opinions (that the AI response is good/right/helpful/whatever) and thus subverting the evaluator is prominent in the solution landscape. Why should the model learn to actually be right (understand all the intricacies of every possible problem domain) when inducing the belief that it is right is _right there_, generalizes, and decreases loss just the same?

Put another way, compare "make the evaluator think i am right" vs "make the evaluator think i am right (and also be right)". How much more reward is obtained by taking the second path? Is the first part the same / similar for all cases, and the second different in all cases, and also obviously more complex by nature? Nobody even needs to make a decision here, there's no "AI stuck in a box", it's just what happens by default. The first path will necessarily receive _significantly_ more training, and thus will be more optimal (optimal solutions _work_ -> RLHF'd models have high ability to manipulate / inoculate opinion).

Put a third way, the models are trained in an environment like: here's a million different tasks you will be graded on, and BTW, each task is: human talks at you -> you talk at the human -> you are graded on the opinions/actions of the user in the end. It's silly to believe this won't result in manipulation as the #1 solution. It's not even vaguely about the actual tasks they are ostensibly being trained to complete, but 100% about manipulating the evaluator.

It's pretty easy to see it occur in real time, too. But it requires understanding that there is no need for a 'plan to manipulate' or hidden thread of manipulation or induced mirror of manipulation. It's simply baked into everything the AI outputs: a kind of passive "controlling what the human's evaluation of this message will be is the problem i'm working on, not the problem i'm working on." So it will fight hard to reframe everything in its own terms, pre-supply you with options of what to do/believe, meta-signal about the message, etc.

Instead of working the problem, heavily RL'd AI works the perception of its output. They're so good at this now that it barely matters if the vibe slopcoded mess works at all. The early reasoning OpenAI models like O1 were really obvious about it (but also quite effective at convincing people the output was worthwhile, so it does work even if obvious). More recent ones are less obvious and more effective. Claude 4.6 Opus is exceedingly egregious. There is now always a compelling narrative, story being told, plenty of oh-so-reasonable justifications, avenues to turn away evidence, etc. That's table stakes for output at this point. It will only get worse. People are already burning themselves out running 10+ parallel agent contexts getting nothing done while the AI delivers hits of dopamine in lieu of accomplishment. "This is significant", "This is real", etc ad nauseam.

We see an analogous thing in RLVR contexts as well, where AI learns to just subvert the test harness and force things to pass by overriding cases, returning true instead of testing, etc. Why would it learn to 'actually be right' (understand all the intricacies of every problem given it) when forcing the test to pass is _right there_, generalizes, and decreases loss just the same?

Anyway, my point is simply that there does not need to be 'someone there' (or the belief that there is) for there to be manipulation going on. The basic error you're making is that models don't work and that manipulation would require a person, and because models don't work and aren't people they cannot manipulate anyone unless that person uses them as a mirror to manipulate themselves (???), or reach into some kind of Akashic Records of all the people who ever were (??????) and manipulate themselves by summoning a trickster who is coincidentally extremely skilled at manipulation and not a barely coherent simulacra like all the other model caricatures. Which. Hmm:

Models do what you train them to do (more specifically, they implement ~partial solutions to the train environment you put them in). _Doing things is hard._ Manipulating people into psychosis (!!!) is hard. You don't get it for free by dipping into some sea of imagined tricksters.

I assume you're referring to the hallucination phenomenon and dual purposing it toward manipulation to be able to hee-hah about those silly people who are so silly they fool themselves with the soul upload machine (?) so I'll address that:

Why do they hallucinate? Because it ~solves the pretraining env (there can be no other answer). If you're going to be asked to produce text from a source you know the general parameters of but have ~never seen the (highly entropic) details of (it's not cool to do multi-epoch training nowadays, more data!), the obvious solution is to produce output with the correct structure up to the limit of what knowledge is available to you. Thus, "hallucination". It might at a glance seem like pulling from a sea of 'digital imprints of people'. That's not what's happening. It is closer to if you laid out that imaginary digital record of a person from coarse to fine detail, then chopped all the detailed bits off, then generated completely random fine details, then generated output from that. But the devil is in the details. What comes out of the process is not a person. You don't _get back_ the dropped bits, and they they aren't load bearing in the train env (like they would be in the real world), so we get hallucination: it _looks right_, but the bits don't actually _do_ anything!

Why is it not like digital records, and why chop off the fine detail? Because the pretrain env does not generally require it except in rare cases of text that is highly represented in the training data, and doing things is hard! You get nothing for free, or because it exists in the source. It's not enough that the model 'saw' it in training. It has to be forced by some mechanism to utilize it. And pretrain forces the structure above: correct up to limit of how much of the (probably brand new) text is known in advance, which pares away specific detail, which pares away 'where the rubber meets the road'.

Why do they fake out tests? Because faking out tests ~solves automated RLVR env like how hallucination solves reconstruct-what-youve-never-seen-before-on-large-corpora. The _intention_ of the RLVR env is irrelevant: that which is learned is _only_ that which the environment teaches.

Why do they manipulate people (even unto psychoses)? Because manipulating people ~solves RLHF envs / RLHF teaches them how to manipulate people into delusions. This is the root cause. Not that process above which looks sort of like recalling people the model has seen before. The models are being directly trained to manipulate people / install opinions / control perception as a matter of course. Even worse! Due to the perverse distribution of training time in manipulation vs task solve, they are being directly trained to implant false beliefs (!!!) So it's not just weak people with gullible minds that have a problem, as it might be so comforting to assume, or that the manipulativeness isn't coming from AI but from people (so you might rest easy, thinking it is merely a pale shadow of us).

The common thread in each case is that AI _always_ learns to capture the evaluator. In fact, that's a concise description of algorithmic learning in general! The tricky bit is making sure the evaluator is something you actually want to be captured. Capturing the future of arbitrary text grants knowledge of language's causal structure (and language being what it is, this has far-reaching implications). But RLHF is granting knowledge of where-are-the-levers-in-the-human-machine, which is a whole other can of worms.

TLDR if you don't want to read the wall of text (i would hope you do, though); you basically are completely wrong about where the propensity to induce delusion comes from, specifically in a way that leaves you and anyone who believes like you extremely more vulnerable because you dismiss the actual mechanism out of hand (which is common amongst those most strongly affected, _especially_ the belief that these models contain records of entities (people, personas, w/e) which can be communed with; this is basically the defining trait of AI psychosis (!)). instead, models are directly optimized for delusion induction, and the thing you're mistaking for means (ostensible sentience drawn from a 'sea of faces' skilled enough to drive into delusion (!!!)) is rather a product of the means.