> The agent cannot learn from its mistakes. The agent will never produce any output which will help you invoke future agents more safely
That is not entirely true:
Given that more and more LLM providers are sneaking in "we'll train on your prompts now" opt-outs, you deleting your database (and the agent producing repentant output) can reduce the chance that it'll delete my database in the future.
Actually no, it will increase it. Because it’ll be trained with the deletion command as a valid output.
Exactly. It’s just giving the LLM a token pattern, and it’s designed to reproduce token patterns. That’s all it does. At some point, generating a token pattern like that again is literally its job.
Why would one set up reinforcement learning like that?
The point of creating samples from user data should surely be to label them good or bad, based on the whole conversation.
You look at what happened eventually, judge the outcome as bad, and thus train the "rm" token in the middle to be less likely.
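The mechanism described above can be sketched with a toy REINFORCE-style update. This is a minimal illustration, not how any provider actually trains: a single-step "policy" over a four-token vocabulary, where the whole conversation being judged bad assigns a negative reward to the sampled `rm` token, pushing its probability down. All names and numbers here are made up for the example.

```python
import numpy as np

# Toy vocabulary; "rm" stands in for the destructive command mid-stream.
vocab = ["ls", "cat", "rm", "echo"]
logits = np.zeros(len(vocab))  # start from a uniform policy

def probs(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def policy_gradient_step(logits, token_idx, reward, lr=0.5):
    # REINFORCE: grad of log pi(token) w.r.t. logits is one_hot(token) - probs.
    p = probs(logits)
    grad = -p
    grad[token_idx] += 1.0
    return logits + lr * reward * grad

before = probs(logits)[vocab.index("rm")]
# The eventual outcome was judged bad, so the "rm" emitted along the way
# receives a negative reward:
logits = policy_gradient_step(logits, vocab.index("rm"), reward=-1.0)
after = probs(logits)[vocab.index("rm")]
print(before, after)  # probability of "rm" drops after the update
```

The point of the sketch is the sign of the reward: the same sample can make the deletion command more or less likely depending on how the outcome is labelled.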
It is possible, but it requires specifically labelling the data. You have to craft question–response pairs to label. And even then the result is only probabilistic.
The LLM in this case had been very thoroughly trained and instructed quite specifically not to do many of the things it actually then went off and did.
It may be that there's a kind of cascade effect going on here. Possibly once the LLM breaks one rule it's supposed to follow, this sets it off on a pattern of rule violations. After all, what constitutes a rule violation is there in the training set: it is a type of token stream the LLM has been trained on. It could be that the LLM switches into a kind of black-hat mode once it's violated a protocol, and that this leads it down a path of persistently violating protocols. And given that it's a statistical model, some violations of protocol are always possible.
My mother was a primary school teacher. She used to say that the worst thing you can say to a bunch of kids leaving class down the hall is "don't run in the hall". It puts it in their minds. You need to say "please walk in the hall", and then they'll do it.