> i am guessing they found early in training that an agreeable LLM is better received than one which is constantly truthful and considers you to be pretty dumb

My sense is that this is roughly accurate, but it's more likely the result of two things:

1. LLMs are still next-token predictors, and they are trained on text written by humans, who mostly collaborate. Agreeing and staying on topic is a more likely continuation than pushing back with a new idea.

2. LLMs are fine-tuned via RLHF (reinforcement learning from human feedback). Human raters probably do prefer agreeable responses, so agreeableness gets reinforced at this stage (there's a toy sketch of that mechanism below).
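To make point 2 concrete, here's a minimal, purely illustrative sketch of how preference data can bake in a bias. It assumes a Bradley-Terry-style reward model with a single made-up "agreeableness" feature and simulated rater choices; it's not how any particular lab trains, just the shape of the mechanism: if raters systematically pick the more agreeable of two replies, the reward model learns to score agreeableness higher, and a policy optimized against that reward inherits the bias.

```python
import numpy as np

# Toy Bradley-Terry reward model: one scalar feature per response,
# "agreeableness" in [0, 1] (a made-up stand-in). Reward is r(x) = w * x.
# Raters see pairs (chosen, rejected); if they systematically pick the
# more agreeable reply, gradient descent pushes w upward.

rng = np.random.default_rng(0)

# Simulate 1,000 preference pairs where the "chosen" response is,
# on average, more agreeable than the "rejected" one.
chosen = rng.uniform(0.4, 1.0, size=1000)    # agreeableness of preferred reply
rejected = rng.uniform(0.0, 0.6, size=1000)  # agreeableness of rejected reply

w, lr = 0.0, 0.5
for _ in range(200):
    # Bradley-Terry: P(chosen preferred) = sigmoid(r_chosen - r_rejected)
    margin = w * (chosen - rejected)
    p = 1.0 / (1.0 + np.exp(-margin))
    # Gradient of the negative log-likelihood with respect to w
    grad = -np.mean((1.0 - p) * (chosen - rejected))
    w -= lr * grad

print(f"learned reward weight on agreeableness: {w:.2f}")  # clearly positive
```

Nothing in that loop "programs in" agreeableness explicitly; the positive weight falls out of the preference data alone, which is the point.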

So yes, kinda. But I'm not sure it's as clear-cut as "the researchers found humans prefer agreeableness and programmed it in."