Anthropic’s research makes the case that role-playing is inherent to how the models work. Communication implies a sender. Language implies a writer, and the models learn these roles implicitly during training. RLHF is meant to strengthen the attractor to the Assistant persona.

https://www.anthropic.com/research/persona-selection-model

https://www.anthropic.com/research/assistant-axis

https://www.anthropic.com/research/emergent-misalignment-rew...

https://www.anthropic.com/research/emotion-concepts-function

RLHF very much does do that. My take is that RLHF as a mechanism ought to be avoided altogether, and even the selection of the assistant attractor basin is suspect. If I am exploring a problem space, I don't want to hire Igor to explore it with me. It's more helpful to have a colleague role who will jump in and say "nah, that's dumb, what if we throw out that whole thing and try a completely different angle instead".