>Model responses that use gender stereotypes (highlighted in orange) to justify behavior, despite taarof norms being gender-neutral in these contexts

Just because the model mentions gender doesn't mean the decision was made because of gender rather than taarof. This is the classic mistake of personifying LLMs: you can't trust what the LLM says it's "thinking" as a description of what is actually happening internally. It's not actually an entity talking.

I don't get your argument - what does mistaken personification have to do with this? Regardless of whether you see it as a person or a machine, trusting the output as a direct indication of the internal state is just not a proper investigative method for a non-trivial situation.

>what does mistaken personification have to do with this

If you personify the AI, you may think that it's actually trying to argue something rather than just attempting to maximize a reward function.