I didn't ask it to role-play; in this case it's just hallucinating heavily. A model being wrong doesn't mean it's role-playing. In fact, 3.5 Sonnet responded correctly, which is the expected behavior, so there's not much of a defense for GPT-4o here.