For a model that claims to focus on many languages, it's quite unreliable when it comes to simple questions like "how to say X in language Y" or "how to conjugate verb X in language Y". It keeps hallucinating words that do not exist, and when corrected, it just hallucinates a different one.
It probably doesn't know which language each set of words belongs to.
I doubt they are including a lot of training data labeled with the language.
Answering "how to say X in language Y" is a different task from actually saying X in language Y.
Actually, it isn't all that different. There are only two words separating "how to say X in language Y" from "say X in language Y". And this "vulgar" metric is actually quite relevant for an LLM, which answers based on conversational context.