Nobody releases numbers that show them to be worse than competitors lol.

This applies even to OpenAI & Anthropic, who often don't eval on the same datasets.