They seem… much better than all the models they compared against? What’s the catch?

They only showed the benchmarks where they outperformed?

It's twice the size?