Good point. They said they validated the results by testing with other models (including Claude), as well as with manual sanity checks.

55% to 45% definitely isn't a blowout, but it is meaningful: in Elo terms it equates to about a 36-point difference. So not in a different league, but definitely a clear edge.
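
For anyone who wants to check the conversion, here's a quick sketch. It assumes the standard logistic Elo model (400-point scale) and treats the 55% as an expected score with no ties; `elo_diff` is just a name I made up for illustration. Under those assumptions it comes out a touch under 35, so the exact figure depends on which conversion you use, but it lands in the same ballpark either way.

```python
import math


def elo_diff(win_rate: float) -> float:
    """Elo gap implied by an expected score, assuming the logistic Elo model."""
    if not 0.0 < win_rate < 1.0:
        raise ValueError("win_rate must be strictly between 0 and 1")
    # Invert E = 1 / (1 + 10 ** (-d / 400)) to solve for the rating gap d.
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))


print(round(elo_diff(0.55), 1))  # 34.9
```

Going the other way, a gap of that size means the stronger side is expected to win roughly 11 out of every 20 head-to-head comparisons.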