What’s the point of this if they didn’t use temperature=0 for every model (they didn’t)?

They could have rerun the test against the same model and gotten different answers. It’s almost like picking two different coins and comparing their lists of flip results. (I realize it’s not that straightforward, since it’s not 50/50, but it’s essentially the same issue.)
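
To make the coin analogy concrete, here is a rough sketch in Python. Everything in it is an assumption for illustration: the 70% per-question accuracy, the 100 questions, and the idea that each answer is an independent weighted coin flip are all made up, and a real model sampled at temperature > 0 won't behave exactly like this. The function names are hypothetical too. It only shows that the same "model" scored repeatedly doesn't land on the same number.

```python
import random

def run_benchmark(p_correct, n_questions, rng):
    """Score one run: each answer is right with probability p_correct (assumed)."""
    return sum(rng.random() < p_correct for _ in range(n_questions))

def score_spread(p_correct, n_questions, n_runs, seed=0):
    """Repeat the same benchmark n_runs times and return the (min, max) score."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    scores = [run_benchmark(p_correct, n_questions, rng) for _ in range(n_runs)]
    return min(scores), max(scores)

if __name__ == "__main__":
    # Hypothetical numbers: one "model", 70% accurate, 100 questions, 20 reruns.
    low, high = score_spread(p_correct=0.7, n_questions=100, n_runs=20)
    print(f"same model, 20 runs: scores ranged from {low} to {high} out of 100")
```

If the gap between two models in a single-run comparison is smaller than that range, the comparison can't tell you much about which one is better.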