The test is rigged because they used non-thinking models.

These are reasoning/thinking models.

Source?