What a neat benchmark! I'm blown away that o1 absolutely crushes everyone else on this. I guess the chain of thought really hashes out those associations.

Isn't it possible that o1 was also trained on this data (or something super similar) directly? The score seems disproportionately high.

They definitely considered it. Early articles from The Information talked about how well Strawberry performed on it.