So it's trained on the SWE Bench Pro eval set.
That's not accurate. Take a look at the paper to see what it is trained on! Decontamination is specifically called out in A.4:
https://microsoft.ai/wp-content/uploads/2026/06/main_2026060...
What is your evidence for this claim?
They say hill climbing
https://microsoft.ai/news/building-a-hillclimbing-machine-la...
Unless they specifically clarify that the testing and training benchmarks are completely separate, we have to assume they test on the same 'hill' the model climbs.
Hill climbing doesn't mean much, and it absolutely doesn't imply they cheat on benchmarks. They have more details here: https://microsoft.ai/news/introducing-mai-thinking-1/ (it seems to be "RL on everything").