I think 'nopinsight' and the paper are arguing that the drop is roughly 10%, not that the final score is 10%. For example, DeepSeek-R1 fell from 96.30 to 85.19, about 11 points. Are you actually arguing that a child guessing randomly would be able to score the same, or was this a misunderstanding?