I mean, they train their model on their own training data. So by definition it should score well on their own benchmark.