The point is that each question is something a specialist in the field could answer, but would consider challenging enough that being able to solve it implies significant general usefulness in that domain.

I mean, they could just feed the solutions into the training data. Then suddenly the bot would do really well on HLE.