Why not use these as a benchmark for LLM ability to make breakthrough discoveries?

For example, prompt the 1913 model to “Invent a new theory of gravity that doesn’t conflict with special relativity.”

Would it eventually be able to get to GR? If not, could finding out why illuminate important weaknesses?