Would be cool to have a benchmark with actually unsolved math and science questions, although I suspect models are still quite a long way from that level.
Does folding a protein count? How about increasing performance at Go?
"Optimize this extremely nontrivial algorithm" would work. But unless the provided solution is novel, you can never be certain there wasn't data leakage. And anyway, at that point you're pretty obviously testing for superintelligence.
It's worth noting that neither of those was accomplished by an LLM.