But there is a distinction we can make between flowers and wasps. If there is no distinction we can make between Schwartz and non-Schwartz, then we are susceptible to the sample problem with or without AI. And if there is a distinction then we can use that distinction to test Bob, and make him learn from his test failures.

Sure.

But the whole point is that there is a significant difference between Schwartz and non-Schwartz, one that only turns up after they start working for real, producing new work rather than rehashing established material, and that takes years to detect. By that time, Bob's forty.

It isn't a "sample problem"; it's a process problem. By perpetually raising the stakes and focusing on metrics (e.g. grades and number of publications for students, graduation rates for schools), we've created and fallen into a Poe's law trap. Adding a new metric isn't likely to help.

What might help? Making the metrics harder to game (e.g. oral exams, early and often) and more discerning (grade deflation), moving the wrong-track consequences earlier (start holding people back in grade school, make failing to graduate high school easier, make getting into college harder, etc.), and changing the cash-cow funding models to remove the perverse incentives.

We aren't likely to do any of these things.