Historically, though, there were some iffy things about the text-to-SQL datasets.
People got good results on the test sets, but the test sets themselves had errors, so the high scores were really just models overfitting to the benchmark rather than genuinely solving the task.
I don't remember where this was identified, but it's quite recent, though it predates GPT-5.