How much is it because they were still too early?

Would they do significantly better with a model like Claude 4 than I’m guessing something worse than GPT3.5?