Grandparent testimony of success, & parent testimony of frustration, are both just wispy random gossip when they don't specify which LLMs delivered the reported experiences.

The quality varies wildly across models & versions.

With humans, the statements "my tutor was great" and "my tutor was awful" reflect very little on "tutoring" in general, and are barely even responses to each other without more specificity about the quality of the tutor involved.

Same with AI models.

Latest OpenAI and latest Gemini models. I also tried the latest Llama, but I didn’t expect much there.

I have no access to Anthropic right now to compare that.

It’s an ongoing problem in my experience.