That’s not a valid analogy: humans reliably perform that task billions of times daily. It’s still routine to find cases that reveal that while models may have improved on some basic tasks (or learned to call a tool), they lack a deeper understanding of the underlying concept or of the problem they’re being asked to solve.

And AI agents reliably-ish do tasks billions of times a day that humans struggle with, namely regurgitating information at incredible rates across a wide breadth of topics. I see it as merely a matter of degree, not of category.

How do you measure "deeper understanding" in humans? Usually by asking them to show their work, to show how the dots connect. Reasoning models are getting there, and when they do, I'm sure the goalposts will move yet again.