If that's the argument, though, then current AI isn't just autocomplete - because we could reasonably show an AI an image or a video and have it call a tool rather than return text. That would be comparable to a pre-language human.

I'm not seeing the comparison, because what you're describing is not at all an internal or emergent process. Without a human there to hand-feed the AI access to create these tools, build the interface, and then hand-feed all of it to the LLM, none of this happens. This is like Kubrick's monolith, bumped up a few orders of magnitude.