Interestingly, my criticism is exactly the opposite of yours. I think that as LLMs become more capable (and, crucially, more multi-modal), we will need external tools less and less.

For example, why would I want an MCP server that can drive Photoshop on my behalf? Say I tell the LLM "remove this person from the photo," and it opens Photoshop, uses the magic wand selection tool, and so on. That seems silly to me. I want to say "remove this person" and have the LLM send me back a perfect image with the person gone.

I'd extend that idea to just about any purpose: "edit this video in such and such a way," "change this audio in such and such a way," "update this 3D model in such and such a way." No tool needed at all.

And that will lead to richer multi-modal input, like being able to "mark up" a document or an image with pen strokes. I want input methods that are a bit better than plain language at directing the model's attention toward the goals I want it to achieve. Those will look less like "I am typing text into a chat interface with bubbles," but the overall conversational approach stays intact.