The amount of faith a person has in LLMs getting us to e.g. AGI is a good implicit test of how much they (incorrectly) think most thinking is linguistic (and, to some degree, conscious).

Or at least, this is the case if we mean LLM in the classic sense, where the "language" in the middle L refers to natural language. Also note GP carefully mentioned the importance of multimodality, which, if you include e.g. images, audio, and video, starts to look much closer to the majority of the kinds of inputs humans learn from. LLMs can't go too far, for sure, but VLMs could conceivably go much, much farther.

And the latest large models are predominantly LMMs (large multimodal models).

Sort of, but the images, video, and audio they have available are far more limited in range and depth than the textual sources, and it also isn't clear that most LLM textual outputs actually draw much on anything learned from these other modalities. Most VLM setups work the other way around, using textual information to augment their vision capabilities. Further, most aren't truly multimodal: they just use separate backbones to handle the different modalities, or are even distinct models switched between by a broader dispatch model. There are exceptions, of course, but it is still an accurate generalization today that the multimodality of these models is largely one-way and limited.
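To make the distinction concrete, here is a minimal sketch of the two patterns described above: a dispatch setup that routes each input to a separate single-modality model, versus a unified model whose per-modality encoder backbones feed one shared core. All class and parameter names here are hypothetical illustrations, not any real library's API.

```python
from dataclasses import dataclass

@dataclass
class Input:
    modality: str   # "text", "image", or "audio"
    data: object


class DispatchSetup:
    """Separate models per modality; a router just picks one. No shared representation."""
    def __init__(self, text_model, vision_model, audio_model):
        self.models = {"text": text_model, "image": vision_model, "audio": audio_model}

    def respond(self, inp: Input):
        # Each answer draws only on the one modality it was routed to.
        return self.models[inp.modality](inp.data)


class UnifiedMultimodal:
    """Per-modality encoder backbones projected into one token space, fused by a shared core."""
    def __init__(self, encoders, shared_core):
        self.encoders = encoders        # e.g. {"text": ..., "image": ..., "audio": ...}
        self.shared_core = shared_core  # a single model over all embedded tokens

    def respond(self, inputs):
        tokens = []
        for inp in inputs:
            # Embed every input, whatever its modality, into the same token space.
            tokens.extend(self.encoders[inp.modality](inp.data))
        # Only here can the output actually draw on all modalities at once.
        return self.shared_core(tokens)
```

The point of the sketch is just that in the dispatch case the text-generating path never sees what the vision or audio models learned, which is the "one-way and limited" multimodality described above.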

So right now the limitation is that an LMM is probably not trained on any images or audio that are going to be helpful for stuff outside specific tasks. E.g. I'm sure years of recorded customer service calls might make LMMs good at replacing a lot of call-centre work, but the relative absence of e.g. unedited videos of people cooking means that LMMs just fall back to mostly text when it comes to providing cooking advice (and this is why they so often fail here).

But yes, that's why the modality caveat is so important. We're still nowhere close to the ceiling for LMMs.