This essay is missing the words “cause” and “causal”. There is a difference between discovering causes and fitting curves. The search for causes guides the design of experiments and, with luck, the derivation of formulae that describe the causes. Norvig seems to be mistaking the map (data, models) for the territory (causal reality).

A related* essay (2010) by a statistician on the goals of statistical modelling that I've been procrastinating on:

https://www.stat.berkeley.edu/~aldous/157/Papers/shmueli.pdf

To Explain Or To Predict?

Nice quote

> We note that the practice in applied research of concluding that a model with a higher predictive validity is “truer,” is not a valid inference. This paper shows that a parsimonious but less true model can have a higher predictive validity than a truer but less parsimonious model.

Hagerty+Srinivasan (1991)

*like TFA it's a sorta review of Breiman
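
A minimal simulation of the Hagerty+Srinivasan point, with made-up data and numpy only: when observations are scarce, an under-specified ("less true") one-predictor model can beat the correctly specified eight-predictor model on out-of-sample error, purely because of its lower variance.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n_train=20, n_test=10_000, n_reps=500):
    # "True" model: y depends on 8 predictors (one strong, seven weak), plus noise.
    beta = np.array([1.0, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2])
    mse_true_model, mse_parsimonious = [], []
    for _ in range(n_reps):
        Xtr = rng.normal(size=(n_train, 8))
        Xte = rng.normal(size=(n_test, 8))
        ytr = Xtr @ beta + rng.normal(scale=2.0, size=n_train)
        yte = Xte @ beta + rng.normal(scale=2.0, size=n_test)

        # Correctly specified ("truer") model: all 8 predictors.
        b_full, *_ = np.linalg.lstsq(Xtr, ytr, rcond=None)
        mse_true_model.append(np.mean((yte - Xte @ b_full) ** 2))

        # Parsimonious (mis-specified) model: first predictor only.
        b_1, *_ = np.linalg.lstsq(Xtr[:, :1], ytr, rcond=None)
        mse_parsimonious.append(np.mean((yte - Xte[:, :1] @ b_1) ** 2))

    print("test MSE, truer model:        ", round(np.mean(mse_true_model), 3))
    print("test MSE, parsimonious model: ", round(np.mean(mse_parsimonious), 3))

simulate()
```

With n_train=20 the parsimonious model typically wins; increase n_train and the truer model overtakes it, which is exactly the explanation-vs-prediction tension the paper is about.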

Is it more than a commentary on overfitting, to the tune of "with enough epicycles you can make the elephant wiggle its trunk"?

If you are referring to Hagerty+Srinivasan:

They certainly didn't think that a better fit => "truer".

They used the term "truer" to describe a model that more accurately captures the underlying causal structure or "true" relationship between variables in a population.

As for the paper I linked, I still haven't read it closely enough to confirm that D-Machine's comment below is a good dismissal.

I'm inclined to think it's more like "interpolating vs. extrapolating".

This essay frequently uses the word "insight", and its primary topic is whether an empirically fitted statistical model can provide that (with Norvig arguing for yes, in my opinion convincingly). How does that differ from your concept of a "cause"?

> I agree that it can be difficult to make sense of a model containing billions of parameters. Certainly a human can't understand such a model by inspecting the values of each parameter individually. But one can gain insight by examing (sic) the properties of the model—where it succeeds and fails, how well it learns as a function of data, etc.

Unfortunately, studying the behavior of a system doesn't necessarily provide insight into why it behaves that way; it may not even provide a good predictive model.

Norvig's textbook surely appears on the bookshelves of researchers, including those building the current top LLMs. So it's odd to say that such an approach "may not even provide a good predictive model". As of today, it is unquestionably the best known predictive model for natural language, by a huge margin. I don't think that's for lack of trying, with billions of dollars or more at stake.

Whether that model provides "insight" (or a "cause"; I still don't know if that's supposed to mean something different) is a deeper question, and e.g. the topic of countless papers trying to make sense of LLM activations. I don't think the answer is obvious, but I found Norvig's discussion to be thoughtful. I'm surprised to see it viewed so negatively here, dismissed with no engagement with his specific arguments and examples.

You can look into Judea Pearl's definitions of causality for more information.

Pearl defines a ladder of causation:

1. Seeing (association)
2. Doing (intervention)
3. Imagining (counterfactuals)

In his view, most ML algorithms are at level 1: they look at data and draw associations. "Agents" have started to take some steps at level 2, doing/intervening.

The smartest humans operate mostly at level 3: they see things, gain experience, and over time build up a strong causal model of the world that makes them capable of answering "what if" questions.
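
To make the difference between levels 1 and 2 concrete, here is a minimal numpy sketch (entirely made-up numbers, not from Pearl): a hidden confounder (illness severity) makes the observed association between treatment and recovery look negative, while intervening, i.e. do(treat), reveals that the treatment actually helps.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

def recovery(treated, severe):
    # Structural equation for the outcome: treatment helps, severity hurts.
    p = 0.5 + 0.2 * treated - 0.4 * severe
    return rng.random(n) < p

# --- Level 1: "seeing" (observational data with a confounder) ---
severe = rng.random(n) < 0.5                           # hidden confounder
treated = rng.random(n) < np.where(severe, 0.9, 0.1)   # sicker patients get treated more often
rec = recovery(treated, severe)
print("P(recovery | treated)   =", rec[treated].mean())    # association only
print("P(recovery | untreated) =", rec[~treated].mean())

# --- Level 2: "doing" (intervene: do(treated=1) / do(treated=0)) ---
severe = rng.random(n) < 0.5
rec_do1 = recovery(np.ones(n, dtype=bool), severe)
rec_do0 = recovery(np.zeros(n, dtype=bool), severe)
print("P(recovery | do(treat))    =", rec_do1.mean())       # causal effect
print("P(recovery | do(no treat)) =", rec_do0.mean())
```

The observational numbers make the treatment look harmful (because sicker patients get treated), while the interventional numbers show it helps; only a level-2 question distinguishes the two.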

Thanks for the response, but (per the omitted portion of my sentence before the semicolon) I was not talking about the M in LLM. I was talking about a conceptual or analytic model that a human might develop to try to predict the behavior of an LLM, per Norvig's claim of insight derived from behavioral observation.

But now that I think a bit about it, the observation that an LLM frequently produces obviously and/or subtly incorrect output, that it isn't robust to prompt rewording, etc., is perhaps itself a useful Norvig-style insight.

Chomsky's talking about predictive models in the context of cognitive science. LLMs aren't really a predictive model of any aspect of human cognitive function.

The generation of natural language is an aspect of human cognition, and I'm not aware of any better model for it than current statistical LLMs. The papers mapping between EEG/fMRI/etc. and LLM activations have generally been oversold so far, but it's an active area of research for good reason.

I'm not saying LLMs are a particularly good model, just that everything else is currently worse. This includes Chomsky's formal grammars, which fail to capture the ways humans actually use language per Norvig's many examples. Do you disagree? If so, what model is better and why?

I’m not really sure what you’re getting at. Could you point to some papers exemplifying the kind of work you’re thinking of? Of course there are lots of people training LLMs and other statistical models on EEG data, but that does not show that, say, GPT-5 is a good model of any aspect of human cognition.

Chomsky, of course, never attempted to model the generation of natural language and was interested in a different set of problems, so LLMs are not really a competitor in that domain anyway (even if you take the dubious step of accepting them as scientific models).

I certainly don’t agree with Norvig, but he doesn’t really understand the basics of what Chomsky is trying to do, so there is not much to respond to. To give three specific examples, he is (i) confused in thinking that Gold’s theorem has anything to do with Chomsky’s arguments, (ii) appears to think that Chomsky studied the “generation of language” because he’s read so little of Chomsky’s work that he doesn’t know what a “generative grammar” is, and (iii) believes that Chomsky thinks that natural languages are formal languages in which every sentence is either clearly in the language or not (again because he’s barely read anything that Chomsky wrote since the 1950s). Then, just to make absolutely sure not to be taken seriously, he compares Chomsky to Bill O’Reilly!

> I'm surprised to see it viewed so negatively here, dismissed with no engagement with his specific arguments and examples.

I find it hard to motivate myself to engage with it, because it is unfortunately quite out of touch with (or simply ignores) some core issues and the major advances in causal modeling and its theory, e.g. Judea Pearl's do-calculus, structural equation modeling, counterfactuals, etc. [1]

It also, IMO, draws a (highly idiosyncratic) distinction between "statistical" models (meaning trained/fitted to data) and "probabilistic" models that doesn't really hold up.

For example, probabilistic models in quantum physics are "fit" too, in that the values of fundamental constants are determined by experimental data, yet these "statistical" models are clearly causal models regardless. Even most quantum-physical models can be argued to be causal; it's just that the causality is probabilistic rather than absolute (i.e. A ==> B is a fuzzy implication rather than an absolute one). It's only when you ask deliberately broad ontological questions (e.g. "does the wave function cause X?") that you actually run into the problem of whether quantum models are causal or not; for most quantum experiments, and phenomena generally, the models are still definitely causal at the level of the particles/waves/fields involved.

IMO, I don't want to engage much with the arguments because the essay starts on the wrong foot by making what I consider an incoherent/unsound distinction, while also ignoring, or just being out of date with, the actual scientific and philosophical progress already made on these issues.

I would also say there is a whole literature on the tradeoffs between explanation (descriptive models in the worst case, causal models in the best case) and prediction (models that accurately reproduce some phenomenon, regardless of whether they are based on a true description or causal model). There are also loads of examples of things that are perfectly deterministic and modeled by perfect "causal" models but which of course still defy human comprehension/intuition, in that the equations need to be run on computers for us to make sense of them (differential-equation models, chaotic systems, etc.). Or, more practically: we can learn all sorts of physical and mental skills, yet we understand barely anything about the brain and how it works and coordinates with the body. But obviously such an understanding is mostly irrelevant for learning how to operate effectively in the world.
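
As a standard illustration of that "perfectly causal yet defying intuition" point (my example, not from the essay): the logistic map is a one-line deterministic rule, but two trajectories starting 1e-9 apart become completely decorrelated within a few dozen steps, so its long-run behavior has to be computed rather than intuited.

```python
# The logistic map: a fully deterministic, one-parameter "causal" rule
# whose long-run behaviour still has to be computed rather than intuited.
def logistic_map(x, r=3.9, steps=50):
    traj = [x]
    for _ in range(steps):
        x = r * x * (1 - x)
        traj.append(x)
    return traj

a = logistic_map(0.200000000)   # two initial conditions differing by 1e-9
b = logistic_map(0.200000001)
for t in (0, 10, 20, 30, 40, 50):
    print(f"step {t:2d}: {a[t]:.6f} vs {b[t]:.6f}  (gap {abs(a[t] - b[t]):.2e})")
```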

That is, in practice, if the phenomenon is sufficiently complex, an accurate causal model that also accurately reproduces the system is likely to be too complex for us to "understand" anyway (or you have identifiability issues, so you can't decide between multiple different models; or you don't have the time/resources/measurement capacity to do all the experiments needed to solve the identifiability problem). So there is almost always a tradeoff between accuracy and understanding. Understanding is a nice luxury, but in many cases it is not important, and in complex cases probably not achievable at all. If you are coming from this perspective, the whole "quandary" of the essay just seems odd.

[1] https://plato.stanford.edu/entries/causal-models/

Unless and until neurologists find evidence of a universal grammar unit (or a biological Transformer, or whatever else) in the human connectome, I don't see how any of these models can be argued to be "causal" in the sense that they map closely to what's physically happening in the brain. That question seems so far beyond current human knowledge that any attempt at it now has about as much value as the ancient Greek philosophers' ideas on the subatomic structure of matter.

So in the meantime, Norvig et al. have built statistical models that can do stuff like predicting whether a given sequence of words is a valid English sentence. I can invent hundreds of novel sentences and run their model, checking each time whether their prediction agrees with my human judgement. If it doesn't, then their prediction has been falsified; but these models turned out to be quite accurate. That seems to me like clear evidence of some kind of progress.
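
To make that falsification loop concrete, here is a toy stand-in for such a model (a tiny add-one-smoothed bigram model trained on a handful of made-up sentences, nothing like Norvig's actual systems): it assigns a score to novel word sequences, and you can check whether its ranking agrees with your own judgment.

```python
from collections import Counter
from math import log

# Tiny corpus standing in for real training data (purely illustrative).
corpus = [
    "the dog chased the cat",
    "the cat saw the dog",
    "a dog saw a cat",
    "the dog saw the cat",
]

BOS, EOS = "<s>", "</s>"
unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    words = [BOS] + sent.split() + [EOS]
    unigrams.update(words[:-1])
    bigrams.update(zip(words[:-1], words[1:]))
vocab = {w for s in corpus for w in s.split()} | {EOS}

def log_prob(sentence):
    """Add-one-smoothed bigram log-probability of a sentence."""
    words = [BOS] + sentence.split() + [EOS]
    return sum(
        log((bigrams[(w1, w2)] + 1) / (unigrams[w1] + len(vocab)))
        for w1, w2 in zip(words[:-1], words[1:])
    )

# "Falsification loop": invent novel sequences, compare model score to human judgment.
for s in ["the cat chased the dog",      # novel but grammatical
          "dog the cat chased the"]:     # same words, scrambled
    print(f"{log_prob(s):8.2f}  {s}")
```

With real models the loop is the same, just with far better scores: propose novel sentences, score them, and compare the ranking to human judgment.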

You seem unimpressed with that work. So what do you think is better, and what falsifiable predictions has it made? If it doesn't make falsifiable predictions, then what makes you think it has value?

I feel like there's a significant contingent of quasi-scientists that have somehow managed to excuse their work from any objective metric by which to evaluate it. I believe that both Chomsky and Judea Pearl are among them. I don't think every human endeavor needs to make falsifiable predictions; but without that feedback, it's much easier to become untethered from any useful concept of reality.

I had this exact reaction; the absence of any discussion of causal modeling makes the whole thing seem horribly out of touch with the real issues here. You can have explanatory and predictive models that are causal, or explanatory and predictive models that are non-causal, and that is the actual issue, not "explanation" vs. "prediction", which is not a tight enough distinction.