Enjoyed the article. To play devil’s advocate, an entirely different explanation for why huge models work: the primary insight was framing the problem as next-word prediction. This immediately creates an internet-scale dataset with trillions of labeled examples, which also has rich enough structure to make huge expressiveness useful. LLMs don’t disprove the bias-variance tradeoff; we just found a lot more data and the GPUs to learn from it.
It’s not like people didn’t try bigger models in the past, but either the data was too small or the structure too simple to show improvements with more model complexity. (Or they simply trained the biggest model they could fit on the GPUs of the time.)
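To make that concrete, here's a toy sketch (pure Python, made-up six-word corpus) of why next-word prediction means raw text is its own labeling: every token is the label for the context that precedes it.

    # Toy sketch: next-word prediction turns raw text into (context, label)
    # pairs for free. One example per token, no human annotation needed.
    corpus = "the cat sat on the mat".split()

    examples = [(corpus[:i], corpus[i]) for i in range(1, len(corpus))]

    for context, label in examples:
        print(context, "->", label)
    # ['the'] -> cat, ['the', 'cat'] -> sat, and so on. Scale the corpus up
    # to the whole web and you get trillions of such "labeled" examples.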
I think a lot of it is the massive amount of compute we've gained in the last decade. While inference may have been possible on the hardware of the time, training would have taken lifetimes.
I have a textbook somewhere in the house from about 2000 that says that there is no point having more than three layers in a neural network.
Compute was just too expensive to have neural networks big enough for this not to be true.
Once you have three layers (i.e. one "hidden" layer) you can approximate arbitrary functions, so a three-layer network has the same "power" as an arbitrarily large network.
I'm sure that's what the text book meant, rather than any point about the expense of computing power.
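For what it's worth, a rough numpy illustration of that universal-approximation point (tanh hidden layer, hand-rolled gradient descent, all hyperparameters arbitrary). It only shows that one hidden layer can drive the error on sin(x) down; it says nothing about how efficiently it does so, which is where depth actually pays off.

    import numpy as np

    # One hidden layer of tanh units fit to sin(x) by plain gradient descent.
    rng = np.random.default_rng(0)
    x = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
    y = np.sin(x)

    H = 32                                   # hidden width (arbitrary)
    W1 = rng.normal(0, 1.0, (1, H)); b1 = np.zeros(H)
    W2 = rng.normal(0, 0.1, (H, 1)); b2 = np.zeros(1)

    lr = 0.01
    for step in range(30000):
        h = np.tanh(x @ W1 + b1)             # hidden layer
        pred = h @ W2 + b2                   # linear output layer
        err = pred - y
        # backprop by hand
        dW2 = h.T @ err / len(x); db2 = err.mean(0)
        dh = (err @ W2.T) * (1 - h**2)
        dW1 = x.T @ dh / len(x); db1 = dh.mean(0)
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2

    # Should end up small; the exact value depends on the arbitrary settings above.
    print("MSE:", float((err**2).mean()))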
People believed that more parameters would lead to overfitting instead of generalization. The various regularization methods we use today to avoid overfitting hadn't been discovered yet. Your statement is most likely about this.
I think the problems with big networks were vanishing gradients, which is why we now use the ReLU activation function, and training stability, which was addressed with residual connections.
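A hedged toy demo of the vanishing-gradient part (numpy; the depth, width, and pre-activations are made up, He-style init): push a gradient back through many layers and compare how much of its norm survives with sigmoid derivatives versus ReLU derivatives.

    import numpy as np

    # Backpropagate a gradient through `depth` random layers and watch its norm.
    rng = np.random.default_rng(0)
    depth, width = 50, 64

    def surviving_grad_norm(act_grad):
        g = np.ones(width)
        for _ in range(depth):
            W = rng.normal(0, np.sqrt(2 / width), (width, width))  # He-style init
            pre = rng.normal(0, 1, width)          # stand-in pre-activations
            g = (W.T @ g) * act_grad(pre)          # one backprop step
        return np.linalg.norm(g)

    def sigmoid_grad(x):
        s = 1.0 / (1.0 + np.exp(-x))
        return s * (1 - s)                         # at most 0.25 per unit

    def relu_grad(x):
        return (x > 0).astype(float)               # exactly 1 where active

    print("sigmoid:", surviving_grad_norm(sigmoid_grad))  # typically collapses toward 0
    print("relu:   ", surviving_grad_norm(relu_grad))     # typically stays the same order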
Overfitting is the problem of having too little training data for your network size.
Possibly, I would have to dig up the book to check. IIRC it did not mention overfitting but it was a long time ago.
But regarding this Lottery Ticket hypothesis, what it means is that a small percentage of the parameters can be identified such that: when those parameters are taken by themselves and reset to their original pre-training weights, and the resulting network is trained on the same data as the parent, it performs similarly to the parent. So in fact, it seems that far fewer parameters are needed to encode predictions across the Internet-scale dataset. The large model is just creating a space in which that small "rock star" subset of parameters can be automatically discovered. It's as if the training establishes a competition among small subsets of the network, where a winner emerges.
Perhaps there is a kind of unstable situation whereby once the winner starts to anneal toward predicting the training data, it does more and more of the predictive work. The more relevant the subset proves to be to the result, the more of the learning it captures, because it is more responsive to subsequent learning.
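Here's a minimal sketch of that train / prune / reset / retrain loop on a toy regression problem (numpy; the network size, the 20% keep ratio, and the training schedule are arbitrary choices, not taken from the paper):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(0, 1, (256, 8))
    y = np.sin(x @ rng.normal(0, 1, (8, 1)))               # toy target

    init_params = {"W1": rng.normal(0, 0.5, (8, 64)),
                   "W2": rng.normal(0, 0.5, (64, 1))}

    def train(params, mask, steps=3000, lr=0.05):
        p = {k: v.copy() for k, v in params.items()}
        for _ in range(steps):
            W1, W2 = p["W1"] * mask["W1"], p["W2"] * mask["W2"]
            h = np.tanh(x @ W1)
            err = h @ W2 - y
            dW2 = h.T @ err / len(x)
            dW1 = x.T @ ((err @ W2.T) * (1 - h**2)) / len(x)
            p["W1"] -= lr * dW1 * mask["W1"]                # only unpruned weights move
            p["W2"] -= lr * dW2 * mask["W2"]
        return p, float((err**2).mean())

    mask = {k: np.ones_like(v) for k, v in init_params.items()}
    trained, dense_loss = train(init_params, mask)          # 1) train the dense net

    for k in mask:                                          # 2) keep the largest 20% of weights
        cutoff = np.quantile(np.abs(trained[k]), 0.8)
        mask[k] = (np.abs(trained[k]) >= cutoff).astype(float)

    _, ticket_loss = train(init_params, mask)               # 3) reset to init, retrain the sparse subnet
    print(f"dense loss {dense_loss:.4f}  vs  20% ticket loss {ticket_loss:.4f}")

If the idea transfers even to this toy setup, the two printed losses should land in the same ballpark.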
Same thing with Computer Vision: as Andrew Ng pointed out, the main thing that enabled the rapid progress was not new models but large _labeled_ datasets, particularly ImageNet.
Yes: larger usable datasets, paired with an acceleration of mainstream parallel computing power (GPUs) and increasing algorithmic flexibility (CUDA).
Without all three, progress would have been much slower.
Do you have a link handy for where he says this explicitly?
Here's an older interview where he talks about the need for accurate dataset labeling:
"In many industries where giant data sets simply don’t exist, I think the focus has to shift from big data to good data. Having 50 thoughtfully engineered examples can be sufficient to explain to the neural network what you want it to learn."
https://spectrum.ieee.org/andrew-ng-data-centric-ai
Language then would be the key factor enabling complex learning in meat space too? I feel like I’ve heard this debate before….
I think it doesn't have to follow. You could also generalize the idea and see learning as successfully being able to "use the past to predict the future" for small time increments. Next-word prediction would be one instance of this, but for humans and animals, you could imagine the same process with information from all senses. The "self-supervised" trainset is then just, well, life.
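As a toy illustration of that framing (numpy, a made-up 1-D "sensor" signal, plain least squares in place of a neural net): the same self-supervised recipe works on any stream, because the label is just whatever the stream does next.

    import numpy as np

    # Next-step prediction on an arbitrary signal: past window in, next sample out.
    t = np.arange(2000)
    signal = np.sin(0.05 * t) + 0.1 * np.random.default_rng(0).normal(size=t.size)

    window = 16
    X = np.stack([signal[i:i + window] for i in range(len(signal) - window)])
    y = signal[window:]                        # the "label" is just the next sample

    w, *_ = np.linalg.lstsq(X, y, rcond=None)  # linear predictor, for simplicity
    print("next-step MSE:", float(((X @ w - y) ** 2).mean()))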
I'm no expert but have been thinking about this a lot lately. I wouldn't be surprised - language itself seems to be an expression of the ability to create an abstraction, distill the world into compressed representations, and manipulate symbols. It seems fundamental to human intelligence.
As a layman, I find it helps to understand the importance of language as a vehicle of intelligence by realizing that without language, your thoughts are just emotions.
And therefore I always thought that the more you master a language the better you are able to reason.
And considering how much we let LLMs formulate text for us, how dumb will we get?
> without language, your thoughts are just emotions
That's not true. You can think "I want to walk around this building" without words, in abstract thoughts or in images.
Words are a layer above the thoughts, not the thoughts themselves. You can confirm this if you have ever had the experience of trying to say something but forgetting the right word. Your mind knew what it wanted to say, but it didn't know the word.
Chess players operate on sequences of moves a dozen turns ahead in their minds using no words, seeing the moves on the virtual chessboards they imagine.
Musicians hear the note they want to play in their minds.
Our brains have full multimedia support.
It's probably not as simple as just being emotions, but actually there's a really interesting example here: Helen Keller. In her autobiography she describes what it was like before she learned language, and how she remembers it being almost unconscious and just a mix of feelings and impulses. It's fascinating.
> without language, your thoughts are just emotions.
Is that true though? Seems like you can easily have some cognitive process that visualizes things like cause and effect, simple algorithms or at least sequences of events.
In other words, we're rediscovering the lessons from George Orwell's Nineteen Eighty-Four. Language is central to understanding; remove subversive language and you remove the ability to even think about it.
I think that the takeaway message for meat space (if there is one) is that continuous life-long learning is where it is at: keep engaging your brain and playing the lottery in order to foster the winning tickets. Be exposed to a variety of stimuli and find relationships.
as a researcher in NLP slash computational linguistics, this is what I tend to think :) (maybe a less strong version, though, there are other kinds of thinking and learning).
so I'm always surprised when some linguists decry LLMs, and cling to old linguistics paradigms instead of reclaiming the important role of language as (a) vehicle of intelligence.
Why does 'next-word prediction' explain why huge models work? You're saying we needed scale, and saying we use next-word prediction, but how does one relate to the other? Diffusion models also exist and work well for images, and they do seem to work for LLMs too.
I think it's the same underlying principle of learning the "joint distribution of things humans have said". Whether done autoregressively via LLMs or via diffusion models, you still end up learning this distribution. The insight seems to be the crazy leap that this is A) a valid thing to talk about and B) that learning this distribution gives you something meaningful.
The leap is in transforming an ill-defined objective of "modeling intelligence" into a concrete proxy objective. Note that the task isn't even "the distribution of valid/true things", since validity/truth is hard to define. It's something akin to "the distribution of things a human might say", implemented in the "dumbest" possible way: modeling the distribution of humanity's collective textual output.
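A deliberately dumb sketch of what "model the distribution of humanity's textual output" can mean at its most basic (pure Python, toy corpus): a bigram count model. An LLM swaps the count table for a neural net and conditions on the whole prefix via the chain rule, but the objective has the same shape.

    from collections import Counter, defaultdict

    corpus = "the cat sat on the mat . the dog sat on the rug .".split()

    # Count how often each word follows each other word.
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1

    def p_next(prev, word):
        total = sum(counts[prev].values())
        return counts[prev][word] / total if total else 0.0

    # Score a sentence under the learned "distribution of things said":
    # p(w1..wn) approximated as the product of p(wi | w(i-1)).
    sentence = "the dog sat on the mat".split()
    prob = 1.0
    for prev, nxt in zip(sentence, sentence[1:]):
        prob *= p_next(prev, nxt)
    print(prob)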
To crack NLP we needed a large dataset of labeled language examples. Prior to next-word prediction, the dominant benchmarks and datasets were things like translation of English to German sentences. These datasets were on the order of millions of labeled examples. Next-word prediction turned the entire Internet into labeled data.
RNNs worked that way too; the difference is that Transformers are parallelized, which is what made next-word prediction work so well: you could have an input thousands of tokens in length without needing your training to be thousands of times longer.
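Rough numpy sketch of that parallelism point (toy sizes, random weights, a single attention head, no training loop): a causal mask lets one matrix pass produce a next-token representation for every position at once, whereas an RNN has to walk the sequence one step at a time.

    import numpy as np

    rng = np.random.default_rng(0)
    T, d = 8, 16                                   # sequence length, model dim (arbitrary)
    x = rng.normal(0, 1, (T, d))                   # stand-in token embeddings

    Wq, Wk, Wv = (rng.normal(0, d ** -0.5, (d, d)) for _ in range(3))
    Q, K, V = x @ Wq, x @ Wk, x @ Wv

    scores = Q @ K.T / np.sqrt(d)                  # (T, T): all pairs in one shot
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[future] = -np.inf                       # position t only attends to positions <= t
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)

    out = attn @ V                                 # (T, d): outputs for all T positions, one pass
    print(out.shape)
    # An RNN produces these T outputs only via a length-T sequential loop, so
    # longer contexts mean proportionally more serial steps during training.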