Wouldn't this imply that most of the inference-time storage and compute might be unnecessary?

If the hypothesis is true, it makes sense to scale models up as much as possible during training - but once the model is sufficiently trained for the task, wouldn't 99% of the weights be literal "dead weight", since they represent the "failed lottery tickets", i.e. the subnetworks whose initial values never let them learn anything useful? So why do we keep them around and waste enormous amounts of storage and compute on them?

Quick example: Kimi K2 is a recent large mixture-of-experts model. Each “expert” is really just a path through the network. At each token, 32B of its 1T parameters are active, i.e. only about 3.2% for any one token.
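
To make that concrete, here's a toy sketch of top-k expert routing. It's not Kimi K2's actual code (the class, dimensions, and top_k value here are made up for illustration); it just shows how a router sends each token to a couple of experts, so most expert weights sit idle for any given token.

```python
import torch
import torch.nn as nn

class ToyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=32, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_experts)])
        self.top_k = top_k

    def forward(self, x):                       # x: (tokens, d_model)
        scores = self.router(x)                 # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(len(self.experts)):  # dispatch tokens routed to expert e
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * self.experts[e](x[mask])
        return out

moe = ToyMoE()
print(f"fraction of experts active per token: {moe.top_k / len(moe.experts):.1%}")  # 6.2%
```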

That sounds surprisingly like "Humans only use 10% of their brain at any given time."

That’s exactly how it works, read up on pruning. You can zero out most of the weights and still get good results. One issue is that unstructured sparse matrices are much less efficient to multiply on current hardware, so the savings are hard to realize in practice.
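
If you want to play with this, PyTorch ships pruning utilities. A minimal magnitude-pruning sketch (the 90% amount is just an example; real pipelines prune gradually and fine-tune to recover accuracy):

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(1024, 1024)
prune.l1_unstructured(layer, name="weight", amount=0.9)  # zero the smallest 90% by magnitude
prune.remove(layer, "weight")                            # bake the pruning mask into the weights

sparsity = (layer.weight == 0).float().mean().item()
print(f"zeroed weights: {sparsity:.0%}")
# The zeros are still stored densely, so this alone saves neither memory nor FLOPs:
# you need sparse formats or structured sparsity the hardware can actually exploit.
```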

But yes, you’ve got it.

Someone on twitter was exploring this and linked to some related papers: for example, you can trim experts from a MoE model if you're 100% sure they're never active for your specific task.
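
The rough idea (a hypothetical sketch, not the method from those papers): route a calibration set from your task through the router, count how often each expert fires, and only then consider dropping experts that never fire.

```python
import torch

def expert_usage(router, calibration_tokens, top_k=2):
    """Count how many times each expert is selected over a calibration set."""
    scores = router(calibration_tokens)            # (tokens, n_experts)
    _, idx = scores.topk(top_k, dim=-1)
    return torch.bincount(idx.flatten(), minlength=scores.shape[-1])

# Stand-in router and data for illustration only.
router = torch.nn.Linear(64, 32)
tokens = torch.randn(10_000, 64)
counts = expert_usage(router, tokens)
unused = (counts == 0).nonzero().flatten().tolist()
print(f"experts never activated on this data: {unused}")
# Only experts with zero hits are candidates for removal, and even then only if
# the calibration data truly covers everything the task will ever see.
```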

What the bigger, wider net buys you is generalization.

Look into pruning

For any particular pattern learned, 99% of the weights are dead weight. But it’s not the same 99% for each lesson learned.
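
A back-of-the-envelope simulation of that point (the 1%-per-pattern figure and the independence of patterns are assumptions made purely for illustration): each pattern touches a random 1% of the weights, yet the union across many patterns covers most of the network.

```python
import torch

n_weights = 100_000
per_pattern = 0.01            # assume each learned pattern uses ~1% of the weights
used = torch.zeros(n_weights, dtype=torch.bool)

for pattern in range(1, 501):
    active = torch.randperm(n_weights)[: int(per_pattern * n_weights)]
    used[active] = True
    if pattern in (1, 10, 100, 500):
        frac = used.float().mean().item()
        print(f"{pattern:>3} patterns -> ~{frac:.0%} of weights used by at least one")
# Roughly 1%, 10%, 63%, 99%: individually sparse, collectively dense.
```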