Hype aside, if you can get an answer to a computing problem (with error bars) in significantly less time, in domains where precision just isn't that important (such as LLMs), this could be a game changer.

Precision actually matters a decent amount in LLMs. Quantization is applied strategically, in places where it minimizes performance degradation, and models are capable enough that some loss in precision still leaves a good model. I'm skeptical about how well this would turn out, though it's probably always possible to remedy precision loss with a sufficiently larger model.
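To make "some loss in precision" concrete, here's a minimal sketch of symmetric per-tensor int8 quantization on a random weight matrix (the shapes and scheme are illustrative, not any particular model's recipe):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)

# Symmetric per-tensor int8 quantization: map [-max|w|, max|w|] onto [-127, 127].
scale = np.abs(w).max() / 127.0
w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

# Dequantize and measure how much of the original tensor survives.
w_deq = w_q.astype(np.float32) * scale
rel_err = np.linalg.norm(w - w_deq) / np.linalg.norm(w)
print(rel_err)  # small but nonzero rounding error (on the order of 1%)
```

The error is real but modest, which is why quantizing the less sensitive parts of a model often costs little in output quality.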

LLMs are inherently probabilistic. Things like ReLU throw out a ton of data deliberately.

No, that isn't throwing out data. Activation functions perform a nonlinear transformation to increase the expressivity of the network. If you did two matrix multiplications without a ReLU in between, your function would contain less information than with a ReLU in between.

How are you calculating "less information"?

I think what they meant was:

Two linear transformations compose into a single linear transformation. If you have y = W2(W1*x) = (W2*W1)*x = W*x where W = W2*W1, you've effectively done one matrix multiply instead of two. The composition of linear functions is linear.
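You can check the collapse numerically; a small numpy sketch (shapes here are arbitrary, just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))  # 3 inputs -> 4 hidden
W2 = rng.standard_normal((2, 4))  # 4 hidden -> 2 outputs
x = rng.standard_normal(3)

# Two linear maps applied in sequence...
y_two_steps = W2 @ (W1 @ x)

# ...equal one linear map with the precomposed matrix W = W2*W1.
W = W2 @ W1
y_one_step = W @ x

assert np.allclose(y_two_steps, y_one_step)
```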

The ReLU breaks this because it's nonlinear: ReLU(W1*x) can't be rewritten as some W*x, so W2(ReLU(W1*x)) can't collapse either.
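A tiny hand-picked counterexample shows why no single W can reproduce the ReLU'd network: any map of the form x -> W*x must be additive, f(a+b) = f(a) + f(b), and with a ReLU in the middle that fails (the 1x1 weights below are made up purely for illustration):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

W1 = np.array([[1.0]])  # 1 input -> 1 hidden unit
W2 = np.array([[1.0]])  # 1 hidden unit -> 1 output

def f(x):
    return W2 @ relu(W1 @ x)

a = np.array([1.0])
b = np.array([-1.0])

# A linear map would satisfy f(a + b) == f(a) + f(b). Here it doesn't:
print(f(a + b))     # relu(0)            -> [0.]
print(f(a) + f(b))  # relu(1) + relu(-1) -> [1.]
```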

Without nonlinearities like ReLU, many layers of a neural network could be collapsed into a single matrix multiplication. This inherently limits the functions the network can approximate, because linear functions are not very good at approximating nonlinear ones. And there are many nonlinearities involved in modeling speech, video, etc.
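As a toy illustration of that limit: the best purely linear fit to y = |x| on a symmetric interval has slope ~0 and misses badly, while a two-unit ReLU "network" represents it exactly, since |x| = relu(x) + relu(-x). (The target and setup are made up for this sketch.)

```python
import numpy as np

xs = np.linspace(-1, 1, 101)
target = np.abs(xs)  # a simple nonlinear target

# Best linear fit y = w*x by least squares: slope is ~0 by symmetry,
# so the linear model can't track |x| at all.
w = (xs @ target) / (xs @ xs)
linear_err = np.max(np.abs(w * xs - target))

# Two ReLU units reproduce |x| exactly: |x| = relu(x) + relu(-x).
relu = lambda z: np.maximum(z, 0.0)
net = relu(xs) + relu(-xs)
net_err = np.max(np.abs(net - target))

print(linear_err)  # ~1.0: worst-case error of the linear fit
print(net_err)     # 0.0: the ReLU pair matches the target exactly
```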