Hacker News

Because most of the people squeezing a heavily quantized small model onto their consumer GPU don't realize they've left no room for the activations and the KV cache, so they're stuck with a tiny context window.
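
To put rough numbers on it, here's a back-of-the-envelope sketch. The per-token KV cache cost is the standard 2 (keys and values) x layers x KV heads x head dim x bytes per element; everything else is an assumption for illustration: the model shape (32 layers, 8 KV heads, head dim 128, fp16 cache), the 8 GiB card, the 6.5 GiB of weights, and the 0.5 GiB lumped in for activations and runtime overhead. The function names are made up too.

```python
GIB = 1024 ** 3


def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache needed per token of context.

    The leading 2 accounts for storing both keys and values in every layer.
    bytes_per_elem=2 assumes an fp16/bf16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_context_tokens(vram_bytes, weight_bytes, overhead_bytes, per_token_bytes):
    """Tokens of context that fit in whatever VRAM is left after the weights.

    overhead_bytes is a rough allowance for activations and runtime buffers
    (an assumed lump sum, not a measured value).
    """
    free = vram_bytes - weight_bytes - overhead_bytes
    return max(free, 0) // per_token_bytes


# Hypothetical setup: 8 GiB card, 6.5 GiB of quantized weights, 0.5 GiB overhead.
per_token = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
tokens = max_context_tokens(
    vram_bytes=8 * GIB,
    weight_bytes=13 * GIB // 2,
    overhead_bytes=GIB // 2,
    per_token_bytes=per_token,
)
print(f"{per_token // 1024} KiB per token, room for about {tokens} tokens")
```

With those made-up numbers the cache costs 128 KiB per token, so the 1 GiB left over holds about 8k tokens. Pick a slightly bigger quant and that headroom is gone.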