> Apple really stumbled into making the perfect hardware for home inference machines

For LLMs. For inference with other kinds of models, where the amount of compute needed relative to the amount of data transfer is higher, Apple is less ideal, and systems with lower memory bandwidth but more FLOPS shine. And if things like Google’s TurboQuant work out for efficient KV-cache quantization, Apple could lose a lot of that edge for LLM inference too, since that would reduce the amount of data shuffling relative to compute.
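To make the bandwidth-vs-FLOPS point concrete: single-stream LLM decoding is memory-bound, so a rough upper bound on tokens/sec is memory bandwidth divided by bytes read per token. A quick sketch, with illustrative numbers (the ~800 GB/s figure and 4-bit 70B weight size are assumptions, not benchmarks from this thread):

```python
# Roofline-style estimate: every decoded token must stream the model weights
# (plus KV cache) through memory, so bandwidth caps throughput.
def max_tokens_per_sec(bandwidth_gb_s: float,
                       weight_gb: float,
                       kv_cache_gb: float = 0.0) -> float:
    # Upper bound: bandwidth / bytes touched per token (weights + KV cache).
    return bandwidth_gb_s / (weight_gb + kv_cache_gb)

# e.g. ~800 GB/s unified memory, 70B params at ~4 bits ≈ 35 GB of weights
print(f"{max_tokens_per_sec(800, 35):.0f} tok/s upper bound")
```

Actual throughput lands below this bound once compute becomes the limit, which is exactly where a high-FLOPS, lower-bandwidth system can pull ahead.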

Or just mean that you could run a 5x bigger model on Apple than before.

Well, since it’s the KV cache that TurboQuant optimizes, it means five times more context fits into RAM, all other things being equal, not a five times bigger model. But, sure, at any given context size with the same RAM available, you can instead fit a bigger model, which also takes more compute to reach the same performance.
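The context-vs-model trade-off above is easy to put numbers on. A back-of-the-envelope KV-cache sizing, using hypothetical Llama-style dimensions (32 layers, 8 KV heads, head dim 128; none of these come from the thread):

```python
# KV-cache size: 2x (keys and values), one entry per layer/head/dim/token.
def kv_cache_bytes(seq_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: float = 2.0) -> float:
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

fp16 = kv_cache_bytes(32_768)                       # 16-bit cache
quant = kv_cache_bytes(32_768, bytes_per_elem=0.4)  # ~3-bit cache, ~5x smaller
print(f"fp16: {fp16 / 2**30:.1f} GiB, quantized: {quant / 2**30:.1f} GiB")
```

So a ~5x cache compression either frees gigabytes for a larger model at the same context, or stretches the same RAM to ~5x the context, which is the choice the comments are debating.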

Anything that increases the compute needed to fully utilize RAM bandwidth in optimal LLM serving weakens Apple's advantage there.