Amazing article. I was under the misapprehension that temp and other output parameters actually do affect caching. Turns out I was wrong and this explains why beautifully.
Great work. Learned a lot!
Yay, glad I could help! The sampling process is so interesting on its own that I really want to do a piece on it as well.
Looking forward to it!
I had a “somebody is wrong on the internet!!” discussion about exactly this a few weeks ago, and they claimed to be a professor in AI.
Where do people get the idea that temperature affects caching in any way? Temperature is about next-token prediction / output, not input.
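To make that concrete, here is a minimal sketch (plain NumPy, purely illustrative) of where temperature actually enters: it rescales the logits for the next token right before sampling, well after the prompt has been processed and its keys/values cached.

```python
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float = 1.0) -> int:
    """Pick the next token id from raw logits.

    Temperature only rescales the logits before the softmax; it never
    touches how the prompt was processed or what sits in the KV cache.
    """
    if temperature == 0:
        return int(np.argmax(logits))      # greedy decoding
    scaled = logits / temperature          # the only place temperature appears
    probs = np.exp(scaled - scaled.max())  # numerically stable softmax
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

# Hypothetical logits for a tiny three-token vocabulary
logits = np.array([2.0, 1.0, 0.1])
print(sample_next_token(logits, temperature=0.7))
```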
Because in my mind, as someone not working directly on this kind of stuff, I figured that caching worked similarly to resource caching in a webserver environment.
It's a semantics issue where the word "caching" is overloaded depending on context. For people who are not familiar with the inner workings of LLMs, this can cause understandable confusion.
Being wrong about details like this is exactly what I would expect from a professor. They are mainly grant writers and PhD herders; they are often good at presenting as well, but they mostly have only gut feelings about the technical details of anything invented after they became a professor.
Excellent HN-esque innovation in moderation: immediate improvement in S/N ratio, unobtrusive UX, gentle feedback to humans, semantic signal to machines.
How was the term "rug" chosen, e.g. in the historical context of newspaper folds?
Really well done article.
I'd note that when I gave the input/output screenshot to ChatGPT 5.2 it failed (with lots of colorful chain of thought), though Gemini got it right away.
Huh, when I was writing the article it was GPT-5.1 and I remember it got it no problem.
Thanks for sharing; you clearly spent a lot of time making this easy to digest. I especially like the tokens-to-embedding visualisation.
I recently had some trouble converting a HF transformer I trained with PyTorch to Core ML. I just couldn’t get the KV cache to work, which made it unusably slow after 50 tokens…
Thank you so much <3
Yes, I recently wrote https://github.com/samwho/llmwalk and had a similar experience with cache vs no cache. It’s so impactful.
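For anyone curious to reproduce that cache vs no-cache gap, here is a rough sketch using Hugging Face transformers; the model name and generation settings are assumptions for illustration, not taken from llmwalk.

```python
# Rough sketch: time greedy generation with and without the KV cache.
# "gpt2" is just a small example model; any causal LM would do.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tok("The KV cache matters because", return_tensors="pt")

for use_cache in (True, False):
    start = time.time()
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=100,
                       do_sample=False, use_cache=use_cache)
    print(f"use_cache={use_cache}: {time.time() - start:.2f}s")
```

Without the cache, every new token recomputes attention over the entire sequence from scratch, so per-token cost grows with sequence length, which matches the "unusably slow after 50 tokens" experience above.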
Hopefully you can write the teased next article about how the Feedforward and Output layers work. The article was super helpful for getting a better understanding of how GPT-style LLMs work!
Yeah! It’s planned for sure. It won’t be the direct next one, though. I’m taking a detour into another aspect of LLMs first.
I’m really glad you liked it, and seriously the resources I link at the end are fantastic.
What an excellent write-up. Thank you!
Thank you so much <3