1. On the technical side:
The cache only makes generation fast; it doesn't influence what gets chosen next. The loops that hurt the most (point 2 below) are the ones where the model re-decides to do the same thing in different words, which is much harder to detect automatically. We're experimenting with a repetition penalty and with turning thinking off to address the first kind of looping (described below).
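For anyone unfamiliar with what a repetition penalty does, here is a minimal sketch of the common formulation: before sampling, the scores of tokens that have already been generated are pushed down. The function name and the default value are illustrative assumptions, not the API of any particular inference server, and real implementations work on tensors rather than lists.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty=1.2):
    """Return a copy of logits with already-generated tokens made less likely.

    Hypothetical helper for illustration: `logits` is a list of raw scores
    indexed by token id, `seen_token_ids` are the ids generated so far.
    """
    adjusted = list(logits)
    for token_id in set(seen_token_ids):
        score = adjusted[token_id]
        # Dividing a positive score or multiplying a negative one both lower it.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


if __name__ == "__main__":
    # Tokens 0 and 1 were already generated; token 2 is left untouched.
    print(apply_repetition_penalty([2.0, -1.0, 0.5], [0, 1, 0], penalty=2.0))
```

Because it works on individual tokens, this can only discourage literal repetition. It does nothing for a model that repeats the same decision in different words.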
2. On "why is looping a problem" for us:
A practical example, which I covered in the post: "add --json to every command that does a get or list in faas-cli". This is a small-ish, open source CLI written with Cobra, a very common framework.
If I send that to Claude (any of their models) or Codex (GPT), I have a fully working solution the next time I open that terminal, anywhere from a few seconds to a few minutes later.
With the local model, when it loops, you see some progress and start working on something else. Then you come back, maybe even 30 minutes later, and see it's been printing the same 5 lines over and over the whole time.
Trust is important for a tool like this, and that eroded it.
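The "same 5 lines over and over" case is the easy one to catch mechanically. As a rough sketch (the function name and thresholds are assumptions for illustration, not something we ship), a watchdog could check whether the tail of the output is one block of lines repeated back to back:

```python
def repeating_tail_period(lines, max_period=10, min_repeats=3):
    """Return the length of the block repeated at the end of `lines`, or None.

    Hypothetical helper: `lines` is the output captured so far, one string
    per line. A block counts as a loop if it appears `min_repeats` times
    in a row at the very end of the output.
    """
    n = len(lines)
    for period in range(1, max_period + 1):
        needed = period * min_repeats
        if needed > n:
            break  # not enough output yet to judge longer blocks
        tail = lines[n - needed:]
        block = tail[:period]
        if all(tail[i] == block[i % period] for i in range(needed)):
            return period
    return None


if __name__ == "__main__":
    output = ["start"] + ["a", "b", "c", "d", "e"] * 3
    print(repeating_tail_period(output))
```

This only compares lines exactly, so it catches verbatim loops and nothing else.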
The other type of loop I mention in the blog post is the "unable to solve it" loop. Han ran into that one more.
"Oh I need to fix the indent from 8 to 5 characters in main.py" "Wait I don't know how to write Python code" "Oh now it's broken and I don't know what to do, maybe I should stop" "Let me edit ... " etc, etc