Hacker News

Or you can just load up Ollama, have it load a local model, and point Claude or opencode at it...

Is this article old? It's not. I'm not sure why he went through all the bother of llama.cpp.

I had exactly the same question. Then I finished reading the post. The reason is pretty clear, and it's written in the post: it is faster than ollama+mlx.

sleepybrett a day ago

how much faster?

freerunnering 17 hours ago

I was benchmarking different models, different engines, and different draft models. I posted a video on Twitter, and people started asking about the setup in the final screen recording. So the blog post isn't so much "how a beginner should set something up" as "here's the setup I posted in the video".

Original video: https://x.com/Freerunnering/status/2065275403548168398

And in the blog post there is a table showing the different speeds I got from different engines.

Slowest combo was 38.1 tk/s, and the fastest was 72.2 tk/s. All from "the same" model.
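
To put those two figures in proportion, the ratio can be worked out directly. This is only a minimal sketch: the function name `speedup` is hypothetical, and the two numbers are the only values taken from the comment above. The comment does not say which engine produced which figure, so this is the spread across all tested combos, not a measurement of ollama+mlx against llama.cpp specifically.

```python
def speedup(slow_tps: float, fast_tps: float) -> float:
    """Return how many times faster fast_tps is than slow_tps."""
    if slow_tps <= 0:
        raise ValueError("slow_tps must be positive")
    return fast_tps / slow_tps


slowest = 38.1  # tk/s, slowest combo quoted above
fastest = 72.2  # tk/s, fastest combo quoted above

ratio = speedup(slowest, fastest)
percent_faster = (ratio - 1.0) * 100.0

print(f"ratio: {ratio:.3f}x")
print(f"extra throughput: {percent_faster:.1f}%")
```

So the fastest combo is roughly 1.9 times the slowest, a bit under double the tokens per second.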

krzyk 12 hours ago

Ollama is a wrapper on top of llama.cpp, and it makes llama.cpp slower. Why use it?

Also Ollama has other issues (like forgetting what it really is - a wrapper).