The relationship between Ollama and llama.cpp is much closer than it might seem.
Ollama is llama.cpp with a nice little installer GUI and a nice little server binary.
llama.cpp has a server binary as well, just no nice installer GUI.
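To make that concrete, here's roughly what talking to each server looks like. This assumes the default ports (11434 for Ollama, 8080 for llama-server) and uses `foo` as a stand-in for whatever model you've actually pulled or loaded; both expose an OpenAI-style chat endpoint, so the request shape is basically the same:

```sh
# Ollama's server (runs in the background once the app is installed)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "foo", "messages": [{"role": "user", "content": "Hello"}]}'

# llama.cpp's server binary (llama-server), same style of endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "foo", "messages": [{"role": "user", "content": "Hello"}]}'
```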
The only recent time Ollama had a feature llama.cpp didn't was when they patched in SWA (sliding-window attention) together with Google; llama.cpp had it a couple of weeks later.
Ollama is significantly behind llama.cpp in important areas, e.g. in the Gemma blog post they note they'll get to tool calls and multimodal support real soon now.
I don't care about llama.cpp, just like I don't care about V8 when I reach for Node. And I suspect many other people don't, either. Lots of folks don't want to integrate a library. They don't want to download a model or weights. They want to `ollama run foo` and move on with their lives. I don't need to worry about whether my binary was compiled with the right flags on my MacBook versus on a Linux server with an Nvidia GPU, or about setting gpu-layers or num_ctx.
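Concretely, the two workflows look something like this. Model names and paths are placeholders, the llama.cpp flags are llama-server's, and the right values depend entirely on your hardware, which is kind of the point:

```sh
# Ollama: pulls the weights and picks sensible defaults for this machine
ollama run foo

# llama.cpp: you download a GGUF yourself and tune the knobs per machine,
# e.g. GPU offload layers and context size (Ollama's num_ctx equivalent)
./llama-server -m ./models/foo.gguf --n-gpu-layers 99 --ctx-size 8192
```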
> Ollama is significantly behind llama.cpp in important areas, e.g. in the Gemma blog post they note they'll get to tool calls and multimodal support real soon now.
If you don't use those things, you don't need to care. I'll just use another model that works.
And that's the thing, really. Most folks don't give a shit about getting the maximum performance. They're probably not even keeping their GPU busy all the time. They just need it to work consistently without having to worry about nonsense. llama.cpp simply isn't that tool.