There's more to it, though. The inference code you linked to is Python. Unless my software is written in Python, I have to ship a CPython runtime to run the inference code, then wire it up (or port it, if I'm feeling spicy).

Ollama brings value by exposing an API (plain HTTP over a local socket) with many client SDKs. You don't even need the SDKs to use it effectively. If you're writing Node or PHP or Elixir or Clojurescript or whatever else you enjoy, you're probably covered.
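To make the no-SDK point concrete, here's a rough sketch of calling Ollama's HTTP API straight from Node with nothing but fetch. It assumes the default localhost:11434 endpoint, and the model name is just an example of something you've already pulled:

```typescript
// Minimal sketch: talk to a locally running Ollama server over plain HTTP.
// Assumes the default port (11434) and a model you've already pulled locally.
async function generate(model: string, prompt: string): Promise<string> {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, prompt, stream: false }),
  });
  if (!res.ok) throw new Error(`Ollama returned ${res.status}`);
  const data = await res.json();
  return data.response; // the generated text
}

generate("llama3", "Why is the sky blue?").then(console.log);
```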

It also means that you can swap models trivially, since you're essentially using the same API for each one. You never need to worry about dependency hell or the issues involved in hosting more than one model at a time.
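In practice the swap is just a different string in the same request body. Reusing the hypothetical `generate` helper sketched above (the model names here are only illustrative, use whatever you've pulled):

```typescript
// Same endpoint, same request shape -- only the model string changes.
const prompt = "Explain RAII in one sentence.";
for (const model of ["llama3", "mistral", "gemma"]) {
  generate(model, prompt).then((answer) => console.log(`${model}: ${answer}`));
}
```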

As far as I know, Ollama is really the only solution that does this. Or at the very least, it's the most mature.

The relationship between Ollama and llama.cpp is much closer than it might seem.

Ollama is llama.cpp with a nice little installer GUI and a nice little server binary.

llama.cpp has a server binary as well, but no nice installer GUI.

The only recent case where Ollama had a feature llama.cpp didn't was when they patched in SWA (sliding window attention) with Google; llama.cpp had it a couple of weeks later.

Ollama is significantly behind llama.cpp in important areas. For example, in the Gemma blog post they note they'll get to tool calls and multimodal support real soon now.

I don't care about llama.cpp, just like I don't care about V8 when I reach for Node. And I suspect many other people don't, either. Lots of folks don't want to integrate a library. They don't want to download a model or weights. They want to `ollama run foo` and move on with their lives. I don't need to worry about whether my binary was compiled with the right flags on my MacBook versus a Linux server with an Nvidia GPU, or about setting gpu-layers or num_ctx.

> Ollama is significantly behind llama.cpp in important areas. For example, in the Gemma blog post they note they'll get to tool calls and multimodal support real soon now.

If you don't use those things, you don't need to care. And if one model doesn't work, I'll just use another one that does.

And that's the thing, really. Most folks don't give a shit about getting the maximum performance. They're probably not even keeping their GPU busy all the time. They just need it to work consistently without having to worry about nonsense. llama.cpp simply isn't that tool.