I use podman compose to spin up an Open WebUI container and various llama.cpp containers, one for each model. Nothing fancy like a proxy or anything, just a direct connection. I also use the Continue extension inside VS Code, and I always use devcontainers when I'm working with any LLMs.
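Roughly it looks something like this (a minimal sketch, not my exact file -- service names, image tags, model paths, ports and the Open WebUI env vars are placeholders you'd adjust for your own setup):

```yaml
services:
  llama-qwen:
    image: localhost/llama-cpp-vulkan:latest   # custom Vulkan build, see below
    command: >
      llama-server -m /models/qwen2.5-coder-7b-q4_k_m.gguf
      --host 0.0.0.0 --port 8080 -ngl 99
    volumes:
      - ./models:/models:ro
    devices:
      - /dev/dri          # GPU render node exposed inside the podman machine
    ports:
      - "8081:8080"

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    environment:
      # point Open WebUI straight at the llama.cpp server's OpenAI-compatible API
      - OPENAI_API_BASE_URL=http://llama-qwen:8080/v1
      - OPENAI_API_KEY=none
    ports:
      - "3000:8080"
    depends_on:
      - llama-qwen
```

Each model just gets its own `llama-*` service block, and Open WebUI (or Continue) talks to whichever port that model is published on.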
I had to create a custom image of llama.cpp compiled with Vulkan so the LLMs can access the GPU on my MacBook Air M4 from inside the containers for inference. It's much faster, roughly 8-10x faster than CPU-only.
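The Containerfile is along these lines (again a sketch rather than my exact build -- the base image, package names and repo URL are assumptions, and the cmake flags may need tweaking for your llama.cpp version):

```dockerfile
FROM docker.io/library/ubuntu:24.04 AS build
RUN apt-get update && apt-get install -y \
    build-essential cmake git libvulkan-dev glslc
RUN git clone --depth 1 https://github.com/ggml-org/llama.cpp /src
WORKDIR /src
# GGML_VULKAN=ON enables the Vulkan backend; static libs keep the copy step simple,
# and LLAMA_CURL=OFF drops the curl dependency for a leaner runtime image
RUN cmake -B build -DGGML_VULKAN=ON -DBUILD_SHARED_LIBS=OFF \
      -DLLAMA_CURL=OFF -DCMAKE_BUILD_TYPE=Release \
    && cmake --build build --target llama-server -j"$(nproc)"

FROM docker.io/library/ubuntu:24.04
# Runtime needs the Vulkan loader plus a Mesa driver that can reach the host GPU
# through the podman machine, and libgomp for the OpenMP-built binary
RUN apt-get update && apt-get install -y \
      libvulkan1 mesa-vulkan-drivers libgomp1 \
    && rm -rf /var/lib/apt/lists/*
COPY --from=build /src/build/bin/llama-server /usr/local/bin/llama-server
ENTRYPOINT ["llama-server"]
```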
To be honest, so far I've been using mostly cloud models for coding; the local models haven't been that great.
Some more details on the blog: https://markjgsmith.com/posts/2025/10/12/just-use-llamacpp