For context, I'm feeling like I have a "free Sonnet" now that I've got Qwen3.6 35B running on my 5070ti at home (I connect to it via Tailscale). I run it _almost exactly_ like this Reddit post, which found a good way to squeeze the 35B model onto a GPU with 16GB of VRAM: https://www.reddit.com/r/LocalLLaMA/comments/1sor55y/rtx_507... I really like it: it's slightly more operationally complex up front (I had to write a script to start it), but now that I have it, I literally never have to change it. It's a folder with llama-server and the model .gguf in it; I run the script, it starts serving the model, done.
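For illustration, a start script like mine boils down to a single llama-server invocation. This is a hypothetical sketch, not the exact flags from the Reddit post — the model path, port, context size, and offload settings here are my assumptions:

```shell
#!/usr/bin/env sh
# Hypothetical llama-server launch script. Flag values are assumptions,
# not the exact settings from the Reddit post.
#
#   --host 0.0.0.0 : listen on all interfaces so Tailscale peers can connect
#   --port 8033    : matches the baseURL in the opencode config
#   -c 131072      : 128k context window
#   -ngl 99        : offload all layers to the GPU (the Reddit post's trick
#                    for fitting in 16GB likely also tunes MoE/tensor offload)
./llama-server \
  -m ./Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8033 \
  -c 131072 \
  -ngl 99
```

The point is that the whole "operational complexity" is one command in one folder.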
Like that post, I get 75 tokens/second. The exact model is Qwen3.6-35B-A3B-UD-Q4_K_M.gguf, and I get 128k of context.
I run it on my home machine and connect to it from anywhere over Tailscale. I connect through the opencode CLI, which I point at the server as a provider by adding the following to my `~/.config/opencode/opencode.json`:
{
  "provider": {
    "vllm": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "local-llm-qwen3.6-35B",
      "options": {
        "baseURL": "http://homepc.tail987654.ts.net:8033/v1"
      },
      "models": {
        "Qwen3.6-35B-A3B-UD-Q4_K_M.gguf": {
          "name": "Qwen3.6-35B"
        }
      }
    }
  }
}
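If you want to sanity-check the server before wiring up opencode, you can hit the OpenAI-compatible endpoints directly from any machine on the tailnet. A sketch, assuming the hostname/port from the config above and a running llama-server:

```shell
# List served models; the model id should be the .gguf filename
# referenced under "models" in opencode.json.
curl http://homepc.tail987654.ts.net:8033/v1/models

# Minimal chat completion against the same endpoint.
curl http://homepc.tail987654.ts.net:8033/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen3.6-35B-A3B-UD-Q4_K_M.gguf",
        "messages": [{"role": "user", "content": "hello"}]
      }'
```

If both return JSON, opencode should work with the config as-is.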