For context, I'm feeling like I have a "free Sonnet" now that I've got Qwen3.6 35B running on my 5070ti at home (I connect to it via Tailscale). I run it _almost exactly_ like this Reddit post, which found a good way to squeeze the 35B model onto a GPU with 16GB of VRAM: https://www.reddit.com/r/LocalLLaMA/comments/1sor55y/rtx_507... I really like it: it's slightly more operationally complex up front (I had to write a script to start it), but now that I have it, I literally never have to change it. It's a folder with llama-server and the model .gguf in it; I run the script, it starts serving the model, done.
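For illustration, a start script like mine boils down to a single llama-server invocation. This is a hypothetical sketch, not the exact flags from the Reddit post — the model path, port, context size, and offload settings here are my assumptions:

```shell
#!/usr/bin/env sh
# Hypothetical llama-server launch script. Flag values are assumptions,
# not the exact settings from the Reddit post.
#
#   --host 0.0.0.0 : listen on all interfaces so Tailscale peers can connect
#   --port 8033    : matches the baseURL in the opencode config
#   -c 131072      : 128k context window
#   -ngl 99        : offload all layers to the GPU (the Reddit post's trick
#                    for fitting in 16GB likely also tunes MoE/tensor offload)
./llama-server \
  -m ./Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8033 \
  -c 131072 \
  -ngl 99
```

The point is that the whole "operational complexity" is one command in one folder.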
Like that post, I get 75 tokens/second. The exact model is Qwen3.6-35B-A3B-UD-Q4_K_M.gguf, and I get 128k of context.
I run it on my home machine and connect to it from anywhere over Tailscale. I connect through the opencode CLI, which I point at the server as a provider by adding the following to my `~/.config/opencode/opencode.json`:
{
  "provider": {
    "vllm": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "local-llm-qwen3.6-35B",
      "options": {
        "baseURL": "http://homepc.tail987654.ts.net:8033/v1"
      },
      "models": {
        "Qwen3.6-35B-A3B-UD-Q4_K_M.gguf": {
          "name": "Qwen3.6-35B"
        }
      }
    }
  }
}
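If you want to sanity-check the server before wiring up opencode, you can hit the OpenAI-compatible endpoints directly from any machine on the tailnet. A sketch, assuming the hostname/port from the config above and a running llama-server:

```shell
# List served models; the model id should be the .gguf filename
# referenced under "models" in opencode.json.
curl http://homepc.tail987654.ts.net:8033/v1/models

# Minimal chat completion against the same endpoint.
curl http://homepc.tail987654.ts.net:8033/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen3.6-35B-A3B-UD-Q4_K_M.gguf",
        "messages": [{"role": "user", "content": "hello"}]
      }'
```

If both return JSON, opencode should work with the config as-is.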