I can see that and I don't know your setup, but people are pushing >70 t/s with MTP on a single 3090, and still >50 t/s with big contexts. 64k is not a lot for agentic coding, and IIRC 128k should be possible for you with turboquant and the like. r/LocalLLM/ and r/LocalLLaMA/ are worth a visit IMO.
EDIT: just found this recipe repo, may wanna give it a go: https://github.com/noonghunna/club-3090
EDIT-2: this can also shave off a lot of the context needed for tool calling -> https://github.com/rtk-ai/rtk
club-3090 with llama.cpp did it. Full 262k context, usable in oh-my-pi. Still testing, but initial results are promising.
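For reference, the launch ends up along these lines. This is a sketch, not the recipe verbatim: the model path, quant, and port are placeholders, and the club-3090 compose file is authoritative for the exact values.

```shell
# Sketch only -- model path and flag values are assumptions.
# -c 262144       -> full 262k context
# -fa             -> flash attention (required for a quantized V cache)
# --cache-type-*  -> q8_0 KV cache so the context fits in 24 GB
llama-server \
  -m /models/model-Q4_K_M.gguf \
  --mmproj /models/mmproj-F16.gguf \
  -c 262144 -ngl 99 -fa \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --host 0.0.0.0 --port 8080 --jinja
```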
I had to make a couple of adjustments, though. After downloading the model with hf, I needed to move mmproj-F16.gguf up to the parent folder.
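For anyone hitting the same thing, the move looks roughly like this. The paths are placeholders, not the actual download location:

```shell
# placeholder paths -- substitute wherever hf actually put the files;
# the point is just: mmproj-F16.gguf goes one level up, next to the main gguf
mv /models/the-model/subdir/mmproj-F16.gguf /models/the-model/
```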
Then, on starting the server, the container complained that llama-server wasn't a known binary, so I had to add PATH="/app:$PATH" to the entrypoint of the llama service. The only thing that's still missing is for llama to emit thinking blocks that oh-my-pi can parse, but it's running alright. That's mostly cosmetic.
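In compose terms, the entrypoint fix ended up as something like this. The service name and the wrapped binary invocation are assumptions about the recipe's compose file, not verbatim:

```yaml
# hypothetical override -- service name follows the recipe's compose file;
# the wrapper puts /app on PATH before exec'ing llama-server with the
# service's original command args
services:
  llama:
    entrypoint: ["/bin/sh", "-c", "PATH=/app:$PATH exec llama-server \"$@\"", "--"]
```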
I managed to run it with vllm successfully, but it breaks opencode on a simple "what's this repo?" task. On oh-my-pi it won't even execute, because omp sends multiple system prompts. I'll try llama.cpp later and see if it works more reliably.
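If anyone wants to check whether their backend chokes on the same thing, here is a minimal hand-rolled request with the multi-system-prompt shape. Endpoint and model name are placeholders; whether it errors depends on the model's chat template:

```shell
# two system messages in one request, the shape omp sends
# (placeholder endpoint and model name -- adjust to your server)
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "placeholder-model",
    "messages": [
      {"role": "system", "content": "first system prompt"},
      {"role": "system", "content": "second system prompt"},
      {"role": "user", "content": "what is this repo?"}
    ]
  }'
```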
will give more info in the post
EDIT: thanks for the links!