Hacker News

I'm also excited for local LLM's to be capable of assisting with nontrivial coding tasks, but we're far from reaching that point. VRAM remains a huge bottleneck for even a top-of-the-line gaming PC to run them. The best these days for agentic coding that get close to the vibe-check of frontier models seem to be Qwen3-Coder-480B-A35B-Instruct, DeepSeek-Coder-V2-236B, GLM 4.5, and GPT-OSS-120B. The latter being the only one capable of fitting on a 64 to 96GB VRAM machine with quantization.

Of course, the line will always be pushed back as frontier models incrementally improve, but the quality is night and day between these open models consumers can feasibly run versus even the cheaper frontier models.

That said, I too have no interest in this if local models aren't supported and hope that's down the pipeline just so I can try tinkering with it. Though it looks like it utilizes multiple models for various tasks (planner, programmer, reviewer, router, and summarizer) so that only adds to the difficulty of the VRAM bottleneck if you'd like to load different models per task. So I think it makes sense for them to focus on just Claude for now to prove the concept.

edit: I personally use Qwen3 Coder 30B 4bit for both autocomplete and talking to an agent, and switch to a frontier model for the agent when Qwen3 starts running in circles.

diggan 4 days ago [ - ]

> and GPT-OSS-120B. The latter being the only one capable of fitting on a 64 to 96GB VRAM machine with quantization.

Tiny correction: Even without quantization, you can run GPT-OSS-120B (with full context) on around ~60GB VRAM :)

prophesi 3 days ago [ - ]

Hm I don't think so. You might be thinking about the file size, which is ~64GB.

> Native MXFP4 quantization: The models are trained with native MXFP4 precision for the MoE layer, making gpt-oss-120b run on a single 80GB GPU (like NVIDIA H100 or AMD MI300X) and the gpt-oss-20b model run within 16GB of memory.

If you _could_ fit it within ~60GB VRAM, the variability of the amount of VRAM required for certain context lengths and prompt sizes would OOM pretty quickly.

edit: Ah and MXFP4 in itself is a quantization, just supposedly closer to the original FP16 than the rest with a smaller VRAM requirement.

diggan 3 days ago [ - ]

> Hm I don't think so. You might be thinking about the file size, which is ~64GB.

No, the numbers I put above is literally the VRAM usage I see when I load 120B with llama.cpp, it's a real-life number, not theoretical :)