Hacker News

I have the RAM, but not the VRAM. What kind of speed (tokens per second) could you expect from a 3090 with 24 GB of VRAM? I am somewhat tempted to pick up a GPU with 24 GB of VRAM.

ekidd 13 hours ago

A GPU with 24 GB of VRAM is mostly useful for running a very carefully squeezed Qwen3.6 27B (4-bit Unsloth quants, 8-bit K/V cache, possibly MTP, 128k context). This is a fun little model that's smart enough to do debugging, refactoring, and implementing "clean" specs that don't force it to make complicated design choices. I've seen it rip through a 9-year-old Terraform AWS config and (without using the network) correctly identify nearly everything that would need to be upgraded or migrated for modern AWS. But if I give it some poorly conceived spec with lurking design headaches, it goes on an endless thinking binge and ultimately fails.

Speed-wise, I don't have numbers, but it feels subjectively faster than Opus in Claude Code. YMMV.

Once you go above "a used 3090 at a decentish price", then I strongly recommend renting cloud GPUs or at least testing models using paid APIs. This allows testing your use case before spending piles of money.

phamilton 20 hours ago

Generation is basically just memory bandwidth math.

Each generated token has to read all the active weights. I think that's around 40B active parameters. At a 4-bit quant that's 20GB (40B parameters at half a byte each). With 100GB/s of memory bandwidth (replace with whatever your bandwidth is), you get 100 / 20 = 5 tokens per second.
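A minimal sketch of that arithmetic in Python, in case you want to plug in your own numbers. The function name and parameters are made up for illustration, and the result is only a rough upper bound: it assumes decoding is purely bandwidth-bound and ignores K/V cache reads and compute.

```python
def decode_tokens_per_second(active_params_billion, bits_per_weight, bandwidth_gb_per_s):
    """Rough upper bound on decode speed for a memory-bandwidth-bound model.

    Assumption: every generated token reads all active weights exactly once,
    and nothing else (K/V cache, activations) costs bandwidth.
    """
    # Billions of parameters * bytes per parameter = GB read per token.
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_per_s / gb_read_per_token


# The numbers from the comment above: 40B active, 4-bit quant, 100 GB/s.
print(decode_tokens_per_second(40, 4, 100))  # 5.0
```

Swap in your own memory bandwidth and the active parameter count of the model you care about.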

SlavikCA 16 hours ago

And with MTP (or other speculation techniques) you can ~double that.

phamilton 6 hours ago

MTP on a MoE is hit or miss. If you're bottlenecked on memory, MTP can increase the number of active experts (like any batch processing would), which can eat away gains from it.
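To put a rough number on that, here is a back-of-the-envelope sketch. It assumes each token picks its experts uniformly and independently, which real routers do not do (adjacent tokens tend to share experts, so the real overlap is usually better than this). The expert counts in the example are hypothetical, not taken from any particular model.

```python
def expected_unique_experts(num_experts, experts_per_token, tokens_in_batch):
    """Expected number of distinct experts touched in one layer by a batch.

    Assumption: each token independently picks `experts_per_token` distinct
    experts uniformly at random out of `num_experts`.
    """
    # Chance that one given expert is skipped by every token in the batch.
    p_never_picked = (1 - experts_per_token / num_experts) ** tokens_in_batch
    return num_experts * (1 - p_never_picked)


# Hypothetical MoE layer: 128 experts, 8 active per token.
print(expected_unique_experts(128, 8, 1))  # 8.0  (plain decoding)
print(expected_unique_experts(128, 8, 2))  # 15.5 (verifying one drafted token)
```

Under that pessimistic assumption, verifying a single drafted token nearly doubles the expert weights read per step, which is the effect that can cancel out the extra accepted token. Attention and any shared weights are still read only once per step, so the real cost grows more slowly than this.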