Groq doesn't even have any true deepseek models --- I thought they only had `deepseek-r1-distill-llama-70b` which was distilled onto llama 70b [1].

[1] https://console.groq.com/docs/models

Groq has a weak selection of models, which is frustrating because their inference speed is insane. I get it, though: narrow selection + heavy optimization = performance.

From a conversation with someone at Groq: they have a custom compiler and runtime so models can run on their custom hardware, which is why the selection is thin. For every new model architecture they first have to port it to their compiler.

They can't host DeepSeek because it's too big. Their chips have 230 MB of memory each, so it would take ~3,000 chips just to hold the model weights, plus a (possibly large) number of additional chips for the KV cache. I'd bet it's just too hard to bring such a topology online at all, let alone make it anywhere near profitable.
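
As a rough sanity check on that chip count (back-of-the-envelope only, assuming DeepSeek-V3/R1's ~671B parameters at 1 byte per weight and Groq's published 230 MB of SRAM per LPU, ignoring KV cache and activations entirely):

```python
# Back-of-the-envelope: how many Groq LPUs it would take just to hold
# DeepSeek-R1's weights in on-chip SRAM.
# Assumed numbers (not Groq-confirmed): 671B params, 1 byte/param (fp8),
# 230 MB of SRAM per chip, zero overhead for KV cache or activations.

params = 671e9          # DeepSeek-V3/R1 total parameter count
bytes_per_param = 1     # fp8 weights
sram_per_chip = 230e6   # bytes of SRAM per LPU

chips = params * bytes_per_param / sram_per_chip
print(f"~{chips:,.0f} chips for weights alone")   # ~2,917 chips
```

So ~3,000 chips is just the floor for the weights; the KV cache and any redundancy for serving would sit on top of that.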

the only reason they are fast is because the models they host are severely quantized, or so i've heard.

Huh. I heard a podcast with the founder talking about their custom hardware, but quantization would explain it.

Quantization alone does not explain it. It's mostly custom hardware[0].

[0] https://groq.com/the-groq-lpu-explained/

Why repeat this nonsense when it’s so trivial to just check? The reason Groq is fast is that they employ absolutely ludicrous amounts of SRAM, which is roughly 10 times faster than the fastest VRAM.
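
To make the SRAM argument concrete, here's a rough, bandwidth-bound estimate of single-batch decode speed. The figures are approximate public numbers, not measurements: Groq quotes on the order of 80 TB/s of on-die SRAM bandwidth per LPU, an H100's HBM3 is about 3.35 TB/s, and the 70B-at-fp8 model size is just illustrative.

```python
# Rough decode-speed ceiling for a memory-bandwidth-bound workload:
# every generated token has to stream all the weights past the compute units,
# so tokens/sec <= aggregate memory bandwidth / model size in bytes.
# Bandwidth figures are approximate vendor numbers, not benchmarks.

model_bytes = 70e9 * 1          # 70B params at fp8 (1 byte each)

sram_bw = 80e12                 # ~80 TB/s on-die SRAM per Groq LPU (vendor figure)
hbm_bw = 3.35e12                # ~3.35 TB/s HBM3 on one H100

for name, bw in [("SRAM (one LPU)", sram_bw), ("HBM3 (one H100)", hbm_bw)]:
    print(f"{name}: ~{bw / model_bytes:.0f} tokens/s upper bound")
```

In practice a 70B model doesn't fit in one LPU's 230 MB, so the weights get sharded across many chips and the aggregate SRAM bandwidth scales with chip count; the point is just that on-chip SRAM moves the memory-bandwidth ceiling by an order of magnitude or more, independent of any quantization.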

they responded to my tweet last year and said they didn't quantize the models.

It's very hard to find right now, but I'm pretty sure what they said was that they don't quantize the KV cache; the weights themselves are in fp8.