Hacker News

So for people wondering if it can be used to accelerate LLM inference, sadly not.

I've been trying to hit 100,000 tokens/s with a dumb 3.28M model, and even this is an order of magnitude too large to benefit.

It appears to be focussed more on latency than on throughput. Happy to be corrected?

ssivark 9 hours ago

When aiming for 100k tok/s, you would still have CUDA overheads (on the order of microseconds) -- which might become the bottleneck, even if you do everything else right with the inference architecture. How are you planning to overcome that?

EDIT: Oh, on second read, do you mean you're running the model on an FPGA?

taneq 8 hours ago

You might be conflating throughput with latency. 100k tok/s is very different to 1 tok/10us.
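To make the distinction concrete, here is a rough sketch of the arithmetic. It assumes (my assumption, not something stated upthread) that tokens are produced in batched decoding steps, one token per independent sequence per step; the batch sizes and function names are purely illustrative.

```python
def per_token_interval_us(throughput_tok_per_s: float) -> float:
    """Average time between emitted tokens, in microseconds."""
    return 1e6 / throughput_tok_per_s


def step_budget_us(throughput_tok_per_s: float, batch_size: int) -> float:
    """Time a single decoding step may take if each step emits
    `batch_size` tokens (one per sequence) at the given aggregate rate."""
    return batch_size * per_token_interval_us(throughput_tok_per_s)


target = 100_000  # tok/s, the figure from the thread

# Hypothetical batch sizes, chosen only to show the scaling.
for batch in (1, 64, 1024):
    budget = step_budget_us(target, batch)
    print(f"batch={batch:>4}: each step may take up to {budget:,.0f} us")
```

With a batch of 1, the whole forward pass has to fit in 10 us, which is where microsecond-scale launch overheads start to matter. With larger batches the same aggregate throughput leaves far more time per step, so per-call latency is much less of a constraint.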

ag2718 14 hours ago

You're correct that this work is not very applicable to LLMs and that the focus here is primarily on latency.

ai_fry_ur_brain 11 hours ago

Was anyone thinking this?