So would 40x RPi 5 get 130 token/s?
I imagine it might be limited by the number of layers, and you'll also hit diminishing returns at some point due to network latency.
It has to be 2^n nodes, and the node count is limited to at most one per attention head that the model has.
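A rough sketch of that constraint (assuming the limit really is the model's attention/KV head count; the head count of 32 below is only an illustrative value, not from the thread):

```python
def max_usable_nodes(available_nodes: int, n_attention_heads: int) -> int:
    """Largest power-of-two node count that fits both the available
    machines and the model's attention head count (sketch, assumed rule)."""
    n = 1
    while n * 2 <= min(available_nodes, n_attention_heads):
        n *= 2
    return n

# Hypothetical example: 40 Pis, 32 attention heads -> only 32 nodes usable.
print(max_usable_nodes(40, 32))  # 32
```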
Most likely not, because of NUMA bottlenecks.