So would 40x RPi 5 get 130 token/s?
I imagine it might be limited by the number of layers, and you'll also hit diminishing returns at some point due to network latency.
It has to be 2^n nodes, and the node count is limited to at most one per attention head that the model has.
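A rough sketch of that constraint (assuming the limit really is the model's attention/KV head count; the head count of 32 below is only an illustrative value, not from the thread):

```python
def max_usable_nodes(available_nodes: int, n_attention_heads: int) -> int:
    """Largest power-of-two node count that fits both the available
    machines and the model's attention head count (sketch, assumed rule)."""
    n = 1
    while n * 2 <= min(available_nodes, n_attention_heads):
        n *= 2
    return n

# Hypothetical example: 40 Pis, 32 attention heads -> only 32 nodes usable.
print(max_usable_nodes(40, 32))  # 32
```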
Most likely not, because of NUMA bottlenecks.