>It could run viably with SSD offload on Macs with very little memory
Not really. That's going to land you somewhere in the 0.2-0.5 tokens/second range: every generated token has to stream the active weights off disk, so you're bounded by NVMe read bandwidth. Lovely as modern NVMes are, they're not memory.
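For a sense of where that range comes from, here's a rough back-of-envelope sketch. The bandwidth, active-parameter count, and quantization figures are all assumed for illustration, not taken from any specific model:

```python
# Back-of-envelope decode speed for SSD weight offload (hypothetical numbers).
# Each generated token must stream the active weights from disk, so throughput
# is bounded by NVMe read bandwidth divided by bytes read per token.

nvme_bandwidth_gbps = 6.0   # assumed sustained NVMe read speed, GB/s
active_params = 20e9        # assumed active parameters per token (e.g. a big MoE)
bytes_per_param = 1.0       # assumed ~8-bit quantized weights

bytes_per_token = active_params * bytes_per_param
tokens_per_sec = (nvme_bandwidth_gbps * 1e9) / bytes_per_token
print(f"~{tokens_per_sec:.2f} tokens/s")  # ~0.30 tokens/s, i.e. in the 0.2-0.5 range
```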
You can run multiple inferences in parallel on the same set of weights; that's what batching is. Given enough parallelization it can be almost entirely compute-limited, at least for small contexts (apparently up to ~10GB per request, but that's at 1M tokens!).
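A minimal sketch of why batching changes the bottleneck, using a toy matmul with made-up sizes (a real transformer layer behaves the same way per weight matrix):

```python
import numpy as np

d_model, d_ff, batch = 4096, 16384, 32
W = np.random.randn(d_model, d_ff).astype(np.float32)  # shared weights

# Unbatched: B separate matvecs, so the weights are streamed from memory B times.
xs = [np.random.randn(d_model).astype(np.float32) for _ in range(batch)]
outs_serial = [x @ W for x in xs]

# Batched: one matmul serves all B requests, streaming the weights only once.
# Compute per request is identical, so with enough parallel requests the
# workload shifts from memory-bandwidth-bound toward compute-bound.
X = np.stack(xs)        # (batch, d_model)
outs_batched = X @ W    # (batch, d_ff)

assert np.allclose(np.stack(outs_serial), outs_batched, atol=1e-4)
```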
Yes, I think what this demonstrates, and what folks are missing, is that optimization for specific scenarios is now quite possible.