Streaming weights from RAM to GPU for prefill makes sense: prefill is batched, so each weight streamed over the bus is reused across many tokens, and PCIe 5.0 x16 is fast enough to make it worthwhile.
Streaming weights from RAM to GPU for decode makes no sense at all, because batching there would require multiple parallel streams.
Streaming weights from SSD _never_ makes sense: the bandwidth gap between SSD and RAM is too large. There is no realistic situation where a model doesn't fit in RAM but the SSD is still fast enough to give useful speeds.
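A quick back-of-envelope sketch of why (a minimal Python example; the 70GB model size and the link bandwidths are illustrative assumptions, not measurements): during decode, every generated token has to read all of the weights once, so whatever link the weights cross caps tokens/sec.

```python
# Decode throughput ceiling: each new token reads every weight once,
# so tokens/sec <= link bandwidth / model size.
# All figures below are rough assumptions for illustration.

MODEL_BYTES = 70e9  # hypothetical 70B-parameter model at 8-bit

links = {
    "GPU VRAM (~1 TB/s)": 1000e9,
    "PCIe 5.0 x16 (~64 GB/s)": 64e9,
    "consumer NVMe SSD (~7 GB/s)": 7e9,
}

for name, bandwidth in links.items():
    print(f"{name}: ~{bandwidth / MODEL_BYTES:.2f} tok/s ceiling")
```

That works out to roughly 14 tok/s from VRAM, under 1 tok/s streaming over PCIe, and about 0.1 tok/s from SSD, which is the delta being talked about.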
On Apple Silicon Macs the RAM is shared between CPU and GPU. While maybe not up to raw GPU VRAM speeds, it still manages over 450GB/s real-world on the M4 Max (the M4 Pro tops out lower), available wherever it is needed.
They are all still limited by the SSD, but Apple's SSDs can do over 17GB/s on the high-end models (the more typical ones are around 8GB/s).
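Plugging those Apple numbers into the same per-token ceiling (again assuming a hypothetical 70GB of weights read per decode step):

```python
# Same ceiling arithmetic applied to the Apple figures above:
# 450GB/s unified memory versus 8-17GB/s SSD.
# The 70GB model size is an assumption for illustration.

MODEL_BYTES = 70e9

for name, bandwidth in [("unified memory @ 450 GB/s", 450e9),
                        ("high-end SSD @ 17 GB/s", 17e9),
                        ("typical SSD @ 8 GB/s", 8e9)]:
    print(f"{name}: ~{bandwidth / MODEL_BYTES:.2f} tok/s ceiling")
```

So unified memory sustains a usable ~6 tok/s on a model that size, while even the fastest SSD stays well under 1 tok/s.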
There have been some very interesting experiments with streaming from SSD recently: https://simonwillison.net/2026/Mar/18/llm-in-a-flash/