Hacker News

The CD-ROM streaming approach is the real insight here: keeping only the activations and KV cache in RAM and streaming the weights one matrix at a time sidesteps the 32MB constraint entirely. It's essentially the same trick modern edge inference does with flash storage, just on hardware from 2000.

Curious about the latency profile. With CD-ROM read speeds around 1.6 MB/s on the PS2, it makes sense that the 77MB SmolLM2 model is too slow, but how does the 10MB brandon-tiny feel in practice? Are you getting a few tokens per minute, or closer to one token every few seconds?

Also interested in the custom PSNT format decision: was the main motivation the PS2's MIPS alignment constraints, or was there something about the existing GGUF/llama.c formats that made them impractical to parse on the Emotion Engine?
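
For a rough sense of the latency question, here is a back-of-envelope sketch. It assumes every weight byte is re-read from disc once per generated token, at a sustained 1.6 MB/s, with no caching and no seek overhead. Those are my assumptions, not measurements from the project, and `seconds_per_token` is just a hypothetical helper name. Under them the 10MB model lands at roughly 6 seconds per token and the 77MB one at roughly 48 seconds, so real numbers would be at best that, and likely worse once seeks are counted.

```python
def seconds_per_token(model_mb: float, read_mb_per_s: float) -> float:
    """Lower bound on per-token latency, assuming the full set of weights
    is streamed from disc once for every generated token."""
    return model_mb / read_mb_per_s


# Read speed quoted in the comment above; an assumption, not a measurement.
READ_MB_PER_S = 1.6

# Model sizes as quoted in the comment above.
MODELS = [("brandon-tiny", 10.0), ("SmolLM2", 77.0)]

for name, size_mb in MODELS:
    s = seconds_per_token(size_mb, READ_MB_PER_S)
    print(f"{name}: ~{s:.1f} s/token, ~{60 / s:.1f} tokens/min")
```

If some of the weights fit in RAM alongside the KV cache, the streamed portion shrinks and the estimate improves proportionally.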