Does it support paged attention like vLLM though? Without that they will run into memory fragmentation quickly.
Yes, great question!
The system started without paged attention, then automatically implemented its own paged attention once it identified memory fragmentation as a bottleneck.
Pretty cool!
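For anyone unfamiliar, the core idea behind vLLM-style paged attention can be sketched in a few lines: carve the KV cache into fixed-size blocks and map each sequence's tokens to physical blocks on demand, so freed blocks can be reused by any sequence and fragmentation stays bounded. This is a minimal illustrative sketch, not the system's actual implementation; all names (`BlockAllocator`, `Sequence`, `BLOCK_SIZE`) are hypothetical.

```python
BLOCK_SIZE = 16  # tokens per physical KV block (illustrative)

class BlockAllocator:
    """Pool of fixed-size physical KV-cache blocks."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # physical block ids

    def alloc(self):
        if not self.free:
            raise MemoryError("KV cache exhausted")
        return self.free.pop()

    def release(self, block_id):
        self.free.append(block_id)

class Sequence:
    """Maps a sequence's logical token positions to physical blocks."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the last one is full,
        # so a sequence never strands a large contiguous region.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

    def free(self):
        # On completion, every block goes back to the shared pool,
        # reusable by any other sequence regardless of layout.
        for block_id in self.block_table:
            self.allocator.release(block_id)
        self.block_table.clear()
        self.num_tokens = 0
```

Because allocation happens one block at a time and frees return blocks to a shared pool, no sequence needs a contiguous reservation, which is what prevents the fragmentation mentioned above.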