I'm super curious about those "Two Weird Tricks". I'd love for you to release more. It reminds me of MiniMax Sparse Attention: https://arxiv.org/html/2606.13392v1
Yeah, looks like fun stuff. You still need to preserve the entire KV cache though, right? So even if compute is drastically lower, memory keeps growing. The system I described keeps memory constant (well, if you keep the entire token history you are technically gaining one long of data per token generated, but I think we can agree that is negligible, and it could be capped at something high like 1B or so with no meaningful impact). I think I will probably release trick one and see if people then believe trick two even without seeing it.
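To put rough numbers on the "negligible" claim, here is a back-of-the-envelope sketch. The model shape (32 layers, 8 KV heads, head dimension 128, 2-byte values) is a made-up example and not any particular model, and I'm assuming a "long" is 8 bytes and that the 1B cap counts tokens.

```python
# Hypothetical model shape, chosen only for illustration (not from this thread).
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # assumes fp16/bf16 cache entries

BYTES_PER_TOKEN_ID = 8  # assumes one 64-bit "long" per stored token


def kv_cache_bytes(num_tokens: int) -> int:
    """Bytes for a full KV cache: one key and one value vector per layer and KV head."""
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * num_tokens


def token_history_bytes(num_tokens: int) -> int:
    """Bytes needed to keep only the raw token IDs."""
    return BYTES_PER_TOKEN_ID * num_tokens


if __name__ == "__main__":
    for n in (1_000_000, 1_000_000_000):
        kv_gb = kv_cache_bytes(n) / 1e9
        hist_gb = token_history_bytes(n) / 1e9
        print(f"{n:>13,} tokens: KV cache {kv_gb:,.1f} GB, token history {hist_gb:,.3f} GB")
```

Under those assumptions the full KV cache costs 128 KiB per token versus 8 bytes for the token ID, so the token history is about 16,000 times smaller. Even at a 1B-token cap it would come to roughly 8 GB.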