Yeah, looks like fun stuff. You still need to preserve the entire KV cache though, right? So even if compute is drastically lower, memory keeps growing. The system I described keeps memory constant (well, if you keep the entire token history you are technically gaining one long of data per token generated, but I think we can agree that is negligible and could be capped at something high like 1B or so with no meaningful impact). I think I will probably release trick one and see if people then believe trick two even without seeing it.
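
To put rough numbers on the memory point, here is a minimal sketch of the arithmetic. The model dimensions are hypothetical (32 layers, 32 KV heads, head size 128, fp16), not taken from any system discussed here, and it reads "one long" as one 8-byte integer token ID and "1B" as one billion tokens, both of which are assumptions.

```python
# Back-of-the-envelope comparison: standard KV cache vs. a plain token history.
# All model dimensions below are hypothetical, chosen only for illustration.

def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=32,
                   head_dim=128, bytes_per_value=2):
    """Bytes needed to cache keys and values for every layer of a transformer."""
    # Factor of 2 covers one key vector and one value vector per head per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token


def token_history_bytes(num_tokens, bytes_per_token_id=8):
    """Bytes needed to store raw token IDs, assuming one 8-byte integer each."""
    return num_tokens * bytes_per_token_id


if __name__ == "__main__":
    for n in (1_000, 1_000_000, 1_000_000_000):
        kv = kv_cache_bytes(n)
        hist = token_history_bytes(n)
        print(f"{n:>13,} tokens: KV cache {kv / 1e9:,.3f} GB, "
              f"token history {hist / 1e9:,.6f} GB, ratio {kv // hist:,}x")
```

Under those assumptions the KV cache costs 512 KiB per token against 8 bytes per token for the history, so both grow linearly but the history is tens of thousands of times smaller.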