Would loading the KV cache from disk actually be faster than just recomputing it?
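A rough way to sanity-check this is a back-of-envelope comparison. The sketch below is only an estimate: the model shape (32 layers, 32 KV heads, head dim 128, fp16, about 7B parameters), the disk bandwidth, and the effective GPU throughput are all made-up illustrative numbers, and the prefill cost uses the common "about 2 FLOPs per parameter per token" approximation, which ignores the attention term that grows with context length.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Size of the KV cache: keys + values for every layer, head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


def load_seconds(n_bytes, disk_bytes_per_s):
    """Time to stream the cache from disk at a sustained bandwidth."""
    return n_bytes / disk_bytes_per_s


def prefill_seconds(n_tokens, n_params, flops_per_s):
    """Approximate recompute time: ~2 FLOPs per parameter per token.

    Ignores attention's quadratic term, so it underestimates long contexts.
    """
    return 2 * n_params * n_tokens / flops_per_s


if __name__ == "__main__":
    # All values below are assumptions for illustration, not measurements.
    n_tokens = 4096
    size = kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=32, head_dim=128)
    t_load = load_seconds(size, disk_bytes_per_s=3e9)        # assumed NVMe read rate
    t_recompute = prefill_seconds(n_tokens, n_params=7e9,
                                  flops_per_s=1e14)          # assumed effective FLOP/s
    print(f"cache size: {size / 2**30:.2f} GiB")
    print(f"load:       {t_load:.2f} s")
    print(f"recompute:  {t_recompute:.2f} s")
```

With these particular assumptions the two come out in the same ballpark, so which one wins probably depends on the disk, the GPU, the context length, and whether the model uses grouped-query attention (fewer KV heads means a smaller cache to load).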
IMO the discontinuous-segments bit wouldn't work as-is because of the causal dependence in transformers (past the first layer, each token's K/V depends on everything before it) plus RoPE, as you mention, but maybe it could be made to work.
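On the RoPE half specifically, here is a toy sketch of why that part might be recoverable. It treats a single 2-D feature pair as a complex number and rotates it by `pos * theta`; the `rope` and `score` helpers and the `theta` value are hypothetical, not taken from any library. It only illustrates the position encoding and does nothing about the causal dependence, which is the harder problem.

```python
import cmath


def rope(x: complex, pos: int, theta: float = 0.1) -> complex:
    """Rotate one 2-D feature pair (as a complex number) by pos * theta."""
    return x * cmath.exp(1j * pos * theta)


def score(q: complex, k: complex) -> float:
    """Dot product of the two 2-D vectors represented by q and k."""
    return (q * k.conjugate()).real


q, k = complex(0.3, -1.2), complex(0.7, 0.5)

# The attention score depends only on the relative offset (6 in both cases).
near = score(rope(q, 10), rope(k, 4))
far = score(rope(q, 110), rope(k, 104))
assert abs(near - far) < 1e-9

# A key cached at position 4 can be moved to position 104 with one extra
# rotation, without going back to the unrotated key.
cached = rope(k, 4)
shifted = cached * cmath.exp(1j * 100 * 0.1)
assert abs(shifted - rope(k, 104)) < 1e-9
```

So a cached segment's keys could in principle be re-rotated to a new position, but the cached values would still reflect whatever prefix the segment originally saw.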