You still need to hold the model in memory. If you have, for example, 16 GB of RAM, the gains aren't that large.

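The model weights aren't what consumes the most memory at scale. The KV caches are per-user, so their total grows with the number of concurrent users while the weights are loaded only once.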
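
To put rough numbers on the per-user point, here is a back-of-the-envelope sketch. The model shape (32 layers, 32 KV heads, head dimension 128, fp16, about 7B parameters) and the 4096-token context are assumptions chosen for illustration, not figures from this thread, and the helper name `kv_cache_bytes` is made up.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Approximate KV cache size for ONE sequence (one user), in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model shape and context length (assumptions, see above).
NUM_LAYERS = 32
NUM_KV_HEADS = 32
HEAD_DIM = 128
SEQ_LEN = 4096

# Weights are shared by all users: ~7B parameters at 2 bytes each (fp16).
WEIGHT_BYTES = 7_000_000_000 * 2

per_user = kv_cache_bytes(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, SEQ_LEN)
print(f"weights: {WEIGHT_BYTES / 2**30:.1f} GiB (loaded once)")
print(f"KV cache per user: {per_user / 2**30:.1f} GiB")

for users in (1, 8, 64):
    total = users * per_user
    print(f"{users:>3} users -> {total / 2**30:.1f} GiB of KV cache")
```

Under these assumptions the cache is about 2 GiB per user at full context, so it overtakes the weights somewhere below ten concurrent users. Models that use grouped-query attention have fewer KV heads and a proportionally smaller cache, so the crossover point moves, but the per-user scaling stays.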