Activations would still require gigabytes for even a few KB of context.
There are plenty of techniques to optimise, but the question is what an RTX 3080 can train before OOM. The answer is: not that much.
It can barely do quantized fine-tuning, and even then only with a small context.
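For a rough sense of why a 10 GB card runs out: naive mixed-precision Adam training keeps fp16 weights and gradients plus fp32 master weights and two fp32 optimizer moments per parameter. A back-of-envelope sketch (the model sizes are illustrative assumptions, and activations aren't even counted yet):

```python
# Back-of-envelope memory for naive mixed-precision Adam training.
# Numbers are illustrative assumptions, not measurements.
GB = 1024**3

def training_mem_gb(n_params: float) -> float:
    weights = 2 * n_params   # fp16 weights
    grads   = 2 * n_params   # fp16 gradients
    master  = 4 * n_params   # fp32 master copy of weights
    adam    = 8 * n_params   # fp32 Adam first and second moments
    return (weights + grads + master + adam) / GB

print(round(training_mem_gb(7e9), 1))    # ~104.3 GB for a 7B model, vs 10 GB on a 3080
print(round(training_mem_gb(125e6), 1))  # even ~125M params: ~1.9 GB before activations
```

Quantized fine-tuning (e.g. 4-bit base weights with low-rank adapters) attacks the weight and optimizer terms, which is why it's about the only thing that fits, but the activation term still grows with context length.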
> Activations would still require gigabytes for even a few KB of context.
For that you use activation checkpointing, and you can also offload activations to the CPU in a way that hides the latency. Although, yes, for long-context training the activations do dominate memory usage (and quantizing them degrades quality more than quantizing just the weights and/or optimizer states).
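A minimal PyTorch sketch of activation checkpointing, using `torch.utils.checkpoint` (the toy model here is a made-up example, not anyone's actual setup): checkpointed blocks drop their intermediate activations in the forward pass and recompute them during backward, trading extra compute for memory.

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    """Toy residual MLP block standing in for a transformer layer."""
    def __init__(self, dim):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim),
            torch.nn.GELU(),
            torch.nn.Linear(4 * dim, dim),
        )

    def forward(self, x):
        return x + self.ff(x)

class Net(torch.nn.Module):
    def __init__(self, dim=64, depth=4, use_ckpt=True):
        super().__init__()
        self.blocks = torch.nn.ModuleList(Block(dim) for _ in range(depth))
        self.use_ckpt = use_ckpt

    def forward(self, x):
        for block in self.blocks:
            if self.use_ckpt and self.training:
                # Don't keep this block's intermediate activations;
                # recompute them during backward instead.
                x = checkpoint(block, x, use_reentrant=False)
            else:
                x = block(x)
        return x

net = Net()
x = torch.randn(2, 16, 64)           # (batch, seq, dim)
loss = net(x).pow(2).mean()
loss.backward()                       # triggers the recomputation
```

Without checkpointing, each block would hold its 4x-expanded hidden activation alive until backward; with it, only the block boundaries are stored, so peak activation memory scales with one block rather than the full depth.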