For 1T Q4 - 1 token generated per every ~500GB memory read. So you'll need something like ~10TB/s memory for 20t/s. This is 8x5090 speed area and 16x5090 size area. HBM4 will bring us close to something really possible in home lab, but it will cost fortune for early adopters.
Speculative decoding/DFlash will help with it, but YMMV.
Edit: Missed a part that this is A32B MoE, which means it drastically reduces amount of reads needed. Seems 20 t/s should be doable with 1TB/s memory (like 3090)