3090 and 2x3090 setups are quite popular. But if you use a gigantic (for local models) context of 200k, things go south pretty quickly: any quantization of the KV cache quickly becomes an issue.
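To put rough numbers on why people quantize the cache at this context length, here is a back-of-the-envelope sketch. The model shape (48 layers, 8 KV heads, head dimension 128) is a made-up example, not the config of any particular Qwen release, and the formula assumes plain full attention in every layer and ignores the per-block scale overhead that real quantized formats carry.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bits_per_elem):
    """Approximate KV cache size in bytes for a full-attention transformer."""
    # Factor of 2: one tensor for keys and one for values, per layer.
    elems = 2 * n_layers * n_kv_heads * head_dim * context_len
    return elems * bits_per_elem // 8


GIB = 1024 ** 3

# Hypothetical model shape, chosen only for illustration.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 48, 8, 128
CONTEXT = 200_000

for label, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
    size = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, CONTEXT, bits)
    print(f"{label:>5}: {size / GIB:.1f} GiB")
```

With those assumed dimensions, the fp16 cache alone comes out larger than the 24 GB on a single 3090, before any weights are loaded, so it's easy to see why people reach for cache quantization.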

I think it's quite telling that Georgi replied that he uses Qwen with 131k context.

https://x.com/ggerganov/status/2067539416436867230?s=20

We also use it with 200-256k (native) context length.

The issue could be that folks that don't see looping aren't pushing the model as hard, or as enthusiastically.

We also had far fewer issues with thinking turned off than with a reasoning budget capped at 2048.

Some fine-tunes like Qwopus-Coder just seem prone to looping - google it, you'll see plenty of reports, even on Reddit.

For what it's worth, we've seen the RTX 6000 Pro loop even with an fp16 KV cache, and with vLLM.
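If you want to compare setups on something firmer than impressions, looping can be flagged mechanically. Below is a minimal sketch, not something from llama.cpp or vLLM: `is_looping` is a hypothetical helper that checks whether the tail of an output is one short block repeated back to back, and the thresholds are arbitrary assumptions you would need to tune.

```python
def is_looping(tokens, max_period=64, min_repeats=4):
    """Return True if the end of `tokens` is a block of length <= max_period
    repeated at least `min_repeats` times in a row."""
    n = len(tokens)
    for period in range(1, max_period + 1):
        span = period * min_repeats
        if span > n:
            break  # not enough output to hold this many repeats
        tail = tokens[n - span:]
        block = tail[:period]
        # Every position in the tail must match the block, cycled.
        if all(tail[i] == block[i % period] for i in range(span)):
            return True
    return False


# Usage: works on token ids or on whitespace-split words.
words = "the fix is to retry the call".split() + ["wait,", "actually"] * 6
print(is_looping(words))
```

This only catches exact repetition at the very end of the output; near-duplicate loops with small variations would need a fuzzier comparison.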