Basic conversations are essentially RP, I suppose. Have a look at the KoboldCPP or SillyTavern subreddits.

I've been trying Patricide Unslop Mell and some of the Qwen models recently. Up to a point, more parameters beats worrying about quantization, but eventually you'll hit a compute (or VRAM) wall with high parameter counts.
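To see why a bigger model at a lower quant can still fit, the back-of-the-envelope math is just parameters times bits per weight. A rough sketch (the bits-per-weight averages for GGUF quants are my assumptions, not exact figures, since real quants mix tensor types):

```python
# Rough VRAM needed just for the weights: params * bits-per-weight / 8.
# The bpw figures used below (~4.8 for Q4_K_M, ~8.5 for Q8_0) are
# approximate averages (assumption), and this ignores context/overhead.
def weight_gib(params_b: float, bpw: float) -> float:
    """Approximate weight footprint in GiB for a params_b-billion model."""
    return params_b * 1e9 * bpw / 8 / 2**30

# A 12B model at ~4.8 bpw vs a 7B model at ~8.5 bpw land in the same ballpark:
print(f"12B @ ~4.8 bpw: {weight_gib(12, 4.8):.1f} GiB")  # ~6.7 GiB
print(f" 7B @ ~8.5 bpw: {weight_gib(7, 8.5):.1f} GiB")   # ~6.9 GiB
```

So for similar VRAM you can often run the 12B more aggressively quantized instead of the 7B at Q8, which is the "more params beats quant, up to a point" tradeoff.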

KV cache quantization is awesome (I run a 32k context at q4 on a 1080 Ti!), and context shifting is also great for long conversations/stories/games. I was using Ooba, but recently found that KoboldCPP not only runs faster for the same model and settings, but its context shifting also works much more consistently than Ooba's "streaming_llm" option, which almost always re-evaluates the whole prompt when hooked up to a frontend like ST.
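The 32k-on-a-1080-Ti claim checks out on paper: per-token KV cache is 2 (K and V) x layers x KV heads x head dim x bytes per element. A quick sketch (the layer/head numbers are my assumption for a Mistral Nemo-class 12B; check your model's metadata, and q4 is treated as a flat 4 bits, ignoring block overhead):

```python
# KV cache size = 2 (K and V) * n_layers * n_kv_heads * head_dim
#                 * context_len * bytes_per_element.
# Shape numbers below (40 layers, 8 KV heads, head_dim 128) are an
# assumption for a Nemo-class 12B with GQA; q4 ~ 0.5 bytes/elem,
# ignoring quantization block overhead.
def kv_cache_gib(ctx: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_elem: float) -> float:
    """Approximate KV cache footprint in GiB for a given context length."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 2**30

ctx = 32768
f16 = kv_cache_gib(ctx, 40, 8, 128, 2.0)  # full precision cache
q4 = kv_cache_gib(ctx, 40, 8, 128, 0.5)   # q4 cache
print(f"f16: {f16:.2f} GiB, q4: {q4:.2f} GiB")  # f16: 5.00 GiB, q4: 1.25 GiB
```

Roughly 5 GiB of cache at f16 drops to about 1.25 GiB at q4, which is the difference between fitting and not fitting a 32k context alongside the weights on an 11 GB card.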