I squeezed it into 24 GiB of VRAM (I have an RX 7900 XTX):
-- Q5_K_M Unsloth quantization on Linux llama.cpp
-- 81k context, flash attention on, 8-bit K/V cache
-- pp 625 t/s, tg 30 t/s
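For reference, a llama-server invocation along these lines should reproduce the setup. The model filename is a placeholder, and exact flag spellings can vary between llama.cpp builds (e.g. newer ones take `--flash-attn on`), so treat this as a sketch:

```shell
# Sketch only: model path is a placeholder, flags per current llama.cpp builds.
./llama-server \
  --model model-Q5_K_M.gguf \   # Q5_K_M Unsloth quant
  --n-gpu-layers 99 \           # offload all layers to the 7900 XTX
  --ctx-size 81920 \            # ~81k context
  --flash-attn \                # flash attention on
  --cache-type-k q8_0 \         # 8-bit K cache
  --cache-type-v q8_0           # 8-bit V cache
```

The 8-bit K/V cache quantization roughly halves KV-cache memory versus f16, which is what makes the 81k context fit alongside the weights in 24 GiB.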