Getting ~150 tok/s on an empty context with a 24 GB 7900 XTX via llama.cpp's Vulkan backend.
Again, you're using some 3rd party quantisations, not the weights supplied by Nvidia (which don't fit in 24GB).