Is it just me or is Nvidia trolling hard by calling a model with 30b parameters "nano"? With a bit of context, it doesn't even fit on a RTX 5090.
Other LLMs with the "nano" moniker are around 1b parameters or less.
FWIW it runs just fine on my AMD 9060 XT (16 GB), without any tweaks. It's very usable. I asked it to write a prime sieve in C#; it started responding in 0.38 seconds and wrote an implementation at ~20 tokens/sec.
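For reference, a prime sieve is a common small benchmark prompt. This is a minimal sketch of the kind of implementation being asked for (in Python here; the commenter's version was C#, and the actual model output is not shown):

```python
def prime_sieve(n):
    """Sieve of Eratosthenes: return all primes <= n."""
    if n < 2:
        return []
    is_prime = [True] * (n + 1)
    is_prime[0] = is_prime[1] = False
    for p in range(2, int(n ** 0.5) + 1):
        if is_prime[p]:
            # Mark every multiple of p starting at p*p as composite
            for multiple in range(p * p, n + 1, p):
                is_prime[multiple] = False
    return [i for i, flag in enumerate(is_prime) if flag]
```

At ~20 tokens/sec, a function of this size takes on the order of ten seconds to generate.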
But you're using a 3rd party quant of unknown quality. Nvidia is only providing weights as BF16 and FP8.
Getting ~150 tok/s on an empty context with a 24 GB 7900 XTX via llama.cpp's Vulkan backend.
Again, you're using some 3rd party quantisations, not the weights supplied by Nvidia (which don't fit in 24GB).
I wonder how much performance is left on a 12 GB VRAM GPU once a local Ollama instance has to spill this huge "nano" model into system RAM.
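A back-of-envelope sketch of why that's painful, assuming Nvidia's FP8 weights (1 byte/param for a 30B model) and ignoring KV cache and activations, which make it worse:

```python
# Rough estimate: how much of a 30B FP8 model fits in 12 GB of VRAM?
# Assumptions (hypothetical round numbers): 30e9 params, 1 byte each,
# uniform layer sizes; KV cache and activations are ignored.
params = 30e9
bytes_per_param = 1.0                      # FP8
weight_gb = params * bytes_per_param / 1e9  # ~30 GB of weights alone

vram_gb = 12
fraction_on_gpu = vram_gb / weight_gb       # share of weights resident in VRAM

print(f"Weights: {weight_gb:.0f} GB; resident in VRAM: {fraction_on_gpu:.0%}")
```

With well under half the weights resident, every decode step streams the remainder from system RAM, so throughput is bounded by host memory and PCIe bandwidth rather than the GPU.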
https://github.com/jameschrisa/Ollama_Tuning_Guide/blob/main...