My biggest pet peeve with all these articles on local AI is that the only thing they talk about is tokens per second. No one mentions the quality of the answers. No one. I don't mind waiting a little longer if the quality is better. Quickly serving me slop doesn't make it more useful. Are people really only looking at tokens per second?
The model already has its own quality benchmarks elsewhere. The article is just about running the model on X hardware, so the remaining question is then how fast it is. Or does the output quality somehow depend on the hardware too?
The quality is obviously much worse, but it is still useful as a reference if you generally know what you are doing.
It solves the "I'm coding on the plane and need to look up this thing I've forgotten" problem, for me at least.
A local model as such will give you "autocomplete on steroids", but it is not going to run off and implement a cross-project feature the way a frontier model in, let's say, Cursor would.
So there is no value in testing quality of answers, but there is value in testing token speed.
You just have to have correct expectations.
Is autocomplete using LLMs really useful? Even with frontier models I found it to be right only about 50% of the time. I turned it off and prefer to use IntelliJ's built-in autocomplete, which is way more reliable.
For me, local models are all about quality and how to achieve it, e.g. by providing guardrails that test the job done.
That's fair. There are even many dimensions to define 'quality' which include use case (coding? writing? multimedia?) and prompt. I suppose if you ask testers to provide benchmarks with their analysis, that might hamper their desire to share.