This (+ llmfit) are great attempts, but I've been frustrated by how hard it is to find any guidance on what I'd expect to be the most straightforward/common question:
"What is the highest-quality model that I can run on my hardware, with tok/s greater than <x>, and context limit greater than <y>"
(My personal approach has just devolved into guess-and-check, which is time consuming.) When using TFA/llmfit, I am immediately skeptical because I already know that Qwen 3.5 27B Q6 @ 100k context works great on my machine, but it's buried behind relatively obsolete suggestions like the Qwen 2.5 series.
I'm assuming this is because the tok/s is much higher, but I don't really get much marginal utility out of tok/s speeds beyond ~50 t/s, and there's no way to sort results by quality.
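As a first-pass answer to the "what fits on my hardware" part of that question, here's a rough sketch of the arithmetic such a tool presumably does under the hood: weight memory for a quantized model plus the KV cache at a given context length. All of the shapes and numbers below (effective bits per weight, layer/head counts) are illustrative assumptions, not figures from TFA or any real model card.

```python
def model_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight memory for a quantized model."""
    return n_params * bits_per_weight / 8

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV cache: two tensors (K and V) per layer, fp16 elements."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

def fits(vram_gb: float, n_params: float, bits: float,
         n_layers: int, n_kv_heads: int, head_dim: int,
         context_len: int) -> bool:
    """Does weights + KV cache fit in the given memory budget?"""
    total = model_bytes(n_params, bits) + kv_cache_bytes(
        n_layers, n_kv_heads, head_dim, context_len)
    return total <= vram_gb * 1024**3

# Hypothetical 27B model at Q6 (~6.5 effective bits/weight with quant
# overhead), 100k context, GQA with 8 KV heads of dim 128, 62 layers.
print(fits(48, 27e9, 6.5, 62, 8, 128, 100_000))  # → True
print(fits(24, 27e9, 6.5, 62, 8, 128, 100_000))  # → False
```

The tok/s side of the question is much harder to estimate statically, which is probably why these tools lean on it as the ranking signal rather than quality.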
What $/Mtok would make you trade your time for the savings of running stuff locally?
Just to be clear: that may sound like a snarky comment, but I'm genuinely curious how you or others see it. There are some long-running batch tasks where, ignoring electricity, local is kind of free, but usually local generation is slower (and worse quality), and we all kind of want to actually get stuff done.
Or is it not about the cost at all, just about not pushing your data into the clouds.
Good question. I agree with what I think you're implying, which is that local generation is not the right choice if you want to maximize results per time/$ spent. In my experience, hosted models like Claude Opus 4.6 are just so effective that it's hard to justify using much else.
Nevertheless, I spend a lot of time with local models because of:
1. Pure engineering/academic curiosity. It's a blast to experiment with low-level settings/finetunes/LoRAs/etc. (I have a Cog Sci/ML/software eng background.)
2. I prefer not to share my data with 3rd party services, and it's also nice to not have to worry too much about accidentally pasting sensitive data into prompts (like personal health notes), or if I'm wasting $ with silly experiments, or if I'm accidentally poisoning some stateful cross-session 'memories' linked to an account.
3. It's nice to be able to solve simple tasks without having to reason about any external 'side-effects' outside my machine.
For me it's a combination of privacy and wanting to be able to experiment as much as I want without limits. I'd happily take something that is 80% as good as SOTA but I can run it locally 24/7. I don't think there's anything out there yet that would 100% obviate my desire to at least occasionally fall back to e.g. Claude, but I think most of it could be done locally if I had infinite tokens to throw at it.
i can think of some tasks (classification, structured info extraction) that i _imagine_ even small meh models could do quite well at
on data i would never ever want to upload to any vendor if i can avoid it
Too generic a question. Gotta be more specific:
Specific models & sizes for specific use cases on specific hardware at specific speeds. It's a hard problem. I've been working on it for the better part of a year.
Well, granted, my project is trying to do this in a way that works across multiple devices and supports multiple models to find the best "quality" and the best allocation, which makes the search space exponential.
But “quality” is the hard part. In this case I’m just choosing the largest quants.
Supporting all the various devices does sound quite challenging.
I wouldn't expect a perfect single measurement of "quality" to exist, but it seems like it could be approximated enough to at least be directionally useful. (e.g. comparing subsequent releases of the same model family)
LLMs are just special-purpose calculators, as opposed to normal calculators, which only do numbers and MUST be accurate. There aren't very good ways of measuring what you want, because the people making the models can't read your mind and have different goals.