If you want to run LLMs locally then the localllama community is your friend: https://old.reddit.com/r/LocalLLaMA/
In general there's no "best" LLM; all of them will have some strengths and weaknesses. There are a bunch of good picks; for example:
> DeepSeek-R1-0528-Qwen3-8B - https://huggingface.co/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B
Released today; probably the best reasoning model in 8B size.
> Qwen3 - https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2...
Recently released. Hybrid thinking/non-thinking models with really great performance and a plethora of sizes for every kind of hardware. The Qwen3-30B-A3B can even run on a CPU at acceptable speeds. Even the tiny 0.6B one is somewhat coherent, which is crazy.
Yes, at this point it's starting to become almost a matter of how much you like the model's personality, since they're all fairly decent. OP just has to start downloading and trying them out. With 16GB one can do partial DDR5 offloading with llama.cpp and run anything up to about 30B (even dense), or even more, at a "reasonable" speed for chat purposes, especially with tensor offload.
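If you just want something running, here's a minimal sketch using the llama-cpp-python bindings; the GGUF filename and the layer/context numbers are placeholders you'd tune for your own hardware:

    # pip install llama-cpp-python
    from llama_cpp import Llama

    # Partial offload: n_gpu_layers puts that many layers in VRAM, the rest
    # stay in system RAM. -1 would try to offload every layer.
    llm = Llama(
        model_path="./Qwen3-30B-A3B-Q4_K_M.gguf",  # placeholder filename
        n_gpu_layers=24,  # raise until you run out of VRAM
        n_ctx=8192,       # context window, in tokens
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Hello! Who are you?"}],
        max_tokens=128,
    )
    print(out["choices"][0]["message"]["content"])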
I wouldn't count Qwen as that much of a conversationalist, though. Mistral Nemo and Small are pretty decent. All of the Llama 3.x models are still very good even by today's standards. The Gemma 3s are great but a bit unhinged. And of course QwQ when you need GPT-4 at home. And probably lots of others I'm forgetting.
There was this great post the other day [1] showing that with llama.cpp you can offload some specific tensors to the CPU and maintain good performance. That's a good way to use large(ish) models on commodity hardware.
Normally with llama.cpp you specify how many (full) layers you want to put on the GPU (-ngl). But CPU-offloading specific tensors that don't require heavy computation saves GPU space without affecting speed that much.
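As a rough sketch, an invocation in the spirit of [1] looks something like this (composing the llama.cpp server command from Python; the tensor-name regex and values are illustrative, not a recipe):

    import subprocess

    # -ngl asks for all layers on the GPU; --override-tensor (-ot) then forces
    # tensors whose names match the regex to live in CPU RAM instead.
    # The pattern below (MoE expert FFN weights) is illustrative; see [1].
    subprocess.run([
        "./llama-server",
        "-m", "model.gguf",            # placeholder model path
        "-ngl", "99",
        "-ot", r"ffn_.*_exps\.=CPU",
        "-c", "8192",
    ])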
I've also read a paper on preloading only "hot" neurons onto the GPU [2]. The future of home AI looks so cool!
[1] https://www.reddit.com/r/LocalLLaMA/comments/1ki7tg7/dont_of...
[2] https://arxiv.org/abs/2312.12456
> DeepSeek-R1-0528-Qwen3-8B https://huggingface.co/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B ... Released today; probably the best reasoning model in 8B size.
Wild how effective distillation is turning out to be. No wonder most shops have begun to "hide" CoT now: https://news.ycombinator.com/item?id=41525201
> Beyond its improved reasoning capabilities, this version also offers a reduced hallucination rate, enhanced support for function calling, and better experience for vibe coding.
Thank you for thinking of the vibe coders.
> If you want to run LLMs locally then the localllama community is your friend: https://old.reddit.com/r/LocalLLaMA/
For folks new to reddit, it's worth noting that LocalLlama, just like the rest of the internet but especially reddit, is filled with misinformed people spreading incorrect "facts" as truth, and you really can't use the upvote/downvote count as an indicator of quality or how truthful something is there.
Something that is more accurate but put in a boring way will often be downvoted, while straight-up incorrect but funny/emotional/"fitting the groupthink" comments usually get upvoted.
For those of us who've spent a lot of time on the web, this sort of bullshit detector is basically built-in at this point, but if you're new to places where the groupthink is as heavy as on reddit, it's worth being careful about taking anything at face value.
This is entirely why I can't bring myself to use it. The groupthink and virtue signaling is intense, when it's not just extremely low effort crud that rises to the top. And yes, before anyone says, I know, "curate." No, thank you.
Friend, this website is EXACTLY the same
I understand that the core similarities are there, but I disagree. The comparisons have been around since I started browsing HN years ago. The moderation on this site, for one, emphasizes constructive conversation and discussion in a way that most subreddits can only dream of.
It also helps that the target audience has been filtered with that moderation, so over time this site (on average) skews more technical and informed.
This site's commenters attempt to apply technical solutions to social problems, then pat themselves on the back despite their comments being entirely inappropriate to the problem space.
There's also no actual constructive discussion when it comes to future-looking tech. The Cybertruck, the Vision Pro, and LLMs are some of the most recent items that were called completely wrong by the most popular comments, and the reasoning behind those predictions had no actual substance.
And the crypto asset discussions are very nontechnical here, veering into elementary and inaccurate philosophical discussions, despite this being a great forum to talk about technical aspects. Every network has pull requests and governance proposals worth discussing, and yet the deepest discussion here is a resurrected take from 2012 about the entire concept not having a licit use case the poster could imagine.
HackerNews isn't exactly like reddit, sure, but it's not much better. People are much better behaved, but still spread a great deal of misinformation.
One way to gauge this property of a community is whether people who are known experts in a respective field participate in it, and unfortunately there are very few of them on HackerNews (this was not always the case). I've had some opportunities to meet with people who are experts, usually at conferences/industry events, and while many of them tend to be active on Twitter... they all say the same things about this site, namely that it's simply full of bad information and the amount of effort needed to dispel that information is significantly higher than the amount of effort needed to spread it.
Next time someone posts an article about a topic you are intimately familiar with, like top 1% subject matter expert in... review the comment section and you'll find heaps of misconceptions, superficial knowledge, and, my favorite, the contrarians who take very strong opinions on a subject they have only passing knowledge of, yet talk about their contrarian opinion with such a high degree of confidence.
One issue is you may not actually be a subject matter expert on a topic that comes up a lot on HackerNews, so you won't recognize that this happens... but while people here are a lot more polite and the moderation policies do encourage good behavior... moderation policies don't do a lot to stop the spread of bad information from poorly informed people.
One of the things I appreciate most about HN is the fact that experts are often found in the comments.
Perhaps we are defining experts differently?
There was a lot of pseudo science being published and voted up in the comments with Ivermectin/HCQ/etc and Covid, when those people weren't experts and before the Ivermectin paper got serious scrutiny.
The other aspect is that people on here think that if they are an expert in one thing, they instantly become an expert in another thing.
> There was a lot of pseudo science being published and voted up in the comments with Ivermectin
Was there? To me it looked like HN comments were skeptical way before the public even knew what the drug was.
https://news.ycombinator.com/item?id=22873687
This is of course true in some cases and less true in others.
I consider myself an expert in one tiny niche field (computer-generated code), and when that field has come up (on HN and elsewhere) over the last 30 years, the general mood (from people who don't do it) is that it's poor-quality code.
Pre-AI this was demonstrably untrue, but meh, I don't need to convince you, so I accept your point of view and continue doing my thing. Our company revenue is important to me, not the opinion of some guy on the internet.
(AI has freshened the conversation, and it is currently giving mixed results, which is to be expected since it is non-deterministic. But I've been doing deterministic generation for 35 years.)
So yeah. Lots of comments from people who don't do something, and I'm really not interested in taking the time to "prove" them wrong.
But equally I think the general level of discussion in areas where I'm not an expert (but experienced) is high. And on a lot of topics, experiences can differ widely.
For example, companies, employees and employers come in all sorts of ways. Some folk have been burned and see (all) management in a certain light. Whereas of course, some are good, some are bad.
Yes, most people still use voting as a measure of "I agree with this", rather than the quality of the discussion, but that's just people, and I'm not gonna die on that hill.
And yeah, I'm not above joining in on a topic I don't technically use or know much about. I'll happily say that the main use for crypto (as a currency) is for illegal activity. Or that crypto in general is a ponzi scheme. Maybe I'm wrong, maybe it really is the future. But for now, it walks like a duck.
So I both agree, and disagree, with you. But I'm still happy to hang out here and get into (hopefully) illuminating discussions.
Do you have any sources to back up those claims?
Frankly, no. As an obvious example that can be stated nowadays: Musk has always been an over-promising liar.
E.g., just look at the 2012+ videos of thunderf00t.
Yet people were literally banned here just for pointing out that he hasn't actually delivered on anything in the capacity he promised until he did the salute.
It's pointless to list other examples, as this page is, as dingnuts pointed out, exactly the same, and most people aren't actually willing to change their opinion based on arguments. They're set in their opinions and think everyone else is dumb.
> Yet people were literally banned here just for pointing out that he hasn't actually delivered on anything in the capacity he promised until he did the salute.
I'd be shocked if they (you?) were banned just for critiquing Musk. So please link the post. I'm prepared to be shocked.
I'm also pretty sure that I could make a throwaway account that only posted critiques of Musk (or about any single subject for that matter) and manage to keep it alive by making the critiques timely, on-topic and thoughtful or get it banned by being repetitive and unconstructive. So would you say I was banned for talking about <topic>? Or would you say I was banned for my behavior while talking about <topic>?
Aside from the fact that I highly doubt anyone was banned as you describe, EM’s stories have gotten more and more grandiose. So it’s not the same.
Today he’s pitching moonshot projects as core to Tesla.
10 years ago he was saying self-driving was easy, but he was also selling by far the best electric vehicle on the market. So lying about self driving and Tesla semis mattered less.
Fwiw I’ve been subbed to tf00t since his 50-part creationist videos in the early 2010s.
I don’t see how that example refutes their point. It can be true both that there have been disagreeable bans and that the bans, in general, tend to result in higher quality discussions. The disagreeable bans seem to be outliers.
> They're set in their opinions and think everyone else is dumb.
Well, anyway, I read and post comments here because commenters here think critically about discussion topics. It’s not a perfect community with perfect moderation but the discussions are of a quality that’s hard to find elsewhere, let alone reddit.
Strongly disagree.
Scroll to the bottom of comment sections on HN, you’ll find the kind of low-effort drive-by comments that are usually at the top of Reddit comment sections.
In other words, it helps to have real moderators.
While the tone on HN is much more civil than on Reddit, it's still quite the echo chamber.
> It's still quite the echo chamber.
We have people from every political spectrum here and they don't get banned, how can it be an echo chamber?
We have communists, free market absolutists and so on all arguing in the same comment section. I don't think I have seen any thread with many comments where both (or more) sides weren't represented.
> We have people from every political spectrum here and they don't get banned, how can it be an echo chamber?
I mean sure, they may not get banned, but start voicing an opinion that is even moderately against the status quo and you get downvoted to oblivion.
> > . . . The groupthink and virtue signaling is intense . . .
> Friend, this website is EXACTLY the same
And it knows it: https://news.ycombinator.com/item?id=4881042
It happens in degrees, and the degree here is much lower.
I disagree. Reddit users are out to impress nobody but themselves, but the other day I saw someone submit a "Show HN" with AI-generated testimonials.
HN has an active grifter culture reinforced by the VC funding cycles. Reddit can only dream about lying as well as HN does.
That's a tangential problem.
HN tends to push up grifter hype slop, and there are a lot of those people around cause VC, but you can still see comments pushing back.
Reading reddit reminds me of high-school forum arguments I had 20 years ago, but lower quality because of population selection. It's just too mainstream at this point and shows you what the middle of the bell curve looks like.
It's actually the reverse; Dunning-Kruger is off the charts on Hacker News.
I don't think there's a lot of groupthink or virtue signaling here, and those are the things that irritate me the most. If people here overestimate their knowledge or abilities, that's okay because I don't treat things people say as gospel/fact/truth unless I have clear and compelling reasons to do so. This is the internet after all.
Personally I also think the submissions that make it to the front page(s) are much better than any subreddit.
Strong disagree as well, this is one of the few places on the Internet which avoids this. I wish there were more
LocalLlama is good for:
- Learning basic terms and concepts.
- Learning how to run local inference.
- Inference-level considerations (e.g., sampling).
- Pointers to where to get other information.
- Getting the vibe of where things are.
- Healthy skepticism about benchmarks.
- Some new research; there have been a number of significant discoveries that either originated in LocalLlama or got popularized there.
LocalLlama is bad because:
- Confusing information about finetuning; there are a lot of myths from early experiments that get repeated uncritically.
- Lots of newbie questions get repeated.
- Endless complaints that it's been too long since a new model was released.
- Most new research; sometimes a paper gets posted but most of the audience doesn't have enough background to evaluate the implications of things even if they're highly relevant. I've seen a lot of cutting edge stuff get overlooked because there weren't enough upvoters who understood what they were looking at.
> Most new research; sometimes a paper gets posted but most of the audience doesn't have enough background to evaluate the implications of things even if they're highly relevant. I've seen a lot of cutting edge stuff get overlooked because there weren't enough upvoters who understood what they were looking at.
Is there a good place for this? Currently I just regularly sift through all of the garbage myself on arxiv to find the good stuff, but it is somewhat of a pain to do.
I am not aware of any central, public location; I use a combination of arxiv, Google Scholar, Huggingface posts, and private Discords. The problem is that most of the public space has been poisoned with AI hype, and it's nearly impossible to find more than surface-level introductions for a lot of topics because the substandard Medium posts and YouTube hypemongers drown them out.
Having a background in machine learning helps, because at least I can search for terminology that hasn't been picked up by the hype machine yet.
There are some communities that are more niche; more academically focused Discords or groups where there's better discussion going on, basically. Those are intermittent enough that you can't expect ongoing general discussion, and for most I'd have to go back and check if they're still worth reading past the one discussion I found useful.
But for the wider internet, the hype train has forced most of the informed discourse off the road.
Lol, this is true, but also a TON of sampling innovations that are getting their love right now from the AI community (see the min_p oral at ICLR 2025) came right from r/LocalLLaMA, so don't be a hater!!!
Poster: https://iclr.cc/media/PosterPDFs/ICLR%202025/30358.png?t=174...
Well, the unfortunate truth is HN has been behind the curve on local LLM discussions, so LocalLLaMA has been the only one picking up the slack. There are just waaaaaaaay too many "ai is just hype" people here, and the grassroots hardware/local-LLM discussions have been quite scant.
Like, we’re fucking two years in and only now do we have a thread about something like this? The whole crowd here needs to speed up to catch up.
There are people who think LLMs are the future and a sweeping change you must embrace or be left behind.
There are others wondering if this is another hype juggernaut like CORBA, J2EE, WSDL, XML, no-SQL, or who-knows-what. A way to do things that some people treated as the new One True Way, but others could completely bypass for their entire, successful career and look at it now in hindsight with a chuckle.
And like those technologies it will find its own niches (like XML and no-SQL being used heavily in the publishing, standards, and other similar industries using document formats such as JATS) or fade away to be replaced with something else that fills the void (like CORBA and WSDL being replaced by other technologies).
I think LLMs will find their uses, it just takes time to distil what they are really useful for vs what the AI companies are generating hype for.
For example, I think they can be used to create better auto-complete by giving them the context information (matching functions, etc.) and letting them generate the completion text from that; a sketch of that idea follows below.
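As a toy sketch of that idea (every name here is made up for illustration): gather likely-relevant signatures from the project and prepend them to the completion prompt:

    # Toy auto-complete prompt builder: prepend matching project signatures
    # as context, then let the model produce the completion text.
    def build_completion_prompt(visible_code: str, matching_signatures: list[str]) -> str:
        context = "\n".join(f"# available: {sig}" for sig in matching_signatures)
        return (
            "Complete the code below, preferring the listed functions.\n"
            f"{context}\n\n{visible_code}"
        )

    prompt = build_completion_prompt(
        "total = sum_prices(cart)\nprint(",
        ["sum_prices(items) -> float", "format_money(x: float) -> str"],
    )
    print(prompt)  # send this to whatever completion model you run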
Are you trolling?
Huh? What part of my reply was trolling?
No. I was not trolling. If you explain why you think I'm trolling I could provide a better response to your generic reply.
Are you trolling?
I use it as a discovery tool. If anybody mentions something interesting, I go and research it, install it, and start playing with it. I couldn't care less whether they like it or not; I'll form my own opinion.
For example, I find all the comments about model X being more "friendly" or "chatty" and model Y being more "unhinged" or whatever to be mostly BS. There are a gazillion ways a conversation can go, and I don't find model X or Y to be consistently chatty or unhinged or creative or whatever every time.
I'd also recommend you go with something like an 8B, so you can have the other 8GB of VRAM for a decent-sized context window. There are tons of good 8B ones, as mentioned above. If you go for the largest model you can fit, you'll have slower inference (as you pass in more tokens) and smaller context.
I think your recommendation falls within
> all of them will have some strengths and weaknesses
Sometimes a higher-parameter model with less quantization and low context will be best; sometimes a lower-parameter model with some quantization and a huge context will be best; sometimes high parameter count + lots of quantization + medium context will be best.
It's really hard to say one model is better than another in a general way, since it depends on so many things like your use case, the prompts, the settings, quantization, quantization method and so on.
If you're building (or trying to build) stuff that depends on LLMs in any capacity, the first step is coming up with your own custom benchmark/evaluation that you can run with your specific use cases under test. Don't share it publicly (so it doesn't end up in the training data) and run it to figure out which model is best for that specific problem; a minimal sketch is below.
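A minimal sketch of what such a private eval can look like; run_model() is a placeholder for whichever local inference call you use, and the cases and scoring rule are of course your own:

    # Private eval harness sketch; keep the cases out of public repos.
    CASES = [
        {"prompt": "Extract the ISO date from: 'Invoice due 2025-06-01'",
         "expect": "2025-06-01"},
        {"prompt": "Answer yes or no: is 'pip instal numpy' spelled correctly?",
         "expect": "no"},
    ]

    def run_model(prompt: str) -> str:
        raise NotImplementedError  # wire up llama.cpp, Ollama, etc. here

    def score() -> float:
        # Crude substring check; swap in whatever scoring fits your use case.
        hits = sum(c["expect"].lower() in run_model(c["prompt"]).lower()
                   for c in CASES)
        return hits / len(CASES)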
8B is the number of parameters (8 billion). The most common quant is 4 bits per parameter, so 8B params is roughly 4GB of VRAM (typically more like 4.5GB); the back-of-the-envelope math is sketched after the links below.
The number of quantized bits is a trade off between size and quality. Ideally you should be aiming for a 6-bit or 5-bit model. I've seen some models be unstable at 4-bit (where they will either repeat words or start generating random words).
Anything below 4-bits is usually not worth it unless you want to experiment with running a 70B+ model -- though I don't have any experience of doing that, so I don't know how well the increased parameter size balances the quantization.
See https://github.com/ggml-org/llama.cpp/pull/1684 and https://gist.github.com/Artefact2/b5f810600771265fc1e3944228... for comparisons between quantization levels.
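As a quick sketch of that back-of-the-envelope math (the 10% overhead factor is my rough assumption for embeddings, quantization scales, etc.):

    # Rough weight-memory estimate for a quantized model.
    def weight_gb(params_b: float, bits: float, overhead: float = 1.1) -> float:
        return params_b * bits / 8 * overhead  # billions of params ~= GB at 8-bit

    for bits in (8, 6, 5, 4):
        print(f"8B @ {bits}-bit: ~{weight_gb(8, bits):.1f} GB")
    # 4-bit lands around 4.4 GB, matching the "roughly 4GB, typically more
    # like 4.5GB" rule of thumb above.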
> The number of quantized bits is a trade off between size and quality. Ideally you should be aiming for a 6-bit or 5-bit model. I've seen some models be unstable at 4-bit (where they will either repeat words or start generating random words).
Note that that's a skill issue of whoever quantized the model. In general, quantization even as low as 3-bit can be almost lossless when you do quantization-aware finetuning [1] (and apparently you don't even need that many training tokens), but even if you don't want to do any extra training, you can be smart about which parts of the model you quantize and by how much, to minimize the damage (e.g., in the worst case, over-quantizing even a single weight can have disastrous consequences [2]).
Some time ago I ran an experiment where I finetuned a small model while quantizing parts of it to 2 bits to see which parts are most sensitive (the numbers were the final loss; lower is better).
So as you can see, quantizing some parts of the model affects it more strongly than others. The downprojection in the MLP layers is the most sensitive part of the model (which also matches what [2] found), so it makes sense to quantize this part less and instead quantize other parts more aggressively. But if you just do the naive "quantize everything to 4-bit" then sure, you might get broken models.
[1] https://arxiv.org/pdf/2502.02631
[2] https://arxiv.org/pdf/2411.07191
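To make the "low bit widths hurt fast" intuition concrete, here's a toy fake-quantization sketch (uniform symmetric rounding on random weights; nothing model-specific):

    import numpy as np

    def fake_quantize(w: np.ndarray, bits: int) -> np.ndarray:
        # Snap each weight to the nearest of 2^bits uniformly spaced levels.
        scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
        return np.round(w / scale) * scale

    w = np.random.default_rng(0).normal(size=(512, 512))
    for bits in (8, 6, 4, 3, 2):
        mse = float(np.mean((w - fake_quantize(w, bits)) ** 2))
        print(f"{bits}-bit MSE: {mse:.2e}")  # error grows ~4x per bit removed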
Interesting. I was aware of using an imatrix for the i-quants but didn't know you could use them for k-quants. I've not experimented with using imatrices in my local setup yet.
And it's not a skill issue... it's the default behaviour/logic when using k-quants to quantize a model with llama.cpp.
With a 16GB GPU you can comfortably run something like Qwen3 14B or Mistral Small 24B at Q4 to Q6 and still have plenty of context space, with much better abilities than an 8B model.
Can system RAM be used for context (albeit at lower parsing speeds)?
Yeah, but it sucks. In fact, if you get the wrong graphics card and the memory bandwidth/speeds suck, things will suck too, so system RAM is even worse (other than the M1/M2/M3 stuff).
What if one is okay with walking away and letting it run (crawl)? Then context can be 128 or 256GB system RAM while the model is jammed into what precious VRAM there is?
I’m curious (as someone who knows nothing about this stuff!)—the context window is basically a record of the conversation so far and other info that isn’t part of the model, right?
I’m a bit surprised that 8GB is useful as a context window if that is the case—it just seems like you could fit a ton of research papers, emails, and textbooks in 2GB, for example.
But, I’m commenting from a place of ignorance and curiosity. Do models blow up the info in the context window, maybe do some processing to pre-digest it?
Yes, every token is expanded into a vector that can have many thousands of dimensions. The vectors are stored for every token and every layer.
You absolutely cannot fit even a single research paper in 2 GB, much less an entire book.
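A rough KV-cache calculation shows why; the config numbers below are illustrative for an 8B-class model with full multi-head attention at fp16 (grouped-query attention shrinks this substantially):

    # Per-token KV cache = 2 (K and V) * layers * kv_heads * head_dim * bytes.
    def kv_cache_gb(tokens: int, layers: int = 32, kv_heads: int = 32,
                    head_dim: int = 128, bytes_per_elem: int = 2) -> float:
        return tokens * 2 * layers * kv_heads * head_dim * bytes_per_elem / 1e9

    print(f"{kv_cache_gb(8_000):.1f} GB")    # an ~8k-token paper: ~4.2 GB
    print(f"{kv_cache_gb(100_000):.1f} GB")  # a ~100k-token book: ~52 GB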
What do you recommend for coding with aider or roo?
Sometimes it’s hard to find models that can effectively use tools
I haven't found a good one locally. I use DeepSeek R1 0528; it's slow but free and really good at coding (OpenRouter has it free currently).
Oh wow, just checked this leaderboard, r1 0528 looks really good
https://leaderboard.techfren.net/
Thanks for the recommendation, will try it out
> Released today; probably the best reasoning model in 8B size.
Actually DeepSeek-R1-0528-Qwen3-8B was uploaded Thursday (yesterday) at 11 AM UTC / 7 PM CST. I had to check if a new version came out since! I am waiting for the other sizes! ;D