What is everyone using their local LLMs for primarily? Unless you have a beefy machine, you'll never approach the level of quality of proprietary models like Gemini or Claude, but I'm guessing these smaller models still have their use cases, just not sure what those are.
Not everyone is comfortable with sending their data and/or questions and prompts to an external party.
Especially now that a court has ordered OpenAI to keep records of it all.
https://www.adweek.com/media/a-federal-judge-ordered-openai-...
I generally try a local model first for most prompts. It's good enough surprisingly often (over 50% for sure). Every time I avoid using a cloud service is a win.
I think that the future of local LLMs is delegation. You give it a prompt and it very quickly identifies what should be used to solve the prompt.
Can it be solved locally with locally running MCPs? Or maybe it's a system API - like reading your calendar or checking your email. Otherwise it identifies the best cloud model and sends the prompt there.
Basically Siri if it was good
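Something like this toy sketch is roughly what I mean, assuming the ollama Python client and a small local model (the model tag and the LOCAL/CLOUD labels are just placeholders for illustration):

    import ollama

    def route(prompt: str) -> str:
        # Ask a small local model whether it can handle the request itself.
        verdict = ollama.generate(
            model="llama3.2:1b",  # example small local model
            prompt="Answer LOCAL or CLOUD only. Could a small offline model with "
                   "calendar/email tools handle this request?\n\n" + prompt,
        )["response"]
        return "local" if "LOCAL" in verdict.upper() else "cloud"

    print(route("What's on my calendar tomorrow?"))      # expected: local
    print(route("Draft a 10-page licensing contract."))  # expected: cloud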
I completely disagree. I don't see the current status quo fundamentally changing.
That idea makes so much sense on paper, but it's not until you start implementing it that you realize why no one does it (including Siri). "Some tasks are complex and better suited for a giant model, but small models are perfectly capable of handling simple, limited tasks" makes a ton of sense, but the component best equipped to evaluate that decision is the smarter component of your system. At which point, you might as well have had it run the task.
It's like assigning the intern to triage your work items.
When actually implementing the application with that approach, every time you encounter an "AI miss" you (understandably) blame the small model, and eventually you give up and delegate yet another scenario to the cloud model.
Eventually you feel you're artificially handcuffing yourself compared to literally everybody else by trying to ship something on a 1B model. You have the worst of all options: a crappy model with lots of hiccups that is still (by far) the most resource-intensive part of your application, making the whole thing super heavy, while you delegate more and more to the cloud model.
The local LLM scenario is going to be driven entirely by privacy concerns (for which there is no workaround; it's not like an E2EE LLM API could exist) or by cost concerns, if you believe you can run it cheaper.
Doesn't this ignore that some data may be privileged or too big to send to the cloud? Perhaps I have my health records in Apple Health and Kaiser Permanente. You can imagine it being okay for that data to be accessed locally, but not sent up to the cloud.
I'm confused. Your Apple Health or Kaiser Permanente data is already stored in the cloud. It's not like it's only ever stored locally, such that if you lost your phone you'd lose your Apple Health or Kaiser Permanente data.
I already mentioned privacy being the only real concern, but it won't really be end-user privacy. At least that particular concern isn't the ball mover people's comments here would make you think it is. Plenty of people already store their medical information in Google Drive and Gmail attachments. If end-user privacy from "the cloud" were actually a thing, you would have seen that reflected in the market.
The privacy concerns that are of importance are that of organizations.
I'm currently experimenting with Devstral for my own local coding agent that I've slowly pieced together. It's in many ways nicer than Codex in that 1) it has full access to my hardware, so it can start VMs, make network requests, and do everything else I can do, which Codex cannot, and 2) it's way faster, both in initial setup and in working through things and creating a patch.
Of course, it still isn't at the same level as Codex itself; the model Codex uses is just way better, so of course it gets better results. But Devstral (as I currently use it) is able to make smaller changes and refactors, and I think if I evolve the software a bit more, it can start making larger changes too.
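Roughly, the setup looks like this sketch: an OpenAI-compatible local server (for example Ollama or llama.cpp serving Devstral) plus the ability to run things directly on my machine. The endpoint, model tag, and one-step loop below are illustrative only, and a real agent would review commands before running them:

    import subprocess
    from openai import OpenAI

    # Local OpenAI-compatible endpoint; the key is unused but required by the client.
    client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

    resp = client.chat.completions.create(
        model="devstral",  # whatever tag your local server exposes
        messages=[{"role": "user", "content":
                   "Reply with a single safe shell command that shows the git status."}],
    )
    cmd = resp.choices[0].message.content.strip()

    # A real agent loop would validate/sandbox this before executing it.
    print(subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout)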
Why are you comparing it to Codex and not Claude Code, which can do all those things?
And why not just use OpenHands, which it was designed around, and which I presume can also do all those things?
> unless you have a beefy machine
The average person in r/locallama has a machine that would make r/pcmasterrace users blush.
An Apple M1 is decent enough for LMs. My friend wondered why I got so excited about it when it came out five years ago. It wasn't that it was particularly powerful - it's decent. What it did was to set a new bar for "low end".
A new Mac easily starts around $1k and quickly goes up from there if you want a storage or RAM upgrade, especially enough memory to really run some local models. It's insane that a $1,000 computer is called "decent" and "low end". My daily driver personal laptop was $300 brand new.
An M1 Mac is about 5 years old at this point and can be had for far less than a grand.
A brand new Mac Mini M4 is only $499.
Ah, I was focusing on the laptops, my bad. But still, it's more than $499. I just looked on the Apple Store website: the Mac Mini M4 starts at $599 (not $499), with only 256GB of storage.
https://www.apple.com/shop/buy-mac/mac-mini/m4
microcenter routinely sells that system for $450.
https://www.microcenter.com/product/688173/Mac_mini_MU9D3LL-...
That's fun to hear given that low end laptops are now $800, mid range is like $1.5k and upper end is $3k+ even for non-Apple vendors. Inflation makes fools of us all.
Low end laptops can still easily be found for far less than $800.
https://www.microcenter.com/product/676305/acer-aspire-3-a31...
The first IBM PC in 1981 cost $1,565, which is comparable to $5,500 after inflation.
Of course it depends on what you consider "low end" - it's relative to your expectations. I have a G4 TiBook, the definition of a high-end laptop, by 2002 standards. If you consider a $300 laptop a good daily driver, I'll one-up you with this: <https://www.chrisfenton.com/diy-laptop-v2/>
My $300 laptop is a few years old. It has a Ryzen 3 3200U CPU, a 14" 1080p display, and a backlit keyboard. It came with 8GB of RAM and a 128GB SSD; I upgraded to 16GB with RAM acquired from a dumpster dive and to a 256GB SSD for like $10 on clearance at Microcenter. I upgraded the WiFi to an Intel AX210 6E for about another $10 off Amazon. It gets 6-8 hours of battery life doing browsing and text editing kinds of workloads.
The only thing that is itching me to get a new machine is it needs a 19V power supply. Luckily it's a pretty common barrel size, I already had several power cables laying around that work just fine. I'd prefer to just have all my portable devices to run off USB-C though.
I know I speak for everyone that your dumpster laptop is very impressive, give yourself a big pat on the back. You deserve it.
You're right: memory size first, and then bandwidth, is what matters most for LLMs. Apple's unified memory currently doesn't offer great bandwidth, but it's not a bad option if you can find one for a good price. The prices for new ones are just bonkers.
I avoid using cloud whenever I can on principle. For instance, OpenAI recently indicated that they are working on some social network-like service for ChatGPT users to share their chats.
Running it locally helps me understand how these things work under the hood, which raises my value on the job market. I also play with various ideas which have LLM on the backend (think LLM-powered Web search, agents, things of that nature), I don't have to pay cloud providers, and I already had a gaming rig when LLaMa was released.
General local inference strengths:
- Experiments with inference-level control; you can't do the Outlines / Instructor stuff with most API services, can't do the advanced sampling strategies, etc. (They're catching up, but they're 12 months behind what you can do locally. See the sketch after this list.)
- Small, fast, finetuned models; _if you know what your domain is sufficiently to train a model you can outperform everything else_. General models usually win, if only due to ease of prompt engineering, but not always.
- Control over which model is being run. Some drift is inevitable as your real-world data changes, but when your model is also changing underneath you it can be harder to build something sustainable.
- More control over costs; this is the classic on-prem versus cloud decision. Most cases you just want to pay for the cloud...but we're not in ZIRP anymore and having a predictable power bill can trump sudden unpredictable API bills.
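As a concrete example of that inference-level control, here is the sketch mentioned in the first bullet: constrained decoding by masking next-token logits, which hosted APIs generally don't expose. The model name is just an example of a small local model:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # example; any small causal LM works
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    prompt = "Is 17 a prime number? Answer yes or no:"
    inputs = tok(prompt, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token logits

    # Constrain the next token to " yes" or " no" by masking everything else.
    allowed = [tok.encode(" yes", add_special_tokens=False)[0],
               tok.encode(" no", add_special_tokens=False)[0]]
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed] = logits[allowed]
    print(tok.decode(mask.argmax()))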
In general, the move to cloud services was originally a cynical OpenAI move to keep GPT-3 locked away. They've since built up a bunch of reasons to prefer the in-cloud models (heavily subsidized fast inference, the biggest and most cutting-edge models, etc.), so if you need the latest and greatest right now and are willing to pay, it's probably the right move for most businesses.
This is likely to change as we get models that can reasonably run on edge devices; right now it's hard to build an app or a video game that incidentally uses LLM tech because user revenue is unlikely to exceed inference costs without a lot of careful planning or a subscription. Not impossible, but definitely adds business challenges. Small models running on end-user devices opens up an entirely new level of applications in terms of cost-effectiveness.
If you need the right answer, sometimes only the biggest cloud API model is acceptable. If you've got some wiggle room on accuracy and can live with occasionally getting a substandard response, then you've got a lot more options. The trick is that the things an LLM is best at are always going to be things where less than five nines of reliability is acceptable, so even though the biggest models are more reliable on average, there are many tasks where you might be just fine with a small, fast model that you have more control over.
This is an excellent example of local LLM application [1].
It's an AI-driven chat system designed to support students in the Introduction to Computing course (ECE 120) at UIUC, offering assistance with course content, homework, or troubleshooting common problems.
It serves as an educational aid integrated into the course’s learning environment using UIUC Illinois Chat system [2].
Personally I've found it really useful that it provides the relevant portions of the course study materials, for example the slides directly related to the discussion, so the students can check the sources and the veracity of the answers provided by the LLM.
It seems to me that RAG is the killer feature for local LLMs [3]. It directly addresses the main pain point of LLM hallucinations and helps LLMs stick to the facts.
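The core RAG loop is small enough to sketch. This assumes sentence-transformers for embeddings and uses made-up note snippets; the grounded prompt would then go to whatever local LLM you run:

    from sentence_transformers import SentenceTransformer, util

    docs = [
        "Lecture 3: An SR latch is built from cross-coupled NOR gates and stores one bit.",
        "Lecture 7: Two's complement represents -x as 2^n - x for n-bit words.",
        "Syllabus: Exam 1 covers combinational logic and LC-3 basics.",
    ]

    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    doc_emb = embedder.encode(docs, convert_to_tensor=True)

    question = "How does an SR latch work?"
    q_emb = embedder.encode(question, convert_to_tensor=True)

    # Retrieve the most similar chunk and build a source-grounded prompt.
    best = util.cos_sim(q_emb, doc_emb).argmax().item()
    prompt = f"Answer using only this source:\n{docs[best]}\n\nQuestion: {question}"
    print(prompt)  # feed this to the local LLM and cite docs[best] back to the student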
[1] Introduction to Computing course (ECE 120) Chatbot:
https://www.uiuc.chat/ece120/chat
[2] UIUC Illinois Chat:
https://uiuc.chat/
[3] Retrieval-augmented generation (RAG):
https://en.wikipedia.org/wiki/Retrieval-augmented_generation
Does this actually need to be local? Since the chat bot is open to the public and I assume the course material used for RAG all on this page (https://canvas.illinois.edu/courses/54315/pages/exam-schedul...) all stays freely accessible - I clicked a few links without being a student - I assume a pre-prompted larger non-local LLM would outperform the local instance. Though, you can imagine an equivalent course with all of its content ACL-gated/'paywalled' could benefit from local RAG, I guess.
You still can get decent stuff out of local ones.
Mostly I use them for testing tools and integrations via API so as not to spend money on subscriptions. When I see something working, I switch to a proprietary one to get the best results.
If you're comfortable with the API, all the services provide pay-as-you-go API access that can be much cheaper. I've tried local, but the time cost of getting it to spit out something reasonable wasn't worth the literal pennies the answers from the flagship would cost.
This. The APIs are so cheap and they are up and running right now with 10x better quality output. Unless whatever you are doing is Totally Top Secret or completely nefarious, then send your prompts to an API.
I don't see responses taking too much time. I have above-average hardware but nothing ultra fancy, and I get decent response times from something like Llama 3.x. Maybe I am just happy with not-instant replies, but from online models I do not get replies much faster.
> but from online models I do not get replies much faster.
My point is that raw tokens/second isn't all that matters. The tokens/second required for a correct/acceptable-quality result is what actually matters. From my experience, the large LLM will almost always one-shot an answer that takes many back-and-forth iterations/revisions from Llama 3.x. With higher reasoning tasks, you might spend many iterations only to realize the small model isn't capable of providing an answer, but the large model could after a few. That wasted time would usually have cost only pennies if you had just started with the large model.
Of course, it matters what you're actually doing.
If you look on localllama you'll see most of the people there are really just trying to do NSFW or other questionable or unethical things with it.
The stuff you can run on reasonable home hardware (e.g. a single GPU) isn't going to blow your mind. You can get pretty close to GPT3.5, but it'll feel dated and clunky compared to what you're used to.
Unless you have already spent big $$ on a GPU for gaming, I really don't think buying GPUs for home makes sense, considering the hardware and running costs, when you can go to a site like vast.ai and borrow one for an insanely cheap amount to try it out. You'll probably get bored and be glad you didn't spend your kids' college fund on a rack of H100s.
There's some other reasons to run local LLMs. If it's on my PC, I can preload the context with, say, information about all the members of my family. Their birthdays, hobbies, favorite things. I can load in my schedule, businesses I frequent. I can connect it to local databases on my machine. All sorts of things that can make it a useful assistant, but that I would never upload into a cloud service.
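For instance, something along these lines keeps the context entirely on the machine (a sketch assuming the ollama Python client; the model tag and the personal details are placeholders):

    import ollama

    # Private context that never leaves this machine.
    personal_context = (
        "Family: Sam (birthday June 3, likes woodworking), "
        "Ada (birthday Jan 12, plays violin). "
        "Regular spots: Green Leaf Cafe, Westside Gym."
    )

    reply = ollama.chat(
        model="llama3.1:8b",  # example local model
        messages=[
            {"role": "system", "content": "You are a private assistant. Context: " + personal_context},
            {"role": "user", "content": "Gift ideas for Sam's birthday?"},
        ],
    )
    print(reply["message"]["content"])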
Shouldn't the mixture-of-experts (MoE) approach allow one to conserve memory by working on one specific problem type at a time?
> (MoE) divides an AI model into separate sub-networks (or "experts"), each specializing in a subset of the input data, to jointly perform a task.
Sort of, but the "experts" aren't easily divisible in a conceptually interpretable way so the naive understanding of MoE is misleading.
What you typically end up with in memory constrained environments is that the core shared layers are in fast memory (VRAM, ideally) and the rest are in slower memory (system RAM or even a fast SSD).
MoE models are typically very shallow-but-wide in comparison with the dense models, so they end up being faster than an equivalent dense model, because they're ultimately running through fewer layers each token.
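In practice that split is just partial offload. A sketch with llama-cpp-python, where the GGUF path and layer count are illustrative and tuning n_gpu_layers to your VRAM is the whole game:

    from llama_cpp import Llama

    llm = Llama(
        model_path="models/mixtral-8x7b-instruct.Q4_K_M.gguf",  # example MoE GGUF
        n_gpu_layers=20,  # offload only as many layers as fit in VRAM; the rest stay in system RAM
        n_ctx=4096,
    )
    out = llm("Q: Why is the sky blue? A:", max_tokens=64)
    print(out["choices"][0]["text"])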
I have a large repository of notes, article drafts, and commonplace book-type stuff. I experimented a year or so ago with a system using RAG to "ask myself" what I have to say about various topics. (I suppose nowadays I would use MCP instead of RAG?) I was not especially impressed by the results with the models I was able to run: long-winded responses full of slop and repetition, irrelevant information pulled in from notes that had some semantically similar ideas, and such. I'm certainly not going to feed the contents of my private notebooks to any of the AI companies.
You'd still use RAG, just use MCP to more easily connect an LLM to your RAG pipeline
To clarify: what I was doing was first querying for the documents via a standard document database query and then feeding the best matching documents to the LLM. My understanding is that with MCP I'd delegate the document query from the LLM to the tool.
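Roughly, with MCP the document query becomes a tool the model calls on demand. A sketch assuming the official MCP Python SDK, with a placeholder keyword match standing in for the real document database query:

    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("notes")

    NOTES = {
        "gardening.md": "Notes on composting and raised beds...",
        "reading-log.md": "Thoughts on books read this year...",
    }

    @mcp.tool()
    def search_notes(query: str) -> str:
        """Return notes whose text mentions the query."""
        hits = [name + ":\n" + text for name, text in NOTES.items()
                if query.lower() in text.lower()]
        return "\n\n".join(hits) or "No matching notes."

    if __name__ == "__main__":
        mcp.run()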
As a beginner, I also haven't had much luck with embedded vector queries. Firstly, setting it up was a major pain in the ass and I couldn't even get it to ingest anything beyond .txt files. Second, maybe it was my AI system prompt or the lack of outside search capabilities, but unless I was very specific with my query the response was essentially "can't find what you're looking for".
What were you trying it in? With openwebui RAG pretty much worked out of the box.