I think it's all well and good, but the most affordable option is probably still to buy a used MacBook with 16, 32, or 64 GB of unified memory (depending on the budget) and install Asahi Linux for tinkering.

Graphics cards with a decent amount of memory are still massively overpriced (even used), and they're big, noisy, and power-hungry.

> and install Asahi Linux for tinkering.

I would recommend sticking to macOS if compatibility and performance are the goal.

Asahi is an amazing accomplishment, but running native optimized macOS software including MLX acceleration is the way to go unless you’re dead-set on using Linux and willing to deal with the tradeoffs.

It just came to my attention that the 2021 M1 Max with 64GB is less than $1,500 used. That's 64GB of unified memory at regular laptop prices, so I think people will be well equipped with AI laptops rather soon.

Apple really is #2 and probably could be #1 in AI consumer hardware.

Apple is leagues ahead of Microsoft with the whole AI PC thing and so far it has yet to mean anything. I don't think consumers care at all about running AI, let alone running AI locally.

I'd try the whole AI thing on my work MacBook, but Apple's built-in AI stuff isn't available in my language, so perhaps that's also why I haven't heard anybody mention it.

People don’t know what they want yet, you have to show it to them. Getting the hardware out is part of it, but you are right, we’re missing the killer apps at the moment. The very need for privacy with AI will make personal hardware important no matter what.

Two main factors are holding back the "killer app" for AI: fix hallucinations and make agents more deterministic. Once those are in place, people will love AI when it can make them money somehow.

How does one “fix hallucinations” on an LLM? Isn’t hallucinating pretty much all it does?

No no, not at all, see https://openai.com/index/why-language-models-hallucinate/ which was recently featured on the front page - an excellent, clean take on how to fix the issue (they already got a long way with gpt-5-thinking-mini). I liked this bit for its clear outline of the issue:

> Think about it like a multiple-choice test. If you do not know the answer but take a wild guess, you might get lucky and be right. Leaving it blank guarantees a zero. In the same way, when models are graded only on accuracy, the percentage of questions they get exactly right, they are encouraged to guess rather than say “I don’t know.”

> As another example, suppose a language model is asked for someone’s birthday but doesn’t know. If it guesses “September 10,” it has a 1-in-365 chance of being right. Saying “I don’t know” guarantees zero points. Over thousands of test questions, the guessing model ends up looking better on scoreboards than a careful model that admits uncertainty.

Coding agents have shown how. You filter the output against something that can tell the LLM when it’s hallucinating.

The hard part is identifying those filter functions outside of the code domain.
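
To make that concrete, here's a minimal sketch of the pattern, with a hypothetical generate_candidate() standing in for the model call; the "filter function" is just the Python interpreter running a small test suite, i.e. something that can't hallucinate:

```python
import subprocess
import sys
import tempfile

def generate_candidate(prompt: str) -> str:
    """Hypothetical stand-in for the LLM call; returns candidate Python code."""
    return "def add(a, b):\n    return a + b\n"

def passes_checks(code: str, test_code: str) -> bool:
    """Ground the model's output against an external oracle: run the tests."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n" + test_code)
        path = f.name
    result = subprocess.run([sys.executable, path], capture_output=True, timeout=30)
    return result.returncode == 0

def generate_verified(prompt: str, test_code: str, attempts: int = 3) -> str | None:
    for _ in range(attempts):
        candidate = generate_candidate(prompt)
        if passes_checks(candidate, test_code):
            return candidate   # the external check agrees, accept it
    return None                # caller should surface "I don't know", not a guess

if __name__ == "__main__":
    tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
    print(generate_verified("write add(a, b)", tests))
```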

It's called RAG (retrieval-augmented generation), and it's getting very well developed for some niche use cases such as legal, medical, etc. I've personally been working on one for mental health, and please don't let anybody tell you that they're using an LLM as a mental health counselor. I've been working on it for a year and a half, and if we get it production-ready in the next year and a half I will be surprised. From keeping up with the field, I don't think anybody else is any closer than we are.
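
For anyone unfamiliar with the term, the core retrieve-then-ground loop is simple even if production systems are anything but; a toy sketch (keyword overlap instead of embeddings, made-up two-document corpus) of the idea:

```python
# Toy retrieval step of a RAG pipeline. Illustrative only: real systems use
# embedding search, reranking, and citation checks over a real corpus.
from collections import Counter

CORPUS = {
    "doc1": "Grounding answers in retrieved source text reduces fabricated claims.",
    "doc2": "If no relevant passage is retrieved, the assistant should decline to answer.",
}

def score(query: str, text: str) -> int:
    q, t = Counter(query.lower().split()), Counter(text.lower().split())
    return sum(min(q[w], t[w]) for w in q)   # crude keyword overlap

def build_prompt(query: str, min_score: int = 1) -> str | None:
    best_id, best_text = max(CORPUS.items(), key=lambda kv: score(query, kv[1]))
    if score(query, best_text) < min_score:
        return None   # nothing relevant retrieved: answer "I don't know"
    return (f"Answer using ONLY this source ({best_id}):\n{best_text}\n\n"
            f"Question: {query}\nIf the source does not contain the answer, say so.")

print(build_prompt("does grounding reduce fabricated claims?"))
```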

Other than that, Mrs. Lincoln, how was the Agentic AI?

You can’t fix the hallucinations

  > People don’t know what they want yet, you have to show it to them
Henry Ford famously quipped that had he asked his customers what they wanted, they would have wanted a faster horse.

We've shown people so many times and so forcefully that they're now actively complaining about it. It's a meme.

The problem isn't getting your killer AI app in front of eyeballs. The problem is showing something useful or necessary or wanted. AI has not yet offered the common person anything they want or need! The people have seen what you want to show them; they've been forced to try it, over and over. There is nobody who interacts with the internet who has not been forced to use AI tools.

And yet still nobody wants it. Do you think that they'll love AI more if we force them to use it more?

> And yet still nobody wants it.

Nobody wants the one-millionth meeting-transcription app or the one-millionth coding agent, sure.

It’s a developer creativity issue. I personally believe the lack of creativity is so egregious that if anyone were to release a killer app, the entire lackluster dev community would copy it into eternity, to the point where you’d think that’s all AI can do.

This is not a great way to start off the morning, but gosh darn it, I really hate that this profession attracted so many people that just want to make a buck.

——-

You know what was the killer app for the Wii?

Wii Sports. It sold a lot of Wiis.

You have to be creative with this AI stuff, it’s a requirement.

The Ryzen AI Max+ 395 with 64GB of LPDDR5 is $1,500 new in a ton of form factors, and $2k with 128GB. If I have $1,500 for a unified-memory inference machine, I'm probably not getting a Mac. It's not a bad choice per se (llama.cpp supports that hardware extremely well), but a modern Ryzen APU at the same price is more of what I want for that use case; with the M1 Mac you're paying for a Retina display and a bunch of stuff unrelated to inference.
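
As a rough illustration of how little ceremony that takes, here's a sketch using the llama-cpp-python bindings, assuming a build with Vulkan or ROCm enabled; the model filename is hypothetical, substitute any local GGUF:

```python
# Sketch: llama.cpp (via llama-cpp-python) with every layer offloaded to the iGPU.
from llama_cpp import Llama

llm = Llama(
    model_path="models/qwen3-30b-a3b-q4_k_m.gguf",  # hypothetical path, use your own GGUF
    n_gpu_layers=-1,   # offload all layers to the integrated GPU
    n_ctx=8192,        # context window; raise it if you have the memory headroom
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain unified memory in one sentence."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```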

The Ryzen AI Max+ 395 isn't available in Europe.

Not just LPDDR5, but LPDDR5X-8000 on a 256-bit bus. The 40 CUs of RDNA 3.5 are nice, but that's less raw compute than, e.g., a desktop 4060 Ti dGPU. The memory is fast: 200+ GB/s real-world read and write. (The AIDA64 thread about limited read speeds is misleading; that's what the CPU sees given how the memory controller is configured, but GPU tooling reveals the full 200+ GB/s read and write.) Though you can only allocate 96 GB to the iGPU on Windows, or 110 GB on Linux.

The ROCm and Vulkan stacks are okay, but they're definitely not fully optimized yet.

Strix Halo's biggest weakness compared to Mac setups is memory bandwidth. M4 Max gets something like 500+ GB/s, and M3 Ultra gets something like 800 GB/s, if memory serves correctly.

I just ordered a 128 GB Strix Halo system, and while I'm thrilled about it, in fairness, for people who don't have an adamant insistence against proprietary kernels, refurbished Apple Silicon does offer a compelling alternative with superior performance options. AFAIK there's nothing like AppleCare for any of the Strix Halo systems either.

The 128 GB Strix Halo system was tempting me, but I think I'm going to hold out for the Medusa Point memory bandwidth gains to expand my cluster setup.

I have a Mac Mini M4 Pro with 64GB that does quite well with inference on the Qwen3 models, but it's hell to network with my home K3s cluster, and going deeper on that is half the fun of this stuff for me.

>The 128 GB Strix Halo system was tempting me, but I think I'm going to hold out for the Medusa Point

I was initially thinking this way too, but I realized a 128GB Strix Halo system would make an excellent addition to my homelab / LAN even once it's no longer the star of the stable for LLM inference - i.e., I will probably get a Medusa Halo system as well once they're available. My other devices are Zen 2 (3600X) / Zen 3 (5950X) / Zen 4 (8840U), an Alder Lake N100 NUC, and a Twin Lake N150 NUC, along with a few Pis and Rockchip SBCs, so a Zen 5 system makes a nice addition to the high end of my lineup anyway. Not to mention, everything else I have maxes out at 2.5GbE. I've been looking for an excuse to upgrade my switch from 2.5GbE to 5 or 10GbE, and the Strix Halo system I ordered was the Beelink GTR9 Pro with dual 10GbE. Regardless of whether it's doing LLM or other gen-AI inference, some extremely light ML training / light fine-tuning, media transcoding, or just being yet another UPS-protected server on my LAN, there's just so much capability offered at this price and TDP point compared to everything else I have.

Apple Silicon would've been a serious competitor for me on the price/performance front, but I'm right up there with RMS in terms of ideological hostility towards proprietary kernels. I'm not totally perfect (privacy and security are a journey, not a destination), but I am at the point where I refuse to use anything running an NT or Darwin kernel.

That is sweet! The extent of my cluster is a few Pis that talk to the Mac Mini over the LAN for inference stuff, and I could definitely use some headroom there. I tried to integrate the Mini into the cluster directly by running k3s in colima, but to join an existing cluster via IP I had to run colima in host networking mode, so any pods on the Mini doing CoreDNS lookups were hitting collisions with mDNSResponder when dialing port 53. Finally decided that the Macs are nice machines but not a good fit as members of a cluster.

Love that AMD seems to be closing the gap on the performance _and_ power efficiency of Apple Silicon with the latest Ryzen advancements. Seems like one of these new mini PCs would be a dream setup to run a bunch of data- and AI-centric hobby projects on, particularly workloads like geospatial imagery processing in addition to the LLM stuff. It's a fun time to be a tinkerer!

It’s not better than the Macs yet. There’s no half-assing this AI stuff; AMD is behind even the 4-year-old MacBooks.

NVIDIA is so greedy that doling out $500 will only get you 16GB of VRAM at half the speed of an M1 Max. You can get a lot more speed with more expensive NVIDIA GPUs, but you won’t get anything close to a decent amount of VRAM for less than $700-1,500 (really, you won’t even get close to 32GB).

Makes me wonder just how much secret effort is being put in by the MAG7 to strip NVIDIA of this pricing power, because they are absolutely price gouging.

I recently got an M3 Max with 64GB (the higher-spec Max) and it's been a lot of fun playing with local models. It cost around $3k though, even refurbished.

M1 doesn't exactly have stellar memory bandwidth for this day and age, though.

M1 Max with 64GB has 400GB/s memory bandwidth.

You have to get into the highest 16-core M4 Max configurations to begin pulling away from that number.

Oh sorry I thought it was only about 100. I'd read that before but I must have remembered incorrectly. 400 is indeed very serviceable.

Get an Apple Silicon MacBook with a broken screen and it’s an even better deal.

The mini PCs based on the AMD Ryzen AI Max+ 395 (Strix Halo) are probably pretty competitive with those. Depending on which one you buy, it's $1,700-2,000 for one with 128GB of RAM shared with the integrated Radeon 8060S graphics. There are videos on YouTube about using these with the bigger LLM models.

If the Moore's Law Is Dead leaks are to be believed, there are going to be 24GB GDDR7 5080 Super and maybe even 5070 Ti Super variants in the $1k (MSRP) range, and one assumes fast Blackwell NVFP4 tensor cores.

Depends on what you're doing, but at FP4 that goes pretty far.

You don't even need Asahi; you can run ComfyUI on it, but I recommend the Draw Things app: it just works and holds your hand a LOT. I'm able to run a few models locally, and the underlying app is open source.

I used Draw Things after fighting with ComfyUI.

What about the AMD Ryzen AI Max+ 395 mini PCs with up to 128GB of unified memory?

Their memory bandwidth is the problem. 256 GB/s is really, really slow for LLMs.

Seems like at the consumer hardware level you just have to pick your poison, or which one factor you care about most. Macs with a Max or Ultra chip have good memory bandwidth and ultra-low power consumption, but low compute. Discrete GPUs have great compute and bandwidth, but low-to-middling VRAM and high costs and power consumption. The unified-memory PCs like the Ryzen AI Max and the Nvidia DGX Spark deliver middling compute, more VRAM, and terrible memory bandwidth.

It's an underwhelming product in an annoying market segment, but 256GB/s really isn't that bad when you look at the competition: 150GB/s from hex-channel DDR4, 200GB/s from quad-channel DDR5, or around 256GB/s from Nvidia Digits or an M Pro (which you can't get with 128GB). For context, it's about what low-to-mid-range GPUs provide, and 2.5-5x the bandwidth of the 50-100 GB/s memory most people currently have.

If you're going with a Mac Studio with a Max chip, you're going to be paying twice the price for twice the memory bandwidth, but the kicker is you'll be getting about the same compute as the AMD AI chips, which is comparable to a low-to-mid-range GPU. Even mid-range GPUs like the RX 6800 or RTX 3060 have 2x the compute. When the M1 chips first came out, people were getting seriously bad prompt-processing performance, to the point that it was a legitimate consideration before purchase, and that was back when local models could barely manage 16k of context. If money weren't a consideration and you got the best possible Mac Studio Ultra, 800GB/s still wouldn't feel like a significant upgrade when it takes a minute to process every 80k of uncached context that you'll absolutely be using on 1M-context models.
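
A back-of-envelope version of that argument, with assumed numbers (a ~4.5 bit/weight quant of a dense 30B model, approximate peak bandwidths, and an assumed ~1,300 tok/s prefill rate), so treat the outputs as rough ceilings:

```python
# Decode is mostly memory-bound: every generated token re-reads the weights.
model_bytes = 30e9 * 4.5 / 8   # ~17 GB of weights touched per token (assumed quant)

for name, bw_gbps in [("Strix Halo", 256), ("M4 Max", 546), ("M3 Ultra", 819)]:
    print(f"{name}: ~{bw_gbps * 1e9 / model_bytes:.0f} tok/s decode ceiling")

# Prefill is compute-bound rather than bandwidth-bound; at an assumed
# ~1,300 prompt tokens/s, an uncached 80k-token prompt takes about a minute.
print(f"80k-token prompt at 1,300 tok/s prefill: ~{80_000 / 1_300 / 60:.1f} minutes")
```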

But for matrix multiplication, isn't compute more important, as there are N³ multiplications but just N² numbers in a matrix?

Also, I don't think power consumption is important for AI. Typically you do AI at home or in the office, where there's plenty of electricity.

>But for matrix multiplication, isn't compute more important, as there are N³ multiplications but just N² numbers in a matrix?

Being able to quickly calculate a dumb or unreliable result because you're VRAM-starved is not very useful in most scenarios. To run capable models you need VRAM, so high VRAM with lower compute is usually more useful than the inverse (a lot of both is even better, but you need a lot of money and power for that).

Even in this post with four RPis, Qwen3 30B A3B is still an MoE model, not a dense model. It runs fast with only 3B active parameters and can be parallelized across computers, but it's much less capable than a dense 30B model running on a single GPU.
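
Same back-of-envelope style for why 3B active parameters decode so much faster than a dense 30B on identical hardware (assumed ~4.5 bits/weight and Strix Halo-class 256 GB/s):

```python
bandwidth = 256e9            # bytes/s of memory bandwidth (assumed)
bytes_per_param = 4.5 / 8    # ~Q4 quantization (assumed)

for name, active in [("Qwen3 30B A3B (~3B active)", 3e9), ("dense 30B", 30e9)]:
    per_token = active * bytes_per_param   # weight bytes read per generated token
    print(f"{name}: ~{bandwidth / per_token:.0f} tok/s decode ceiling")
```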

> Also I don't think power consumption is important for AI. Typically you do AI at home or in the office where there is lot of electricity.

Depends on what scale you're discussing. If you want to match the VRAM of a 512GB Mac Studio Ultra with a bunch of Nvidia GPUs like RTX 3090s, you're not going to be able to run that on a typical American 15-amp circuit; you'll trip a breaker halfway there.
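
Quick sanity check on that, with assumed round numbers rather than measured draw:

```python
# One 120 V / 15 A household circuit, derated to 80% for continuous load.
circuit_watts = 120 * 15 * 0.8       # = 1440 W usable
gpu_watts, host_watts = 350, 200     # ballpark RTX 3090 board power + rest of the system
max_gpus = int((circuit_watts - host_watts) // gpu_watts)
print(f"~{max_gpus} x RTX 3090 per circuit -> ~{max_gpus * 24} GB VRAM, "
      f"vs 512 GB of unified memory on the Mac Studio")
```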

Works very well and very fast with this Qwen3 30B A3B model.