Hacker News

FWIW I have not, on a 64GB M1 Max, seen any advantage from oMLX specifically or MLX generally over GGUF with llama.cpp.

The Gemma 4 MLX builds I have found so far have been slower at the same quantisation and much slower with MTP.

The built-in web UI for llama.cpp is really quite good once you have chosen your model. Otherwise I quite like LM Studio for tinkering.

One thing I would say is that both Gemma-4 and Qwen 3.6 simply do not need a large chunk of the typical opencode system prompt. Better off without it.

fouc 14 hours ago

what? you're saying both MLX and MTP have been slower for your mac?

dofm 5 hours ago

MLX, in the forms I have tried (LM Studio, oMLX) and with the models I have tried (Gemma and Qwen), has apparently been slower in both cases, yes.

I have not done in-depth, really controlled testing and there is much about performance tuning I don't understand, but it's fairly clear to me that on an M1 Max, MLX does not have the massive advantage it may have on other machines or other models.

It is wholly possible that MLX is _much_ better on the M3 and up, because the GPU (which is what MLX actually runs on, not the neural engine) is that much better.

Frankly I think llama.cpp may simply have caught up quite a lot.

MTP is the same issue. There is always a chance that adding a separate MTP draft model costs more in compute overhead than it returns in speedup, and since I am using an older machine and the MoE models, I am not in a zone where MTP can add much. What happens is that there's an enormous speed advantage while handling the prompt and the early reasoning, and it then tails off dramatically to be worse, on average, than non-MTP.
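
To put rough numbers on that trade-off, here is a back-of-envelope sketch using the usual speculative-decoding estimate. It is not anything measured on my machine: it assumes each drafted token is accepted independently with the same probability, and the names (`accept_rate`, `draft_len`, `draft_cost`, `verify_cost`) are made up for illustration. Costs are in units of one ordinary decode step; a `verify_cost` above 1.0 stands in for my guess that checking several tokens at once is not free on a MoE model.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumes each drafted token is accepted independently with probability
    accept_rate, and that one token always comes from the verifying pass.
    """
    if accept_rate >= 1.0:
        return draft_len + 1.0
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def mtp_speedup(accept_rate: float, draft_len: int,
                draft_cost: float, verify_cost: float = 1.0) -> float:
    """Estimated speedup over plain decoding (values below 1.0 mean slower).

    draft_cost:  cost of drafting one token, relative to a normal decode step.
    verify_cost: cost of one verification pass, relative to a normal decode step.
    Both are hypothetical inputs you would have to measure yourself.
    """
    step_cost = draft_len * draft_cost + verify_cost
    return expected_tokens_per_step(accept_rate, draft_len) / step_cost


if __name__ == "__main__":
    # Cheap drafts that are usually right: a clear win.
    print(round(mtp_speedup(accept_rate=0.8, draft_len=3, draft_cost=0.1), 2))
    # Pricier drafts that are often wrong, plus verification overhead: a loss.
    print(round(mtp_speedup(accept_rate=0.3, draft_len=3, draft_cost=0.3,
                            verify_cost=1.2), 2))
```

The point is only that the result swings from a gain to a loss as the acceptance rate drops or the draft gets relatively more expensive, which would fit the tail-off I see in long generations.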

(Qwen 3.5 35B shows, possibly, a small advantage if its internal MTP is enabled. But it is small — 10% maybe.)

For the 26B Gemma 4, MLX and MTP combined were noticeably slower than the GGUF is with llama.cpp.

If it were a newer machine with a larger, dense model, I'd definitely expect to see an advantage from MTP, and it is possible that there are some parameters I can tweak (duplicate token penalty, temperature, shared cache stuff) that give MTP more of an edge (keep its successful prediction rate higher).

Either way, it feels like the smallish gain I will see on this particular bit of kit might not be worth the long, long journey down that rabbit hole right now.

amboo7 14 hours ago

I also have an M1 Max 64GB: Qwen 3.6 benefits from MTP (after rounds of parameter optimization). MLX was unstable (haven't tried it recently), and faster at TG (token generation) but slower at PP (prompt processing), so inconclusive.
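
Why "inconclusive": which backend wins end to end depends on the mix of prompt and output tokens. A minimal sketch, with entirely hypothetical rates (not my measurements) and a made-up helper name:

```python
def end_to_end_seconds(pp_rate: float, tg_rate: float,
                       prompt_tokens: int, output_tokens: int) -> float:
    """Total wall time for one request.

    pp_rate: prompt processing speed in tokens/second.
    tg_rate: token generation speed in tokens/second.
    Ignores model load time and any prompt caching.
    """
    return prompt_tokens / pp_rate + output_tokens / tg_rate


if __name__ == "__main__":
    # Hypothetical backends: A is faster at PP, B is faster at TG.
    backend_a = {"pp_rate": 400.0, "tg_rate": 50.0}
    backend_b = {"pp_rate": 250.0, "tg_rate": 60.0}

    workloads = {
        "long prompt, short answer": (8000, 500),
        "short prompt, long answer": (200, 2000),
    }
    for name, (prompt_tokens, output_tokens) in workloads.items():
        a = end_to_end_seconds(prompt_tokens=prompt_tokens,
                               output_tokens=output_tokens, **backend_a)
        b = end_to_end_seconds(prompt_tokens=prompt_tokens,
                               output_tokens=output_tokens, **backend_b)
        winner = "A" if a < b else "B"
        print(f"{name}: A={a:.1f}s B={b:.1f}s -> {winner}")
```

With agent-style use (big system prompt, lots of context) the PP side tends to dominate, so a TG-only win may not show up in practice.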

dofm 8 hours ago

Yeah. I have not really tinkered much with parameter optimisation for the 35B model with MTP. Would be interested to see what you've found.

I'm using the GGUF too; it appears slightly faster in llama.cpp now than current LM Studio but it's not clear to me if that is down to LM Studio having a little more code overhead, older llama.cpp under the hood, or just parameter differences.
