Hacker News

Useful stuff in here that I wish I'd seen a few days ago :-)

I am not convinced that the MTP setup for the QAT model adds very much in terms of speed on my M1 Max, but it is definitely worth experimenting with.

Fiddling about with local models has done so much for my conceptual understanding of what is going on.

FWIW (and YMMV), I also found that the Gemma 4 MTP head occasionally broke the markup in Opencode, so the thinking output displayed untidily and in some cases the stop token was missed entirely. I've stopped using MTP there for now.

Recent Qwen 3.6 models have developer role support, so they will occasionally surprise you with a structured multiple-choice questionnaire.

mft_ a day ago

I found a marginal downside to Qwen3.6-35B-A3B-MTP vs. the non-MTP equivalent on an M1 Max. I’ll maybe experiment with settings further though.

smcleod 5 hours ago

Use the 27B; it's better in every way once you add MTP (which speeds up dense models but often adds little or no speed to MoE models like the 35B-A3B). I get around 100 tok/s on my 2x 3090 machine and 85 tok/s on my M5 Max.

mft_ 18 minutes ago

Thanks, I'll give it a go.

(I generally find standard 27B too slow to enjoy using, whereas 35B-A3B is pretty snappy.)

freehorse a day ago

And the upside of using draft models with MoE models that have so few active parameters (as here, or as in the article) is quite small compared to dense models, where you can get enormous speedups. I would rather run the dense 27B models with speculative decoding.
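A rough way to see why the gain shrinks is the standard speculative-decoding cost model (Leviathan et al., 2023). It assumes each drafted token is accepted independently with probability `alpha`, uses `gamma` draft tokens per step, and takes `c` as the cost of one draft step relative to one target step. When the target model only has ~3B active parameters, a plain decode step is already cheap, so `c` is relatively larger. Verifying several tokens can also touch more experts, which this idealised model ignores. The sketch below is illustrative only: the function names and the example numbers are made up, not measurements from any of the setups above.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens produced per target-model verification pass.

    Assumes each of the `gamma` drafted tokens is accepted independently
    with probability `alpha`; the target pass always contributes one token
    (either a correction or a bonus token).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return gamma + 1.0
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def walltime_speedup(alpha: float, gamma: int, c: float) -> float:
    """Idealised speedup over plain one-token-at-a-time decoding.

    `c` is the cost of one draft step relative to one target step.
    Assumes verifying gamma+1 positions costs the same as one ordinary
    target step: roughly true for memory-bound dense models, optimistic
    for MoE models where extra tokens can route to extra experts.
    """
    return expected_tokens_per_step(alpha, gamma) / (gamma * c + 1.0)


if __name__ == "__main__":
    # Hypothetical numbers: same acceptance rate, cheap vs relatively expensive draft.
    for c in (0.05, 0.3):
        print(f"c={c}: {walltime_speedup(0.8, 3, c):.2f}x")
```

With these made-up inputs it prints about 2.57x for the cheap draft and 1.55x for the relatively expensive one. The real MoE penalty from extra expert reads during verification would push the second figure lower still.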

dofm a day ago

That is what I have learned, yes. Not tested the dense Qwen yet. IIRC the 31B Gemma was slow enough that I doubt MTP will help me much.

dofm a day ago

Yeah. I think it might speed up time to first token but I am not sure how much that matters.

I do enjoy their different personalities when they are tackling "explain this" type puzzles, though.

Gemma writes so well, like a concise code blogger. It makes you understand that the thing we hate about AI slop writing is specifically the cheesy, marketingese, sycophantic ChatGPT tone. It's a choice to sound that way.

Qwen writes more tersely by default, like much English-language documentation in Chinese open source projects. A couple of lines, code example, fact, code example, line of blurb.

I use this prompt every now and then with a new model. It's obviously a classic SQL puzzle, but I've asked new web developers this in the past (prompted by discovering that a client's subcontractor didn't understand it and was therefore unable to migrate some code away from relying on dodgy pre-MySQL 5.x behaviours).

—

  I have a MySQL 5 table like this: [id, label, category, score].   It contains a list of items in different categories (text names like cat1, cat2, cat3) with a numerical score. Is there a way I can write a SQL query to find the item in each category that has the highest score, without using a subquery? No two entries in any category share a score.

—

I enjoy seeing what it deduces from the subtext.

Without "thinking" mode on, they always initially fail and you need to prompt them to find the answer. With thinking mode, they both produce really nice explanations.
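For reference, the usual subquery-free answer to that puzzle is a self LEFT JOIN used as an anti-join. It keeps a row only when no other row in the same category has a higher score. The sketch below runs it against SQLite through Python's `sqlite3` module so that it is self-contained. The table name `items` and the sample rows are hypothetical, and this is not claimed to be what either model produced. The same `LEFT JOIN ... WHERE t2.id IS NULL` pattern is plain SQL that MySQL 5 accepts too. It relies on the prompt's "no two entries in any category share a score" condition: with ties, every tied top row would come back.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory table shaped like [id, label, category, score] (sample data is made up)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items (id INTEGER PRIMARY KEY, label TEXT, category TEXT, score INTEGER)"
    )
    conn.executemany(
        "INSERT INTO items (id, label, category, score) VALUES (?, ?, ?, ?)",
        [
            (1, "a", "cat1", 10),
            (2, "b", "cat1", 30),
            (3, "c", "cat2", 5),
            (4, "d", "cat2", 7),
            (5, "e", "cat3", 42),
        ],
    )
    return conn


def top_item_per_category(conn: sqlite3.Connection) -> list:
    """Return (category, label, score) for the highest-scoring row in each category."""
    # Join each row t1 to any row t2 in the same category with a higher score.
    # If no such t2 exists, the LEFT JOIN yields NULLs and t1 is the category's maximum.
    sql = """
        SELECT t1.category, t1.label, t1.score
        FROM items AS t1
        LEFT JOIN items AS t2
          ON t2.category = t1.category AND t2.score > t1.score
        WHERE t2.id IS NULL
        ORDER BY t1.category
    """
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    print(top_item_per_category(make_demo_db()))
```

On the sample data this prints `[('cat1', 'b', 30), ('cat2', 'd', 7), ('cat3', 'e', 42)]`.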

For me, as an old freelancer who is pretty cynical about vibe coding or "agentic engineering", what I really want is an AI tool that can help me start to solve problems and help me find the right terminology or generate some boilerplate I can tinker with. Both of these models do fine at the kind of "starter" writing that I want when I am trying to untangle an idea.

mark_l_watson a day ago

When I started using QAT recently, I stopped trying to improve my configuration. I will try tuning my local environment again in a few months, but with QAT things are good enough for now.