Hacker News

Huh, according to that model card this is a 137B total parameter model.

Performance doesn't seem that good:

- MAI-Code-1-Flash (137B-A5B) = 51% on SWE-bench Pro

- Qwen3.6-35B-A3B = 49.5% on SWE-bench Pro (https://huggingface.co/Qwen/Qwen3.6-35B-A3B)

They benchmark against Claude Haiku, but Haiku is not good: it's worse than tiny open models you can run locally or via API at 10% of the cost.

davecitron 13 hours ago

Dave Citron here, from the MAI team. Thanks for the feedback, we're getting the model card updated to call out 5B active parameters (137B total).

On benchmarks: in the same VS Code harness, MAI-Code-1-Flash scored 51.2% on SWE-bench Pro vs. Haiku's 35.2%, which we see as a pretty big leap. Going forward, we'll include additional models in our benchmarks, such as Qwen 3.6 and Gemma 4.

easygenes 12 hours ago

Have you run it through DeepSWE? I understand that's probably a high ask for this class of model, but would be interesting to see regardless.

Even if it can't fully pass much, there are so many tests against most of the scenarios that you can get a fairly rich report beyond the pass@1 stat. See e.g. this DeepSWE report against the Minimax M3 model: https://entrpi.github.io/misc/deep-swe-minimax-m3/

kosolam 9 hours ago

Hey Dave, I’d love to add your new model to the harness I’m going to open-source very soon. I'm going to publish benchmarks on real-world tasks.

sfifs 10 hours ago

Qwen is definitely the model to beat as of mid-2026. I didn't benchmark with SWE, as my use cases are OpenClaw [1], but I found both Qwen 3.6 35B A3B and, more impressively, Qwen 3.5 122B A10B starting to be competitive with closed flash models. The NVFP4 quant of the latter is what I'm running now on DGX.

[1] https://srinathh.medium.com/mid-size-local-models-are-now-co...

abustamam an hour ago

How does Qwen compare to DeepSeek or Kimi? I haven't spent much time with Qwen, but I find DeepSeek to be mostly comparable to Opus for my pet projects. Kimi K2.6 did a lot of stupid stuff and talked to itself a lot: "let me do X... Wait, X doesn't make sense because the user explicitly said Y".

DeepSeek seems to try to understand first before going off.

giancarlostoro a day ago

The takeaway is that this is a smaller model that competes with Haiku. I would hope they come out with a "Sonnet"-competing model, then Opus. I have been wondering why Microsoft is kind of "sleeping" on offering its own models on Copilot; maybe it was part of their deal with OpenAI? Not sure.

mdasen a day ago

Yes, it's a "smaller" (137B) model that competes with Haiku, but it's basically the performance of Qwen3.6-35B-A3B which is 75% smaller and 98% smaller in terms of active parameters (since it's a mixture of experts model). Microsoft should be comparing its model to good smaller models, not Haiku 4.5.

Qwen-3.6-27b is closer to Claude Opus 4.7 than it is to Haiku 4.5 in a lot of benchmarks - and it's way smaller than Microsoft's new model.

Sure, it competes with Haiku, but it shows how far behind Microsoft is compared with lots of other small models that are available.

IanCal 11 hours ago

> 98% smaller in terms of active parameters (since it's a mixture of experts model).

I don’t think that’s right: this Flash model has 5B active params. Qwen3.6-35B-A3B has 3B, so it's 40% smaller.
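
For anyone checking the percentages, here is a minimal sketch of the arithmetic. It uses only the parameter counts quoted in this thread (137B total / 5B active vs. 35B total / 3B active); the helper name `percent_smaller` is made up for illustration.

```python
def percent_smaller(reference: float, candidate: float) -> float:
    """Return how much smaller `candidate` is than `reference`, in percent."""
    return (reference - candidate) / reference * 100.0


# Parameter counts in billions, as quoted in the thread.
MAI_TOTAL, MAI_ACTIVE = 137.0, 5.0
QWEN_TOTAL, QWEN_ACTIVE = 35.0, 3.0

total_gap = percent_smaller(MAI_TOTAL, QWEN_TOTAL)
active_gap = percent_smaller(MAI_ACTIVE, QWEN_ACTIVE)

print(f"total params:  {total_gap:.1f}% smaller")
print(f"active params: {active_gap:.1f}% smaller")
```

So the total-parameter gap is roughly three quarters, while the active-parameter gap is 40%, not 98%.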

stingraycharles a day ago

I understand what you’re saying, but I am generally very careful when comparing models and their benchmarks; benchmarks often don’t really match “real world” quality.

yorwba 11 hours ago

The technical report https://microsoft.ai/wp-content/uploads/2026/06/main_2026060... has a lot of detail about decontaminating their training data and developing new in-house benchmarks to ensure reliable evaluation. If other models were just overfit to public benchmarks while Microsoft produced something that generalizes better to unseen data, they could've used those in-house benchmarks to argue that point.

Instead, they only do cherry-picked comparisons against Anthropic's small models, and not the full spectrum of competitors.

Without evidence to the contrary, I'll interpret this as just what happens when you're late to the party and insist on doing everything from scratch.

Maybe coaxing reasoning behavior out of their base model without kickstarting it by distilling from existing models provided them with valuable experience that will help improve their future models, or maybe it was an unnecessary waste of time.

fmajid 6 hours ago

If their model was trained purely on properly licensed data, the reduced legal liability could be a selling point.

minraws a day ago

They did release MAI-Thinking-1 to compete with Sonnet. Not sure why that isn't at the top here.

ignoramous 9 hours ago

Can't yet use MAI-Thinking-1? [0] And no indication of it being made available in GitHub Copilot, either.

[0] Not even here: https://playground.microsoft.ai/

giancarlostoro a day ago

Good question, and I missed that entirely!

lostmsu 20 hours ago

Compete? It is behind Kimi K2.6, which is in turn way behind Sonnet.

kristjansson a day ago

> 137B-A5B

Yeah, not a 5B param model as the earlier title implied!

epolanski 13 hours ago

So what other models use less than half of Haiku's tokens while providing a higher success rate?

akie 13 hours ago

Why is Haiku the benchmark, though? With code generation, don't we primarily care about the quality of the code, not the speed or efficiency at which it's generated?

NitpickLawyer 12 hours ago

You would be surprised how much code Haiku writes behind the scenes, with the whole 'plan w/ Opus, spawn subagents w/ Haiku' approach that cc takes. And you'd be surprised how useful the small models can be under some guidance / hand holding. You can daily-drive gpt5-mini and still find it useful. They're not as good as the big ones, obviously, and can't handle a project start-to-finish on their own, but given a well-scoped task, they'll do it just fine.

epolanski 12 hours ago

I'm not sure I follow, but I'll give you a very fresh example.

I was implementing a re-print functionality in my warehouse management system.

Opus 4.8 (high) took 24m1s and 87k tokens. Haiku took 6m30s and 41k tokens.

After that time I had to provide (minor) adjustments to both. But Haiku allowed me to iterate faster. Code quality for that somewhat trivial use case was similar.

Actually, I would even say that Opus provided a subpar solution: instead of fixing an issue where the carrier label PDF wasn't saved as the state machine progressed to the latest step, it went for a much more complex solution of regenerating those from scratch. That is also wrong, as it was de facto booking the carriers twice for the same order.

Haiku simply added another field on the terminal state that carried the already generated urls.

I don't think it's a good idea to default to the highest effort or the biggest model without taking into account the time it takes and the task complexity.

Imho we should experiment rather than assume that what the rest of the community does is the best practice.
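
For scale, here is a rough sketch of the ratios in the example above, using only the times and token counts quoted there. It is a single anecdote, so the numbers are illustrative rather than a benchmark, and the helper `to_seconds` is a made-up name.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a 'XmYs' duration into whole seconds."""
    return minutes * 60 + seconds


# Figures quoted in the comment above.
opus_s = to_seconds(24, 1)    # Opus 4.8 (high): 24m1s
haiku_s = to_seconds(6, 30)   # Haiku: 6m30s
opus_tokens, haiku_tokens = 87_000, 41_000

time_ratio = opus_s / haiku_s
token_ratio = opus_tokens / haiku_tokens

print(f"Opus took {time_ratio:.1f}x the wall-clock time")
print(f"Opus used {token_ratio:.1f}x the tokens")
```

On this one task the bigger model cost a bit over twice the tokens and well over three times the waiting, before counting the follow-up adjustments both needed.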

vinzenzu 12 hours ago

Totally agree. I've been using cheap Chinese open-source models via OpenCode Go, and they are faster, cheaper and in my experience arrive at the solution quicker because they are more pragmatic.

Yesterday Codex was making a big issue out of a module that was upgraded in our cluster, because of which the same SSH key would be "regenerated" by Terraform. No big deal: it just truncates a newline at the end of the SSH key, and it works all the same. But a model not realizing that something like this is unimportant can cost a lot more time than using the big models saves.

easygenes 16 hours ago

While I agree directionally, I'll caveat that "cost per token" != "cost per task". In the case of Qwen3.6 it tends to think 1.6x more than Haiku, so the cost of Haiku on the same tasks tends to only be about double. More detail from comparing their Artificial Analysis metrics:

  Qwen3.6-35B-A3B   vs   Claude Haiku 4.5
    reasoning mode · AA Intelligence Index v4.0
  
  46.0 ┤   ↖ better — cheaper · smarter · faster
       │
       │
  44.0 ┤     ╭─────╮
       │     │  ●  │ Qwen3.6-35B-A3B
       │     ╰─────╯
  42.0 ┤
       │
       │
  40.0 ┤
       │
       │
  38.0 ┤                                       ╭───╮
       │                      Claude Haiku 4.5 │ ○ │
       │                                       ╰───╯
  36.0 ┤
       └┬─────────┬─────────┬─────────┬─────────┬────────┬
        $200    $300      $400      $500      $600    $700
  
    x → cost to run the index (USD)        lower is better
    y → AA intelligence index              higher is better
  
    bubble area = output speed (tokens / sec)
          ╭─────╮                  ╭───╮
          │  ●  │ Qwen ~196 t/s    │ ○ │ Haiku ~93 t/s
          ╰─────╯                  ╰───╯
  
    ┌─────────────────────┬──────────┬──────────┬───────────┐
    │ model               │ AA index │ run cost │ out speed │
    ├─────────────────────┼──────────┼──────────┼───────────┤
    │ Qwen3.6-35B-A3B    ●│   43.5   │   $280   │  196 t/s  │
    │ Claude Haiku 4.5   ○│   37.1   │   $620   │   93 t/s  │
    └─────────────────────┴──────────┴──────────┴───────────┘


    COST PER TOKEN   ≠   COST PER TASK  
    output tokens per index run:
       Haiku 4.5    87.3M   (79.3M reasoning + 8.0M answer)
       Qwen3.6     143.2M   (131.7M reasoning + 11.5M answer)
       → Qwen emits 1.64× more output
  
    ── output speed (tokens / sec) ──────────  raw rate · higher = faster
       Qwen3.6     100%   ~196 t/s
       Haiku 4.5   ~47%   ~93 t/s
                                                  → Qwen ~2.1× faster per token
  
          ╎   1.64× more tokens  <  2.1× faster rate
          ▼
  
    ── solution speed (per finished answer) ──  higher = faster
       Qwen3.6     100%
       Haiku 4.5   ~78%
                                                  → Qwen ~1.3× FASTER to a solution
  
    SCORECARD
                            intelligence    cost / task     speed to solution
     Qwen3.6-35B-A3B        43.5            $280            ~1.3× faster 
     Claude Haiku 4.5       37.1            $620            (slower)
  
     → Qwen wins all three. The reasoning blow-up (1.64×) is smaller than
       the raw-speed edge (2.1×), so Qwen stays ahead per task.
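
The ratios in the chart can be reproduced with a short sketch. It uses only the figures from the table above and assumes, as a simplification, that time to a solution is just output tokens divided by output speed (no prefill, latency, or tool-call time). The `IndexRun` class is a made-up name for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexRun:
    name: str
    output_tokens_m: float  # output tokens per index run, in millions
    tokens_per_sec: float   # output speed
    run_cost_usd: float     # cost to run the whole index

    @property
    def generation_seconds(self) -> float:
        # Simplifying assumption: time = tokens / rate, nothing else.
        return self.output_tokens_m * 1e6 / self.tokens_per_sec


qwen = IndexRun("Qwen3.6-35B-A3B", 143.2, 196.0, 280.0)
haiku = IndexRun("Claude Haiku 4.5", 87.3, 93.0, 620.0)

token_ratio = qwen.output_tokens_m / haiku.output_tokens_m   # Qwen thinks more
rate_ratio = qwen.tokens_per_sec / haiku.tokens_per_sec      # Qwen emits faster
speedup = haiku.generation_seconds / qwen.generation_seconds
cost_ratio = haiku.run_cost_usd / qwen.run_cost_usd

print(f"tokens: {token_ratio:.2f}x, rate: {rate_ratio:.2f}x")
print(f"time to solution: {speedup:.2f}x faster, cost: {cost_ratio:.2f}x cheaper")
```

The speedup is simply the rate ratio divided by the token ratio, which is why a 1.64x reasoning blow-up does not erase a 2.1x raw-speed edge.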

HarHarVeryFunny 7 hours ago

How did you get that nicely formatted graph and table in your post?!

Krysoph 6 hours ago

> Text after a blank line that is indented by two or more spaces is formatted as code.

https://news.ycombinator.com/formatdoc

  crimes ↑
         │
   10.0  ┤                                           ● Airport burger
         │                                      ╭──────────────╮
    8.0  ┤                                      │  theft arc   │
         │                                      ╰──────────────╯
    6.0  ┤                         ● Five Guys
         │
    4.0  ┤              ● Food truck burger
         │
    2.0  ┤      ● McBurger
         │
    0.0  ┤ ● Homemade burger
         │
         └───────┬─────────┬─────────┬─────────┬─────────→ price
                $2        $8        $14       $22       $38

  ┌────────────────────┬────────┬──────────────┬────────────────────┐
  │ burger             │ price  │ crime index  │ expected behavior  │
  ├────────────────────┼────────┼──────────────┼────────────────────┤
  │ Homemade burger    │   $2   │          0.0 │ law-abiding citizen│
  │ McBurger           │   $6   │          1.4 │ steals extra napkin│
  │ Food truck burger  │  $11   │          3.1 │ lies about hunger  │
  │ Five Guys          │  $18   │          6.2 │ financial crime    │
  │ Airport burger     │  $34   │          9.7 │ enters villain arc │
  └────────────────────┴────────┴──────────────┴────────────────────┘

  conclusion: burger inflation is a gateway condiment

HarHarVeryFunny 6 hours ago

Thanks, so in this case the value of "code formatting" is using a fixed-width font?

The next question is where did the "ASCII-art" graph and table come from? Are there sites to generate these?

Krysoph 4 hours ago

The code formatting puts the content into a <pre> which preserves spaces, indentation and line breaks.

Just built a tool for that: https://krysoph.github.io/UnicodeData/

It is a single HTML file with no dependencies; it takes JSON data and turns it into Unicode charts.

Source: https://github.com/Krysoph/UnicodeData
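
The rule quoted from the formatdoc (a blank line, then lines indented by two or more spaces) can also be applied by hand. A minimal sketch follows; `to_hn_code` is a hypothetical helper, not part of the tool linked above. Whether a whitespace-only line stays inside the same block is not something the quoted rule states, so the sketch simply indents every line.

```python
def to_hn_code(text: str, indent: int = 2) -> str:
    """Indent every line so the text follows the quoted code-formatting rule."""
    if indent < 2:
        raise ValueError("the quoted rule asks for two or more spaces")
    prefix = " " * indent
    # Drop leading/trailing newlines so the block starts and ends on content.
    lines = text.strip("\n").splitlines()
    body = "\n".join(prefix + line for line in lines)
    # The leading newline becomes the blank line separating the block
    # from whatever prose comes before it in the comment.
    return "\n" + body


print("Some prose before the block." + "\n" + to_hn_code("aaa\nbbb"))
```

Pasting the printed result into a comment box should give prose, a blank line, and then the two indented lines.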

HarHarVeryFunny 32 minutes ago

Neat!

If I use your tool and "Copy HN-ready" and paste here then it works, but oddly if I then edit the post the formatting is lost.

Also, if I just manually post, starting with a blank line, followed by a couple of lines starting with two spaces (e.g. " aaa", " bbb"), then I'm not getting the <pre> code formatting. Any idea what I might be doing wrong?
