Right. Opus 4.5 8 months ago, good enough for agentic coding. How far behind that are open weight models? More than 8 months? But how much more? When will they reach Opus 4.5 level? A few months from now? A year from now? Never?

The power of Opus isn't just the model, it's in the harness too.

You can try it by using Opus through GitHub Copilot vs the official Anthropic tools. You'll get very different results and a very different experience (in my opinion).

GitHub Copilot in vscode has two ways to access Opus: the Copilot harness or the Claude Code Agent SDK within Copilot.

And that's if we assume that the vscode GHCP default Agent ("Local") is the same as the "Copilot CLI" one that is also selectable in vscode. I have not tried that one.

A few weeks ago the Claude Code Agent SDK was much better than the default Copilot Agent, but nowadays I am not sure.

Open source harnesses are also improving rapidly.

Some people would claim they are already far better than CC and Codex.

Which harnesses, and as of when?

I’ve only used Opus in GitHub Copilot and was hugely underwhelmed. It was barely usable. Are you saying it’s better with the official Anthropic tools?

Night and day in my opinion. But these are all purely Feels so YMMV etc.

I like how especially the Claude Code CLI version communicates how it's progressing, something they hide a lot more on the desktop app for example.

I don't know about better, but it's certainly different. It's painfully slow through the Claude Code vscode extension compared to Copilot, but maybe "smarter": I feel like I have to correct it less, using Sonnet on both. I don't use Opus much because of the cost, but coworkers say the difference between harnesses there is also pronounced.

I've tried Opus 4.6 in the Opencode harness through the GitHub Copilot API, and I've tried Opus 4.8 in Claude Code. I found I preferred Opus 4.6 in Opencode (and in general, I like Opencode much more in that it hid less from me). I found both to be pretty similar as far as efficacy goes (I was surprised that Opus 4.8 felt like such a minor improvement over 4.6).

I think in the next 6 months we will have Opus 4.5 performance in open models. We are very close.

We first need to reach the level of Sonnet 4.x, and we aren't there yet.

GLM 5.2 is comfortably at Sonnet 4 level at the very least. Same with MiniMax M3.

GLM 5.2 came out today and the early reports have been quite good. It's very difficult to run on anything short of prosumer hardware, but a small business could do so quite easily (or use something like OpenRouter).

Opus also has a deeply ingrained personality that always sneakily derails into what it was taught, not what the user intends. This is good if the user doesn't know the details of the work they need performed, and a huge waste of time when the user knows exactly how something needs to be implemented.

I have found Claude models, especially fable, to be impossible to work with when the work requires reading papers from days ago and reasoning on top of the findings in them. I have had multiple long sessions with Opus (not as many with fable, as it got taken down quickly) where it keeps fighting me on problems, saying "that's not how it works" / "that is not possible", followed by me linking the paper (after I've told it to actually read up on the latest research in this field), and it hits me with the usual "You were right." If your workflow uses the exact tools, frameworks, and git layouts that Claude expects, it can be magical, yes. But it is very heavily optimized to never say 'I am not sure' (as that gives 'bad vibes') and instead lean on its (nowadays with the speed of things DOE) knowledge to formulate a reasonable-sounding answer, dissectible only if you already know the answer beforehand (which defeats the purpose of using it in the first place).

Qwen3.6 27B (the only <100B model worth looking at in my experience) is dumb, knows it, and will fight tooth and nail to complete the task it was given, gaining the needed context (online or file-wise) in the meantime. If you mention it should read papers, it goes and reads a pile of papers. If you tell it 'implement MCP in my app', the result will (probably) be catastrophic. If you instead describe where the feature should sit, how it should handle edge cases, what use cases it needs to attend to, and to first look online for reference implementations, it does it and does it well.

Knowing what is in context, what should and shouldn't be there, and how to manage it for the specific model you are using (as every model, even in the same family, responds differently to differently worded prompts) is what makes or breaks them. They are just auto-complete: they complete text based on what is already there. It's not magic.

So yes, while these small open-weights models are not Opus 4.5, they are good precisely because of that: they are a good tool and a bad 'coworker replacement'. If you want the latter, Kimi is already there; it has started to not believe the user and to do what it was taught, just like the Claude models (which is helpful when you don't care about implementation specifics or performance/security). GLM models (mostly 5.1, I haven't tested 5.2 extensively yet) have fixed a lot of low-level programming issues I've had, where Opus just walks in circles and writes reports that "it doesn't/can't work". That is to say, open-weights models have, in many cases, already surpassed Opus. I can't comment on GPT 5.5, but while I used 5.4, it also performed a lot more tasks without being fussy than Opus 4.6/4.7.

> I have had multiple long sessions with Opus (not as many with fable, as it got taken down quickly) where it keeps fighting me on problems, saying "that's not how it works" / "that is not possible", followed by me linking the paper (after I've told it to actually read up on the latest research in this field), and it hits me with the usual "You were right."

I genuinely do not understand why people not only just put up with this but also pay _a lot of money_ for the _privilege_ of doing so.

It's like having _the worst_ colleague but you actually go out of your way to talk with the guy. Why.