Claude.ai is now at 98.85% uptime. There have been so many frustrations with Claude / Anthropic lately (very heavy usage limits, A/B testing gone wrong, etc.).
Claude status: https://status.claude.com/
I have been really happy with my Codex subscription lately, but it feels like these things change every other day. The OpenCode Go subscription for trying out GLM, Kimi, Qwen, DeepSeek and friends also looks useful.
Nonetheless, Opus 4.6 is a very capable model, but justifying a Claude subscription gets more and more difficult. I think I might just use it occasionally through OpenRouter or as part of something like Cursor (although I'm not sure about the value of that subscription either).
OpenCode Go: https://opencode.ai/go
Cursor: https://cursor.com
There were periods where I was entirely unable to use Claude Code for an hour or more because the auth gateway kept returning 500s or timing out. There was an "elevated errors" incident shown on status.claude.com, but zero minutes of downtime were recorded (not even "partial outage"). So the real uptime should be even worse.
The real uptime being worse than reported is basically an iron law of status pages. You happened to hit one outage and I'm sure many others hit separate outages at different times that also weren't counted.
Sure. The difference is that there aren't many other services with uncounted hour-long or multi-hour outages.
April has been a crazy month for open weights models. I've been using Claude Code for work and Kimi 2.6 for personal projects, and Kimi has been very good. GLM-5.1 is also great. Qwen, Mimo and DeepSeek I need to test some more, but they have all been producing good results. I have the impression that they are all at the same level as, or close to, Sonnet 4.6.
I've been using qwen 3.6 with oMLX on my M1 Mac Studio and it's been awesome. It took a while to get things set up, figure out which of the hundreds of models would be a good fit for my use case, and then get it strapped into opencode's harness, but it works! It's slower than a hosted model, obviously, but I'm tickled pink that I can give it a relatively complex chore, like I would have with my Claude Pro subscription, and it'll churn away on it with good results and no god damn arbitrary usage limits.
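For anyone curious what a local setup like this roughly looks like, here is a minimal sketch using the mlx-lm Python package (a comparable MLX-based runner); the model name and parameters are illustrative, not the exact setup described above:

    # Minimal local-inference sketch with mlx-lm (pip install mlx-lm).
    # Model name is illustrative; pick whatever quantized build fits your RAM.
    from mlx_lm import load, generate

    model, tokenizer = load("mlx-community/Qwen2.5-Coder-32B-Instruct-4bit")

    prompt = "Write a shell one-liner that lists the 10 largest files in a directory."
    text = generate(model, tokenizer, prompt=prompt, max_tokens=512)
    print(text)

If memory serves, mlx-lm also ships an OpenAI-compatible server mode, which is the easier way to point a harness like opencode at a local model.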
What are you running them on?
Harness: opencode
Subscription: opencode go
I also use a claw agent[1] via Telegram, which uses pi.dev under the hood with my opencode go subscription.
[1] I forked one of those Claw projects (bareclaw) and made many changes to it.
When you say harness what do you mean? I see the term thrown around a lot and I think it's lost its meaning in some fashion.
In this context it means the tool you use the models with. So Claude Code is a harness; OpenAI's Codex, Opencode, pi, etc. are all CLI harnesses.
Gotcha, thanks!
A fellow user replied below, but it refers to the software that uses the LLM (Claude Code, Opencode, pi.dev, etc.).
---
Funny you mention that, because I started noticing the word 'harness' being used everywhere about a month ago, even though I hadn’t seen it before (in this context). As I don’t trust my memory, I assumed I had just been overlooking it and added it to my vocabulary. However, a Google Trends search does show increased usage since the end of March: https://trends.google.com.br/trends/explore?date=today%203-m...
Interesting timing, because I think it was in March when I had a chat with Gemini about what the heck these things are supposed to be called, and that's where I first heard the term.
It's probably just a coincidence. But that would be pretty interesting if we have an example of some kind of memetic phenomenon where one or more popular LLMs makes a claim that people then start to repeat as true, or at least follow up on it and start writing about it, and in so doing the claim becomes true. Even if it didn't happen in this case, I feel like it's only a matter of time.
Not OP, but having explored the field a good bit, Openrouter + pi harness in a devcontainer work great as a sane starting point.
Highly recommend as a clean way to try out the upstart models.
They are close to Opus, not Sonnet.
The little Qwen 3.6 is at Sonnet level; Kimi 2.6 is about Opus. The former you can run on a single GPU in your gaming PC. The latter you can run much cheaper through a provider, or, if you are really wealthy and have lots of GPUs, you can run it yourself.
Not sure where DeepSeek 4 sits.
Kimi 2.6 is nowhere near even Sonnet in overall robustness. It can get close when everything goes perfectly.
I have about 1 KLOC of harness code, written by Kimi, to work around quirks in Kimi that no other model I've tested needs, such as infinite tool-call loops and other weirdness.
You can do quite a bit with it and never run into those quirks, or you might hit them on every request.
It is very sensitive to "confusing" things about its environment in a way Sonnet and Opus are not.
Still great value, but they have some way to go.
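To make "harness code to work around quirks" concrete, here is a minimal sketch of the kind of tool-call loop guard I mean; the names, thresholds, and call shape are illustrative, not the actual harness code:

    # Sketch of a guard against a model re-issuing the identical tool call forever.
    # Thresholds and the call shape are illustrative, not from any specific harness.
    import json
    from collections import Counter

    MAX_IDENTICAL_CALLS = 3

    def call_fingerprint(name: str, arguments: dict) -> str:
        """Stable key for a tool call: name plus canonicalized arguments."""
        return f"{name}:{json.dumps(arguments, sort_keys=True)}"

    class LoopGuard:
        def __init__(self) -> None:
            self.seen = Counter()

        def should_block(self, name: str, arguments: dict) -> bool:
            key = call_fingerprint(name, arguments)
            self.seen[key] += 1
            return self.seen[key] > MAX_IDENTICAL_CALLS

    # Inside the agent loop: if should_block() returns True, send back a synthetic
    # tool result telling the model the call was already made instead of executing it.
    guard = LoopGuard()
    assert not guard.should_block("read_file", {"path": "README.md"})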
Would "lots of gpus" even help for huge models? Maybe this is exposing my lack of knowledge but don't you need to keep the whole model and context in a single GPU's VRAM? My understanding is that multiple GPUs help with scaling (can handle N X inference requests simultaneously) but it doesn't help with using large models. If that were the case, I could jam another GPU in my box and double the size of model I can serve.
> Would "lots of gpus" even help for huge models? Maybe this is exposing my lack of knowledge but don't you need to keep the whole model and context in a single GPU's VRAM?
How do you think the large providers do inference? No single GPU has 1TB-plus of memory on board. It's a cluster of a bunch of GPUs.
1T-parameter model instances (Opus, GPT, etc.) are not running on a single GPU. The catch is how the cards communicate and how the model is broken up. There's a bit that goes into it, but the answer is yes: the more GPUs, the bigger the model you can run.
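A toy illustration of the idea in plain PyTorch: put half the layers on each device and let the activations hop between them. Real serving stacks are far more sophisticated, but the shape of the split is the same.

    # Toy model-parallel sketch: half the layers on each GPU, activations cross between them.
    # Requires two CUDA devices; an illustration of the concept, not production code.
    import torch
    import torch.nn as nn

    class TwoGPUModel(nn.Module):
        def __init__(self, hidden: int = 4096, layers_per_gpu: int = 4) -> None:
            super().__init__()
            self.part0 = nn.Sequential(
                *[nn.Linear(hidden, hidden) for _ in range(layers_per_gpu)]
            ).to("cuda:0")
            self.part1 = nn.Sequential(
                *[nn.Linear(hidden, hidden) for _ in range(layers_per_gpu)]
            ).to("cuda:1")

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            x = self.part0(x.to("cuda:0"))
            x = self.part1(x.to("cuda:1"))  # activation tensor crosses the interconnect here
            return x

    model = TwoGPUModel()
    out = model(torch.randn(1, 4096))
    print(out.device)  # cuda:1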
Really cool. I'm very much still learning about this stuff. Sounds like this inter-GPU communication is a feature of special hardware (not consumer GPUs).
Ever hear of SLI (since succeeded by NVLink)? It's a GPU interconnect that's been available for a good long while now on high-end Nvidia GPUs. I believe AMD's equivalent is called CrossFire.
GPU interconnect speeds are a big bottleneck today for GPUs in AI applications. Data can't move between them fast enough.
Not really; there are various ways it can be done, and I think even the old 1080 Tis could do it. Keep reading about it; my interest is in small models on a single GPU, though, so I don't fuss over those details.
Most consumer cards included faster interconnects until a generation or so ago, when they decided they wanted to differentiate their data center hardware more and removed the links that had been on the cards in various forms for 20-plus years.
Please don't oversell them. E.g. Kimi K2.6 has a maximum context size of 270k; that's a quarter of Opus's.
The model is fine, I've switched to it entirely for a personal project, but it's not Opus.
And no, you're not running them locally unless you're a millionaire. You still need hundreds of GB (500+) of VRAM on your graphics cards - that's not at the level of consumer electronics.
Sure you can run the quantized models, but then you're at Haiku performance.
Whilst I agree with the premise, I think you are actually underselling them.
Claude becomes near-lobotomized beyond 500,000 tokens. I don't believe much quality code gets output at such high token counts, not to mention the drastically increased cost.
270k isn't massive, but it's very usable with compaction. Not every task needs the full context history.
Quantized models do have a quality / accuracy impact, although it is not as drastic as you suggest. There is some good data on this [0].
"These findings confirm that quantization offers large benefits in terms of cost, energy, and performance without sacrificing the integrity of the models. "
One thing worth mentioning is that quantized models are not created equal; they don't all scale at the same rate [1]. For example, not all tensors contribute equally to model accuracy. In practice, the most sensitive parts (such as key attention projections) are often quantized less aggressively to preserve the quality of the inference (a minimal example follows the references below).
[0] - https://developers.redhat.com/articles/2024/10/17/we-ran-ove...
[1]- https://medium.com/@paul.ilvez/demystifying-llm-quantization...
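As a rough illustration of that mixed-precision idea, here is a sketch of a 4-bit load with transformers + bitsandbytes that keeps one sensitive module unquantized. The model name and the skipped module are illustrative, and if I'm remembering the transformers API right, the skip list is honored for 4-bit loads as well as 8-bit.

    # Sketch: 4-bit NF4 quantized load, with a sensitive module kept in higher precision.
    # Model name and skipped module are illustrative; needs transformers + bitsandbytes.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",              # NormalFloat4, a common weight-quant default
        bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls still run in bf16
        llm_int8_skip_modules=["lm_head"],      # leave the output projection unquantized
    )

    model = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen2.5-7B-Instruct",
        quantization_config=quant_config,
        device_map="auto",
    )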
Qwen 3.6 runs on a single GPU. But I mostly agree with you, except that just because a model has a given context size doesn't mean it's all available or entirely reliable.
You can run the big models in RAM, including via offloading weights from disk. They will be extremely slow on ordinary hardware, but they will run. Hundreds of gigabytes of RAM is a viable purchase for many, and the footprint can be split over multiple nodes with pipeline parallelism. If that's still too slow for the total throughput you expect to need on an ongoing 24/7 basis, that's when it becomes sensible to think about adding discrete GPUs for acceleration.
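A minimal sketch of that kind of offload with transformers/accelerate, spilling whatever doesn't fit in VRAM into system RAM and then onto disk; the model name and memory caps are illustrative:

    # Sketch: let accelerate place layers on GPU first, then CPU RAM, then disk.
    # It works, but throughput drops sharply once weights live off the GPU.
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "deepseek-ai/DeepSeek-V3",            # illustrative; any large checkpoint
        device_map="auto",                    # fill GPUs, then CPU, then disk
        max_memory={0: "24GiB", "cpu": "256GiB"},
        offload_folder="/tmp/llm-offload",    # spill remaining weights here
        torch_dtype="auto",
    )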
Yes multiple GPUs absolutely help with inference even for a single model instance. Some models are simply too big to fit on the largest available GPU.
Check out tensor parallelism
Tensor parallelism is not useful on consumer platforms with slow interconnects, unless compute is really low and you prioritize decreasing latency over throughput. Pipeline parallelism (and potentially expert parallelism) is more workable.
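For reference, in a stack like vLLM these are just a couple of knobs, assuming you have the hardware behind them; the checkpoint and sizes below are illustrative:

    # Sketch: vLLM's offline API with a tensor-parallelism knob.
    # tensor_parallel_size shards each layer across GPUs (wants a fast interconnect);
    # there is also a pipeline_parallel_size option for splitting the layer stack into
    # stages, which tolerates slower links between stages.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="Qwen/Qwen2.5-72B-Instruct",  # illustrative checkpoint
        tensor_parallel_size=4,             # shard within a node
    )

    outputs = llm.generate(
        ["Explain tensor vs pipeline parallelism in one paragraph."],
        SamplingParams(max_tokens=256, temperature=0.2),
    )
    print(outputs[0].outputs[0].text)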
Based on benchies or experience?
What is benchies?
I think OP means standardized benchmarks.
The last few days I've seen more degradations and canceled my Max subscription.
Presumptuous and wrong "memories" from a one-off command which affect all future commands, repeated/nonsensical phrases in messages, novel display bugs which make going back in the conversation impossible (I can't tell where I am), lack of basic forking features (resume a current convo in a second CC instance -> fork = no history for that convo?), poor/unclear reasoning, a new set of unclear folksy phrases (it really wants to "cut code" all of a sudden).
Qwen + Opencode has been a game changer: it runs very well on a 4090 for basic/exploratory/private tasks, and being able to switch between frontier models (via OpenRouter in my case) to avoid vendor lock-in feels like basic hygiene.
There's also the homo economicus psychological difference between having a token budget to use up, and a cost per token. I'm more thoughtful about my usage now.
Would you mind sharing what specific Qwen model you are running and how (Llama, vLLM, etc.)?
If only opencode weren't super buggy: it is really bad about not returning responses, wasting tokens, duplicating responses, lagging, etc. It is nowhere near Claude Code's level, not even close. Even Codex, which is also not near Claude, is much better than opencode.
> Claude.ai is now at a 98.85% uptime.
So, at least better than GitHub, right? :)
Depends on how you count downtime, since Anthropic has far fewer distinct services.
But then again, theirs are much harder to run.
Codex randomly stops working because of some silly cybersecurity detector. Insane number of false positives. Last time it happened, I was just letting it write me a small tool to translate the text in my clipboard. What cybersecurity? The code wasn't even published, or remotely like anything hacking related. I'm always letting AI write boring CRUD tools that I don't want to code myself.
It's bordering on being useless.
It's probably their system prompt. Unlike Claude Code, they don't ban you for using a different harness with their subscription (for now). If you use pi, their "safety" is off. Works great for me.
I have spent the past week using opencode go with DeepSeek V4 Pro and Claude Code with Opus 4.7 side by side and... they are both good. They are different, each with its good and bad sides... but they do get things done. OpenCode especially has been a very enjoyable experience. Thank you Anthropic for all the downtime; I would probably not have explored alternatives otherwise. I can vouch for the OpenCode Go sub!
The "nines" measure of uptime is not some divine law. Even 80% Claude uptime would still be great value for money.
You just need to have some idea of what to do when your frontier model is not available. Use Qwen? Read the code you've been generating?
Multi-model coding tools seem like the obvious, sane path forward, but the Will to Lock-in is strong.
Open multi-model tools will start dominating as soon as frontier labs stop massively subsidising their models inside their own tools and align with API pricing. Personally I think the inflection point is near, considering the slew of recent drama with Claude Code.
Claude Code and Codex are solid, but the real reason people use them over open alternatives is the dramatically lower overall cost.
Generally speaking I think we should expect better.
But it did remind me of how Japanese websites sometimes have opening hours. The website shows a "closed" status page outside those hours.
Which I think makes some sense for some services for two reasons: your customers build habits and expectations around available service hours, and that in turn gives you regular maintenance windows that can accommodate large impactful changes.
It is one of the reasons a 24-hour public transit network doesn't make complete sense. You shouldn't disrupt a service once people have come to rely on it, but you can't disrupt a service you never provided in the first place.
New codex limits make it unusable though. Switched to Opencode.
Codex has been pretty reliable. Google's API is a trash fire of 503s on their paid models. Copilot is a lottery too.