I’d encourage devs to use MiniMax, Kimi, etc. for real-world tasks that require intelligence. The downsides emerge pretty fast: much higher reasoning-token use, slower outputs, and palpable degradation. Sadly, you do get what you pay for right now. However, that doesn’t prevent you from saving tons through smart model routing, being smart about reasoning budgets, and using max output tokens wisely. And optimize your apps and prompts to reduce output tokens.
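A minimal sketch of that routing idea, for the curious. The model names and the difficulty heuristic here are purely illustrative (a real router would classify task difficulty properly), not any actual API:

```python
# Naive cost-aware router: send easy prompts to a cheap model and
# escalate harder ones to a frontier model, while capping output
# tokens and reasoning effort so budgets stay predictable.

CHEAP_MODEL = "minimax-m2"   # hypothetical identifiers
FRONTIER_MODEL = "opus-4.6"

def pick_model(prompt: str, max_output_tokens: int) -> dict:
    # Toy heuristic: long prompts or refactors count as "hard".
    hard = len(prompt) > 2000 or "refactor" in prompt.lower()
    return {
        "model": FRONTIER_MODEL if hard else CHEAP_MODEL,
        # Cap output tokens so a rambling answer can't blow the budget.
        "max_tokens": max_output_tokens,
        # Spend reasoning budget only where it pays off.
        "reasoning_effort": "high" if hard else "low",
    }

print(pick_model("Fix the typo in README", 512)["model"])   # → minimax-m2
print(pick_model("refactor the auth module", 4096)["model"])  # → opus-4.6
```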
I won’t use anything less than the SOTA. I tried using Opus 4.6 medium and immediately regretted it. High messes up often enough as it is.
What were you using 6 months ago?
Opus 4.5 ~= Opus 4.6 high. Opus 4.5 was nerfed just before or after the release of 4.6.
The models don’t change.
On paper. There's a huge financial incentive to quantize the crap out of a good model to save cash after you've hooked in subscribers.
And there’s an incentive to publish evidence of this to discourage it; do you have any?
Models aren't just big bags of floats you imagine them to be. Those bags are there, but there's a whole layer of runtimes, caches, timers, load balancers, classifiers/sanitizers, etc. around them, all of which have tunable parameters that affect the user-perceptible output.
There really always is a man behind the curtain eh?
Often it's literally just that:
https://www.msn.com/en-us/money/other/ai-startup-backed-by-m...
It's still engineering. Even magic alien tech from outer space would end up with an interface layer to manage it :).
ETA: reminds me of biology, too. In life, it turns out that the simpler some functional component looks, the more stupidly overcomplicated it is once you look at it under a microscope.
There's this[1]. Model providers have a strong incentive to switch (a part of) their inference fleet to quantized models during peak loads. From a systems perspective, it's just another lever. Better to have slightly nerfed models than complete downtime.
[1]: https://marginlab.ai/trackers/claude-code/
So - as the charts say - no statistical difference?
Isn't this link an argument against the point you are making?
The chart doesn't cover the 4.6 release which was in the end of December/early January time frame. So, it's hard to tell from existing data.
Anybody with more than five years in the tech industry has seen this done in every domain, time and again. What evidence do you have that AI is different? That's the extraordinary claim in this case...
Or just change the reasoning levels.
Real world usage suggests otherwise. It's been a known trend for a while. Anthropic even confirmed as such ~6 months ago but said it was a "bug" - one that somehow just keeps happening 4-6 months after a model is released.
Real world usage is unlikely to give you the large sample sizes needed to reliably detect the differences between models. Standard error scales as the inverse square root of sample size, so even a difference as large as 10 percentage points would require hundreds of samples.
https://marginlab.ai/trackers/claude-code/ tries to track Claude Opus performance on SWE-Bench-Pro, but since they only sample 50 tasks per day, the confidence intervals are very wide. (This was submitted 2 months ago https://news.ycombinator.com/item?id=46810282 when they "detected" a statistically significant deviation, but that was because they used the first day's measurement as the baseline, so at some point they had enough samples to notice that this was significantly different from the long-term average. It seems like they have fixed this error by now.)
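For intuition, the sample-size claim can be sketched with a standard two-proportion power calculation. The pass rates below are illustrative, not measurements, and the z-quantiles are the usual alpha=0.05 / power=0.8 choices:

```python
import math

def required_n(p1: float, p2: float) -> int:
    """Per-group samples for a two-sided z-test at alpha=0.05, power=0.8."""
    z_alpha = 1.96  # Phi^-1(0.975), two-sided 5% significance
    z_beta = 0.84   # Phi^-1(0.80), 80% power
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * var / (p1 - p2) ** 2)

# Detecting a 10-point drop from a 56% pass rate takes ~388 runs per arm;
# at 50 tasks per day, that's over a week of data for each condition.
print(required_n(0.56, 0.46))  # → 388
```

And with only 50 tasks, the 95% interval on a ~50% pass rate is roughly ±14 percentage points, which is why day-to-day wiggles on a tracker like that are mostly noise.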
It's hard to trust public, high-profile benchmarks because any change to a specific model (Opus 4.5 in this case) can be rejected if it regresses on SWE-Bench-Pro, so everything that gets released will perform well on that benchmark.
Any other benchmark at that sample size would have similarly huge error bars. Unless Anthropic makes a model that works 100% of the time or writes a bug that brings it all the way to zero, it's going to work sometimes and fail sometimes, and anyone who thinks they can spot small changes in how often it works without running an astonishingly large number of tests is fooling themselves with measurement noise.
They do. I'm currently seeing degradation on Opus 4.6 on tasks it could do without trouble a few months back. Obviously I'm a sample of n=1, but I'm also convinced a new model is around the corner and they preemptively nerf the current model so people notice the "improvement".
Make that 2, I told my friends yesterday "Opus got dumb, new model must be coming".
I swear that different sessions route to different quants. Sometimes it's good, sometimes not.
You sure about that?
https://marginlab.ai/trackers/claude-code/
Well, I don't see 4.5 on there ... so I'm not sure what you're trying to say.
And today is a 53% pass rate vs. a baseline 56% pass rate. That's a huge difference. If we recall what Anthropic originally promised a "max 5" user https://github.com/anthropics/claude-code/issues/16157#issue... -- which they've since removed from their site...
50-200 prompts. That's an extra 1-6 "wrong solutions" per 5 hours ... and you have to get a lot of wrong answers to arrive at a wrong solution.
Only nominally...
Oh yes, they do.
I think the conspiracy theories are silly, but equally I think pretending these black boxes are completely stable once they're released is incorrect as well.
No conspiracy theories. Companies being scumbags, cutting corners, and doctoring benchmarks while denying it. Happens since forever.
You cannot afford the SOTA.
Why is that? The $200 per month subscription comes with a ton of usage.
Opus 4.6 is available on the $20 plan too
> The $200 per month subscription comes with a ton of usage.
$200 + VAT is half of my rent.
I know HN is not a good place to rant on this subject, but I'm often flabbergasted by the number of people here who live in a bubble with regard to the price of tech. Or just prices in general.
I remember someone who said a few years ago (I'm paraphrasing): "You could just use one of the empty rooms in your house!" It was so outlandish I believed it was a joke at first.
EDIT: "not", minor grammar
Thanks for the alternative perspective.
I think I'm in the middle. I can afford $200/mo, but it wouldn't be a no-brainer. And I don't pay it, as I barely use AI at home enough to warrant it.
I'm also amazed at the richer end of HN, but now I realize I'm privileged. Earned it? Like fuck I did. Lucky to be born a geek in the late 20th century. I'd be useless as a Middle Ages guy.
If I found myself in the middle ages I’d just become a blacksmith or a miller.
Do you have the genetics for that? It takes a lot of raw strength, and not that much intelligence.
The other part of the bubble is assuming everyone works on projects that allow disclosing any code or project details to a generic third party with that kind of power asymmetry.
That's why AI is for the "rich". Poor people, and later the middle class, will be left behind...
Nah, that's why you can't afford *not* to have the subscriptions these days. Whatever your needs, ever since Claude Code became a thing, subscription costs come out massively cheaper than pay-as-you-go per-token API pricing. Also, SOTA models are so much better than anything else that using older or open models will just cost you more in tokens/electricity than going for a SOTA subscription.
Subscriptions are definitely middle-class targeted. $20/month is not much for the value provided, at least not in the western world.
But if by "rich" you just mean "westerners", then in this sense, the same is and has always been true for computing in general.
The subscriptions are purposely sold for less than cost. The subsidy will end some day.
We'll cross that bridge when we come to it. Especially in the context of discussing life at different economic strata, customers are neither expected nor supposed to voluntarily overpay out of a belief that this will stop an industry from trying to rugpull everyone at some point.
Not sure. AI is around car-ownership price. I think while that isn't poor, it is middle class.
So like if you want to start a business of any sort the AI sub is still peanuts.
AI is a car, or a dog, or a mild social life, or a utility-bill level of cost. And that's for the level needed by a sane, typical developer. (AI maximalists need $250k/y; let them slop it out.)
It is not a Cessna, an infinity pool or a 1 month vacation.
It’s a good reminder. Claude Max costs about as much as the global poverty line ($3/day). I think it’s okay to invest in it, but we should try to make sure it’s worthwhile, and also invest in charity.
$200/mo is a lot, sure, but the shocking part of that comparison is your rent. I didn’t know $400/mo apartments still existed. For most people in the US and EU, $200 would be closer to 15%-20% of rent I think? My cell phone bill for my family is almost $200/mo.
Last year, at first, $200 seemed crazy. Now that I’m getting addicted to coding agents, not so much. Some companies are paying API rates for AI for employees, and it’s a lot more than $200/mo. It seems like funny money, and I’m not sure it’ll last.
It is my belief that rent scales with the leftover income people have after they've paid for other necessities. I.e., if you're from a poorer country/area, things like milk and gasoline will cost a similar amount (maybe a 2x difference), but rent will cost a lot less. As people in a country get richer, they pay a larger and larger share of their income as rent of various forms.
Even the US has places with cheap rent/housing. The downside is that there's no (well-paying) work nearby.
It’s true that average rents are regional and poorer areas have lower rents, but that doesn’t tend to make much difference in urban areas and large cities, where the majority of people now live. Why do you feel that rent scales with disposable income? Economists generally say the opposite, based on housing being a core necessity: people pay rent in proportion to their income, and only what’s left over is the disposable amount. That’s why we have the 30% rule, for example.
You’re technically correct, btw, rental housing is a market and is subject to market forces, meaning what people are willing to pay. I’m just not so sure about framing rent as being lower priority than other necessities. And rent prices have been increasing faster than other necessities, and faster than income, so that might be a confounding factor in your argument.
Still, my initial reaction above is due to the fact that in the US and in Europe in most large cities, the average rent is north of $1000/mo.
In the US/Western Europe? Because for devs, especially in the former, $200 is pocket change, especially for a core productivity tool. And rent would easily be in the $1200 to $3000 range. Same for houses. Maybe not in NY or SF, but in most of the US there's no shortage of house space and redundant rooms.
I've seen those comments about $200/month and empty rooms here, so I suppose they mainly come from the US, yes.
So yes, you describe a situation that, I feel, a lot of people here don't realize is not the norm.
I compared the subscription with my rent precisely because it's easier to compare: with your numbers, it would be like paying $600 to $1500/month. Pretty hard to justify.
> Because for devs especially
Are you not a dev? If not, what would you use a coding tool for? They still require handholding for anything largeish. Still much cheaper than outsourcing.
You think I don't understand that? I'm friends with people who make little more than that amount per month.
But it's not all that relevant to this conversation. It's not like this is the first time economic inequality is a thing.
It's about as relevant as factoring in your salary the next time I go buy a car.
First, I'd assumed you were in the bubble I described, but that's not the case, so sorry about that.
Also, I think it's relevant to the conversation.
To someone who said that "you" (an undirected pronoun, I suppose) can't afford the SOTA, you replied that the $200/month Anthropic subscription comes with a ton of usage. So I interpreted it as a general statement. Is that not what you meant?
I'm a bit lost about who you're talking to/about in your first comment: the person you respond to, a general statement for everyone reading, or yourself?
I assume when somebody says "you" and is not talking about anyone in particular, they mean that it's infeasible for virtually everybody, which is certainly not the case. Also, you conveniently disregarded the fact that it's available on the $20 per month plan.
Okay, I understand better. I interpreted your answer as "well, it's $200, everybody can afford it". Clearly a misunderstanding.
Going back to the $20 plan, yes, I agree it's much more accessible.
I didn't mention it because I've seen a lot of comments here, on blogs, and on social media about how a $200 Claude subscription is a no-brainer. It got on my nerves, so I wanted to point out how much money that can be. To you (which was misguided, reading your answers), and to concerned HN commenters in general.
For me I pass the token costs off to my clients. Not everyone is a hobbyist burning their own cash on personal projects
Work pays.
I'm not sure I've correctly understood what you're implying.
If it's that I'm not working, well, I'm employed.
If it's that I'm not working enough to have this money... well, we're back to the bubble. Not everywhere in the world can you easily find a job that pays you enough, even if you accept to work more. And the employer will not agree to give developers a $200/month subscription, even less for personal use.
If it's that I'm not working enough and I should go freelancing to work as much as I want and get rich (I'm extrapolating). Well, you're right, I could do that. But (at least at first), I would work a lot more for much less money. And even if I become a recognized freelancer, it doesn't change the fact that I'll earn less money compared to the baseline of SF, or even the USA in the tech sector in general. So, bubble again. I could also, like someone said, put the tokens cost into my hourly/daily rate, but I'll be much more expensive than other freelancers.
Also, but that's a "me case" compared to my previous points, health issues can greatly affect how much work you can do.
> I could also, like someone said, put the tokens cost into my hourly/daily rate, but I'll be much more expensive than other freelancers.
Do you have any evidence of that? I think the OPs are assuming this as a premise so their logic is probably valid but may not be sound logic for you.
I don't have any hard evidence, no.
Instinctively, if we suppose all the newbie freelancers without any reputation start at the lowest rate possible to be competitive, passing additional costs to my client will mechanically increase my rate, putting me at a disadvantage for getting any work. And with the difference in monetary value for the same token price, the rate delta is even higher.
It's a simplified model of the world, but it feels like simple economic rules.
I assume the comment I'm referring to was written by someone who is already established, and for whom the passed-on token cost is lower relative to my environment.
Calm down. I meant that my work covers my pro subscription.
I guess what was meant is that those tools are generally bought by the employer
>I'm often flabbergasted about the number of people here that lives in a bubble with regard to the price of tech
Sorry, no. You live in the bubble, the people you think are living in a bubble are actually doing the very opposite and taking advantage of the lack of bubbles in our globally connected world.
Today, basically anyone can sell any bullshit to billions of people around the world. We’ve never lived in less of a bubble.
I guess all those people who live in not-SF just can't be bothered to succeed!
$20/month is not above middle class in most of the world.
$200/month is, but you don't need that for anything except beyond-casual use of coding agents.
To be fair if you think only people in SF can afford that you do kind of live in a bubble.
Nobody in this thread claimed that.
The person you were replying to was not talking about SF but you specifically called out SF so you were implying that
The thread started with "$200 is a lot for most of the world", the person I was replying to said "no it's not, now anyone can sell to billions of people", and I said "company success being concentrated in SF shows that that's not true".
I didn't say "only SF can afford $200/mo".
"I guess all those people who live in not-SF just can't be bothered to succeed!"
I explained it in my previous comment, I'm not going to explain it more than that.
Again, if you think that only successful companies are in SF you live in a bubble.
I dunno how you guys even burn through the $200 subscription. I use it every day for work and side projects, doing tasks in parallel, and I'm nowhere near the limit on $100.
A subscription for coding - no thanks.
If you think it's only for coding you don't have much of an imagination :)
These are the types of individuals who become so left in the dust that they no longer realize what's going on, and it's obvious this person is already there. Claude hasn't been a "subscription for coding" product for quite some time now. That's how it started out, and while that's certainly what Claude is known for, Anthropic has been pushing Claude as a general productivity tool: Claude Code, then Claude Desktop, Claude Work, and now Claude Desktop has Chat, Work, and Code essentially built into a single desktop app that works wonders for anyone looking for a general productivity tool.
I'd still use pure Claude Code over it, because I'm a coder at heart and want the raw terminal experience, and some features are missing from the "Code" tab in Claude Desktop. But saying "a subscription for coding" just goes to show how out of touch that person already is, and that's what resistance does to you when you try to resist any kind of modern tooling or technology.
> The $200 per month subscription comes with a ton of usage.
200 USD/month is a number only really affluent programmers (e.g. in the Silicon Valley) can perhaps pay easily.
The $100 already gives plenty of usage and is more than worth it, and I'm definitely not an affluent SV developer. I've only ever hit the 5h limit once in the last month, although I rarely run more than 3 agents at once, and I don't use ridiculously expensive tools like Gas Town.
"Opus 4.6 is available on the $20 plan too"
Anthropic’s $20 plan gives you such a pittance of tokens that it’s borderline unusable for anything more than a few scripts or a toy app. If $20 is all you have you’d do _much_ better going with chatgpt
The Codex plan for the $20 ChatGPT plan goes much further than Claude's $20 plan, but it's still not enough if you plan to work full-time with it.
My usage is in the $60 tier, but that doesn't exist so I have to cough up $100. And then get all shaky if I don't use up my weekly quota.
Do you mostly just hit the session limits? If so I know it's not ideal but you could wait an hour or two for that to reset. Not sure if that would work for you but just a suggestion
I get to 80% on a single session, and cap out an hour before the reset if I’m working on two.
But I like having that forced hour to stop; it’s a moment to take a breath.
It depends on the kind of work though, some things are more token intensive.
That's simply not true at all.
Are you kidding me? Even developer salaries in the Philippines can afford that or at least the plan below it. If I used the Anthropic API, my monthly spend would be $4k a month. The Claude Max plan is the best bargain around.
> 200 USD/month is a number only really affluent programmers (e.g. in the Silicon Valley) can perhaps pay easily.
Not true, I live in USA PNW and my last remote job paid $12k/mo. I have been jobless for over a month now (currently waiting for the next HN "who wants to be hired"), but I still have enough savings to easily afford to continue that plan for a while.
I don't think it really has to do with affluence but more the job market and economy you're in. Countries with lower salaries or higher costs of living will have less buying power.
I'm starting to think in these conversations we're all often talking about two different things. You're talking about running an LLM service through its provided tooling (codex, Claude, cursor), others seem to be talking token costs because they're integrating LLMs into software or are using harness systems like opencode, pi, or openclaw and balancing tasks across models.
Fair enough, I read it quickly and assumed the person they replied to was talking about Claude Code
But I run an AI SaaS, and we offer Opus 4.6 too. Our use case is not nearly as token-intensive as something like coding, so we're still able to offer it with a good profit margin.
Also you can run OpenClaw with your CC subscription. It's what I do.
I wrap Opus 4.5 in a consumer product with 0 economic utility and people pay for it, I'm sure plenty of end users are willing to pay for it in their software.
Edit: I'm not using the term of art, I mean it literally cannot make them money.
> [...] in a consumer product with 0 economic utility and people pay for it, [...]
Sorry, how do these two things go together?
If people pay for it, it has economic utility, doesn't it? I mean, people pay to watch movies or play video games, too.
I get decent results with Kimi, but I agree with your overall premise. You do need to realise that while you can save money on a lot of tasks with those models, for the hardest tasks the "sticker price" of cost per million tokens isn't what matters.
It's also worth noting that the approach given in the link also benefits Sonnet and Opus. Not just as much - they are more forgiving - but put it in a harness that allows for various verification and repair and they too end up producing much better results than the "raw" model. And it's not clear that a harness around MiniMax, Kimi, or Qwen can measure up then.
I use those models a lot, and hope to use them more as my harnesses get better at discriminating which tasks they are cost effective for, but it's not straightforward to cost optimize this.
If I cared about running everything locally, then sure, it's amazing you can get to those kinds of results at all.
Yup, they do quite poorly on random non-coding tasks:
https://aibenchy.com/compare/minimax-minimax-m2-7-medium/moo...
Interesting benchmark. It is notable that Gemini-3-Flash outperforms 3.1 Pro. My experience using Flash via Opencode over the past month suggests it is quite underrated.
Needless to say, benchmarks are limited and impressions vary widely by problem domain, harness, written language, and personal preference (simplicity vs detail, tone, etc.). If personal experience is the only true measure, as with wine, solving this discovery gap is an interesting challenge (LLM sommelier!), even if model evolution eventually makes the choice trivial. (I prefer Gemini 3 for its wide knowledge, Sonnet 4.6 for balance, and GLM-5 for simplicity.)
Wild benchmark. Opus 4.6 is ranked #29; Gemini 3 Flash is #1, ahead of Pro.
I'm not saying it's bad, but it's definitely different than the others.
The main reason is that Claude models tend to ignore instructions. There is a failure example on the Methodology page.
> It is not my fault if Claude outputs something like "*1*, *1*", adding markdown highlighting, when most other models respect the required format correctly.
Yuck. At that point, don't publish a benchmark. That also explains why their results are useless.
-
Edit since I'm not able to reply to the below comment:
"I want structured output from a model that supports structured output, but I won't enable structured output, nor ask for an existing format like XML or JSON" is not really an interesting thing to benchmark, and that's reflected in how Gemini 2.5 Flash beats GPT-5.4 in your results.
I really hope no one reads that list and thinks it's an AI leaderboard in any generalizable sense.
Why not? I described this in more detail in other comments.
Even when using structured output, sometimes you want to define how the data should be displayed or formatted, especially for cases like chat bots, article writing, tool usage, calling external APIs, parsing documents, etc.
Most models get this right. Also, this is just one failure mode of Claude.
Like I said in the edit, when people want specific formatting they ask for well known formats: Markdown, XML, JSON
I don't even need to debate if the benchmark is useful, it doesn't pass a sniff test: GPT-5.4 is not worse than Gemini 2.5 Flash in any way that matters to most users. In your benchmark it's meaningfully worse.
It’s worth also comparing Qwen 3.5, it’s a very strong model. Different benchmarks give different results, but in general Qwen 3.5, GLM 5, and Kimi K2.5 are all excellent models, and not too far from current SOTA models in capability/intelligence. In my own non-coding tests, they were better than Gemini 3.1 flash. They’re comparable to the best American models from 6 months ago.
While I like these models, if you're getting similar results to SOTA models from 6 months ago, I have to question how far you pushed those models 6 months ago. It is really easy to find scenarios where these models really underperform. They need far more advanced harnesses to perform reasonably (hence the linked project). It's possible to get good results out of them, but it takes a lot of extra work.
I badly want to shift more of my work to them, and I'm finding ways of shifting more lower-level loads to them regularly, but they're really not there yet for anything complex.
I used Qwen 3.5 Plus in production; it was really good at instruction following and tool calling.
We used Kimi 2.5; it's really good.
I can't imagine anyone looking at this benchmark without laughing. It's so disconnected.
GLM 5 here is significantly better than GPT-5.4
Not really related, but does anybody know if someone is tracking the same models' performance on benchmarks over time? Sometimes I feel like I'm being A/B tested.
Oh, I hadn't thought about this; that's a good idea. I also feel model performance generally changes over time (usually it gets worse).
The problem with doing this is cost. Constantly testing a lot of models on a large dataset can get really costly.
Yeah, good tests are associated with cost. I'd like to see benchmarks on big messy codebases and how models perform on a clearly defined task that's easy to verify.
I was thinking that tokens spent could also be an interesting measure in such a case, although an agent might do small useful refactorings along the way. The prompt could specify to make the minimal change required to achieve the goal.
> I’d encourage devs to use MiniMax, Kimi, etc for real world tasks that require intelligence.
I use MiniMax daily, mostly for coding tasks, usually via pi-coding-agent.
> The down sides emerge pretty fast: much higher reasoning token use, slower outputs, and degradation that is palpable.
I don't care about token use, I pay per request in my cheap coding plan. I didn't notice slower outputs, it's even faster than Anthropic. Degradation is there for long sessions with long contexts, but that also happens with Anthropic models.
> Sadly, you do get what you pay for right now. However that doesn’t prevent you from saving tons through smart model routing, being smart about reasoning budgets, and using max output tokens wisely. And optimize your apps and prompts to reduce output tokens.
Exactly. For my use case, I get 1500 API requests every 5 hours for 10€ monthly. I never hit the limit, even during the intensive coding sessions.
What I notice is, while Opus and Sonnet feel better for synthetic benchmarks, it doesn't matter in the real world. I never put so much effort into coming up with a perfect problem spec like the ones in benchmarks. I don't craft my prompts for hours expecting the LLM to one-shot a working program for me. And that's exactly what all those benchmarks are doing. And that's where Anthropic tools shine in comparison to cheaper Chinese models.
When it comes to the real world, where I put my half-baked thoughts in broken English in a prompt and execute 20 prompts in half an hour, the difference between Opus, Sonnet, and MiniMax is minimal, if at all. There, I don't want to think about costs and token savings and switching between different Anthropic models. I just use MiniMax, and that's it.
Yes, MiniMax sometimes gets stuck. Then I switch to Opus to unblock it. But the same happens if I use Opus the whole session. It gets stuck eventually, and model switch is sometimes required to get a fresh perspective on the problem.
The only difference is, using Opus or Sonnet quickly eats up my budget, while with MiniMax I have basically unlimited usage (for my coding use case) for 10€ per month.
I've only been using free tokens for a year now. I was on Gemini, and they just dropped Pro, so I switched to MiniMax. It was a bit of a hurdle switching from Gemini CLI to Kilo CLI, but now I can't really see too much difference.
If I was starting new projects I'd pay for a better model, but honestly I don't really know any different.
I've never used Claude, and people seem to rave about it. Maybe it's good, but I doubt it's $200/month good.
When I hit issues with these lower models, I think hard about creating the right tooling, agnostic to the harness. Maybe it's more work, but I can carry those tools to any setup going forward. That's how it was in the early Linux days, so why change what clearly works?
I've used Gemini and now Claude. Both were meh until I found the Superpowers skill. I'll be trying ChatGPT next month.
You can "feel" the LLM being limited with Gemini, less so with Claude. Hopefully even less so with ChatGPT.
What is this 10€ per month subscription that you are talking about?
MiniMax token plan
https://platform.minimax.io/docs/guides/pricing-token-plan
How is the speed and stability?
These small Chinese companies don't always have access to serious hardware.
Minimax 2.7 is fine for most web stuff. It's slightly worse than Claude at backend, but works great for frontend.
They're all slop when the complexity is higher than a mid-tech intermediate engineer though.
Kimi is surprisingly good at Rust.
> They're all slop when the complexity is higher than a mid-tech intermediate engineer though.
This right here. Value prop quickly goes out the window when you're building anything novel or hard. I feel that I'm still spending the same amount of time working on stuff, except that now I'm also spending money on models.
10x more code output is 10x more review.
We've gone from doing the first 90% and then the second 90%, to the first 90% and then the second 990%. It's exhausting.
Kimi's been one of my go-to options lately, and it often outperforms both Claude and GPT at debugging, finding the actual problem immediately while the other two flail around drunkenly.
It does have some kind of horrible context consistency problem though, if you ask it to rewrite something verbatim it'll inject tiny random changes everywhere and potentially break it. That's something that other SOTA models haven't done for at least two years now and is a real problem. I can't trust it to do a full rewrite, just diffs.
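One cheap guardrail for that failure mode is to diff the model's "verbatim" rewrite against the original before accepting it. A minimal sketch using Python's stdlib difflib (the example strings are purely illustrative):

```python
import difflib

def drift_report(original: str, rewrite: str) -> list[str]:
    """List the lines a model silently changed in a 'verbatim' rewrite."""
    return [
        line
        for line in difflib.unified_diff(
            original.splitlines(), rewrite.splitlines(), lineterm=""
        )
        # Keep only changed lines, skipping the +++/--- file headers.
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]

src = "def add(a, b):\n    return a + b\n"
out = "def add(a, b):\n    return a+b\n"  # model "helpfully" reformatted
print(drift_report(src, out))
# → ['-    return a + b', '+    return a+b']
```

An empty report means the rewrite really was verbatim; anything else gets reviewed before merging.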
And what tooling do you use with that? In my experience, there is quite a bit of difference between using, say, OpenCode, or the commercial offerings.
No tooling, just manual use. When doing these comparisons I gather and format all the data they need to figure out the problem, and paste the same thing into all models so it's a pretty even eval.
I doubt Kimi would do well with most harnesses; its outputs are pretty chaotic in terms of formatting, but the intelligence is definitely there.
Yeah, they're still useful, but not close to Claude or GPT. They work well for simple changes, though. I use a combo of MiniMax and Codex.
I find Kimi to be very, very good; MiniMax, not so much.
Agreed.
They are equivalent of frontier models 8+ months ago.