Ok, I find it funny that people compare models and say things like "Opus 4.7 is SOTA and much better," but I have used GLM 5.1 (I assume this comes from them training on both Opus and Codex) for things Opus couldn't do and have seen it produce better code. I haven't tried the Qwen Max series, but I have seen the local 122B model do smarter, more correct things based on docs than Opus. So yes, benchmarks are one thing, but reality is what the models actually do, and you should learn the real strengths that models possess. It is a tool in the end; you shouldn't say a hammer is better than a wrench even though both could drive a nail into a piece of wood.

GLM 5.1 was the model that made me feel like the Chinese models had truly caught up. I cancelled my Claude Max subscription and genuinely have not missed it at all.

Some people seem to agree and some don't, but I think that indicates we're just down to your specific domain and usage patterns rather than the SOTA models being objectively better like they clearly used to be.

It seems like people can't even agree which SOTA model is best at any given moment anymore, so yeah I think it's just subjective at this point.

Perhaps not even necessarily subjective, just performance is highly task-dependent and even variable within tasks. People get objectively different experiences, and assume one or another is better, but it's basically random.

Unless you're looking at something like a pass@100 benchmark, the benchmarks are confounded heavily by a likelihood of a "golden path" retrieval within their capabilities. This is on top of uncertainties like how well your task within a domain maps to the relevant test sets, as well as factors like context fullness and context complexity (heavy list of relevant complex instructions can weigh on capabilities in different ways than e.g. having a history where there's prior unrelated tasks still in context).

The best tests are your own custom, personal-task-relevant standardized tests (which the best models can't saturate, so aim for less than a 70% pass rate in the best case).

All this is to say that most people are not doing the latter and their vibes are heavily confounded to the point of being mostly meaningless.

>just performance is highly task-dependent and even variable within tasks. People get objectively different experiences, and assume one or another is better, but it's basically random.

You are right that this is not exactly subjectivity, but I think for most people it feels like it. We don't have good benchmarks (imo), we read a lot about other people's experiences, and we have our own. I think certain models are going to be objectively better at certain tasks, it's just our ability to know which currently is impaired.

They might be converging somewhat. The ultimate limiting factor is training data. Eventually I think they will converge and then the competition will be on memory and compute efficiency, with the best being the smallest maximally capable model.

And the subjectivity is bidirectional.

People judge models on their outputs, but how you like to prompt has a tremendous impact on those outputs and explains why people have wildly different experiences with the same model.

AI is a complete commodity

One model can replace another at any given moment in time.

It's NOT a winner-takes-all industry

and hence none of the lofty valuations make sense.

the AI bubble burst will be epic and make us all poorer. Yay

Staying power is probably the most important factor, which is why I'm thinking Google eventually takes the crown.

I feel like it's Sonnet level for implementation, but not matching up to Opus for planning.

But I agree it's close enough that it's worth using heavily. I've not cancelled my Claude Max subscription, but I've added a z.ai subscription...

Hmm

Will try it out. Thanks for sharing!

What is your workflow? Do you use Cursor or another tool for code Gen?

I use Opencode, both directly and through Discord via a little bridge called Kimaki.

https://github.com/remorses/kimaki

The value in Claude Code is its harness. I've tried the desktop app and found it was absolutely terrible in comparison. Like, the very nature of it being a separate codebase is already enough to completely throw off its performance compared to the CLI. Nuts.

> The value in Claude Code is its harness

If this was the case then Anthropic would be in a very bad spot.

It's not, which is why people got so mad about being forced to use it rather than better third party harnesses.

Pi is better than CC as a harness in almost every respect.

Anthropic limiting Claude subs to Claude code is what pushed me away in the end because I wanted to keep using Pi.

Just sign up for an AWS account and use the Anthropic models through Bedrock which Pi can use.

API costs are really high compared to subs.

Then you aren't the target market.

What advantage are you saying this has compared to just directly going through the Anthropic provider? They are the same price.

Why use tricks to support a company that is hostile to your use case?

Can you enumerate why?

- Claude Code has repeatedly had enormous token wastage bugs. Its agent interactions are also inefficient. These are the cause of many of the reports of "single prompt blew through 5-hour quota" even though it's a reasonable prompt.

- It still lacks support for industry standards such as AGENTS.md

- Extremely limited customization

- Lots of bugs including often making it impossible to view pre-compaction messages inside Claude Code.

- Obvious one: can't easily switch between Claude and non-Claude models

- Resource usage

More than anything, I haven't found a single thing that Pi does worse. All of it is just straight up better or the same.

I thought the desktop app used the cli app in the background?

I have been using GLM-5.1 with pi.dev through Ollama Cloud for my personal projects and I am very happy with this setup. I use pi.dev with Claude Sonnet/Opus 4.6 at work. Claude Code is great, but the latest update has me compacting so much more frequently that I could not stand it. I don't miss MCP tool calling when I am using pi.dev; it uses APIs just fine. I actually think GLM-5.1 builds better websites than Claude Opus. For my personal projects I am building a full stack development platform, and GLM-5.1 is doing a fantastic job.

I'm using pi the same as you. However, I have an MCP I need to use and the popular extension for that support works fine for me.

Really liking pi and glm 5.1!

Why use ollama cloud versus like Openrouter?

The limits seem higher on Ollama Cloud to me than paying for API access. I don't have solid stats on that though. I have an OpenRouter account and the service I am creating is going to need to use that. I will have a better measuring stick then.

Recently it had great limits but this month I'm trying open router directly.

The only reason I'm stuck with Claude and Chatgpt is because of their tool calling. They do have some pretty useful features like skills etc. I've tried using qwen and deepseek but they can't even output documents. How are you guys handling documents and excels with these tools? I'd love to switch tbh.

> I've tried using qwen and deepseek but they can't even output documents

What agent harness did you use? Usually, "write_file", "shell_exec" or similar are two of the first tools you add to an agent harness, after read_file/list_files. If it doesn't have those tools, I'm not sure you could even call it an agent harness in the first place.
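To make that concrete, here's a minimal sketch of what those tools look like inside a harness. The tool names follow the ones mentioned above; the dispatch shape (`{"name": ..., "args": ...}`) is an assumption for illustration, since every provider's tool-call schema differs slightly:

```python
# Minimal sketch of the basic file/shell tools an agent harness exposes.
# The dispatch format here is hypothetical; real harnesses map provider-
# specific tool-call JSON onto functions like these.
import os
import subprocess

def list_files(path="."):
    return "\n".join(sorted(os.listdir(path)))

def read_file(path):
    with open(path, "r", encoding="utf-8") as f:
        return f.read()

def write_file(path, content):
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)
    return f"wrote {len(content)} bytes to {path}"

def shell_exec(command):
    result = subprocess.run(command, shell=True, capture_output=True,
                            text=True, timeout=60)
    return result.stdout + result.stderr

TOOLS = {"list_files": list_files, "read_file": read_file,
         "write_file": write_file, "shell_exec": shell_exec}

def dispatch(tool_call):
    """Route a model's tool call, e.g. {"name": "read_file", "args": {"path": "x"}}."""
    name, args = tool_call["name"], tool_call.get("args", {})
    return TOOLS[name](**args)
```

With write_file and shell_exec wired up like this, "outputting a document" is just a file write, which is why it's a harness question rather than a model question.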

Sorry for the confusion, I was actually talking about their Web based chat. Since most of my work is governance and docs, I just use their Web chats and they just refuse to output proper documents like Claude or Chatgpt do.

Aha... Well, I let Codex (Claude Code would work too) manage/troubleshoot .xlsx files too, seems to handle it just fine (it tends to un-archive them and browse the resulting XML files without issues), seen it do similar stuff for .app and .docx files too so maybe give that a try with other harnesses/models too, they might get it :)
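For anyone wondering why the un-archiving trick works: .docx/.xlsx/.pptx files are OOXML, i.e. plain zip archives of XML parts, so anything with shell or file access can open them. A self-contained sketch (the archive built here is a toy stand-in, not a real document):

```python
# .docx/.xlsx/.pptx are OOXML: zip archives of XML parts. A model with
# shell/file access can just unzip and read them. This builds a tiny
# .docx-like archive so the example runs standalone; the part names
# mirror real OOXML layout but the contents are placeholders.
import io
import zipfile

def list_ooxml_parts(data: bytes) -> list:
    """Return the sorted member names of an OOXML (zip) file."""
    with zipfile.ZipFile(io.BytesIO(data)) as z:
        return sorted(z.namelist())

# Build a minimal stand-in archive in memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("[Content_Types].xml", "<Types/>")
    z.writestr("word/document.xml", "<w:document>hello</w:document>")

print(list_ooxml_parts(buf.getvalue()))
# → ['[Content_Types].xml', 'word/document.xml']
```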

Yeah, it's just way easier to do via the web/mobile app but I'll give using it via the CLI a try. Thanks :)

You're not giving an AI command line access to your work computer? How do you expect to keep up? /s

You give it command line access in a VM...

I give it real Ubuntu, no VM, no Docker. As long as I don't ask it to organize files, it behaves. It has not screwed me over so far.

I only run it with --dangerously-skip-permissions. YOLO!

Godspeed

You mean a VM like the one that contains a 0day that can escape the sandbox that gets found every year at pwn2own?

Presumably you’re also using a browser to view this web page. There have also been vulnerabilities in that. You have to draw a line somewhere.

I run mine as a separate unprivileged user. (No VM.) Am I pwned?

Maybe, but the sort of 0days you're talking about aren't exploited in any meaningful way for almost all developers.

"Seatbelts don't save the life of everyone who gets into an accident, so why bother wearing one?"

You can make a harness fully functional with just the "shell_exec" tool if you give it access to a linux/unix environment + playwright cli.
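A rough sketch of that "one tool is enough" idea: the agent loop below exposes only shell_exec, and everything else (file edits, grep, git, browser automation via a playwright CLI) reduces to shell commands. `call_model` and its reply format are stand-ins for a real LLM API call, not any particular provider's interface:

```python
# Single-tool agent loop: the model's only capability is running shell
# commands. The call_model callable and its {"command": ...} / {"done": ...}
# reply shape are hypothetical placeholders for a real LLM call.
import subprocess

def shell_exec(command: str) -> str:
    r = subprocess.run(command, shell=True, capture_output=True,
                       text=True, timeout=120)
    return r.stdout + r.stderr

def agent_loop(task, call_model, max_steps=20):
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_model(history)      # {"command": "..."} or {"done": "..."}
        if "done" in reply:
            return reply["done"]
        output = shell_exec(reply["command"])
        history.append({"role": "tool", "content": output})
    return None
```

Reading a file becomes `cat path`, writing one becomes a heredoc, and browsing becomes a playwright invocation, at the cost of less structured outputs than dedicated tools give you.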

When was the last time you used Qwen models? Their 3.5 and 3.6 models are excellent with tool calling.

I gave it a try a few weeks ago tbh, I'll give it another shot tho. I mainly use their Web chats since that's easier to use and previously, qwen, deepseek, kimi, all were unable to output proper docx files or use skills.

Try loading the models up in a coding harness like Claude Code. There's a few docx skills listed on Vercel's skill index.

https://skills.sh/tfriedel/claude-office-skills/docx

Outputting docx files does not have much to do with model capability. It is about whether tool calling has been configured.

You can use GLM-5.1 with Claude Code directly. I use ccs, with GLM-5.1 set up as plan, but it goes via API key.

You can just use Cline in VSCode to get most of the tooling you need - it works with all models. Including Xiaomi's new Mimo with 1m context window and blazing fast speed. It's much cheaper than Claude's biggest plan and with much, much more quota.

Yep Claude Code CLI does A LOT (which is now confirmed even more)

I've been using qwen-code (the software, not to be confused with Qwen Code the service or Qwen Coder the model) which is a fork of gemini-cli and the tool use with Qwen models at least has been great.

qwen3.5 and qwen3.6 are both good at tool calling.

You can use both codex and Claude CLI with local models. I used codex with Gemma4 and it did pretty well. I did get one weird session where the model got confused and couldn't decide which tools actually existed in its inventory, but usually it could use tools just fine.

I don't find GLM 5.1 beating Opus personally, but I do think it is good enough to consider it part of the SOTA pack at this point. It feels like it needs more time and tokens to achieve things, but that's okay - it's so much cheaper per token.

If Qwen3.6-Max is up there as well, it will be very interesting.

I wonder why glm is viewed so positively.

Every time I try to build something with it, the output is worse than other models I use (Gemini, Claude), it takes longer to reach an answer and plenty of times it gets stuck in a loop.

I've been running Opus and GLM side by side for a couple of weeks now, and I've been impressed with GLM. I will absolutely agree that it's slow, but if you let it cook, it can be really impressive and absolutely on the level of Opus. Keep in mind, I don't really use AI to build entire services; I'm mostly using it to make small changes or help me find bugs, so the slowness doesn't bother me. Maybe if I set it to make a whole web app and it took 2 days, that would be different.

The big kicker for GLM for me is I can use it in Pi, or whatever harness I like. Even if it was _slightly_ below Opus, and even though it's slower, I prefer it. Maybe Mythos will change everything, but who knows.

> The big kicker for GLM for me is I can use it in Pi, or whatever harness I like.

Yes, but... isn't the same true for Opus and all the other models too?

Opus is about 7 times more expensive than GLM with API pricing. And since you can only use the Opus subscription plan in CC, you're essentially locked into API pricing for Pi and any other harness.

So you're either paying $1000's for Opus in Pi, or $30/month for GLM in Pi. If the results are mostly equivalent that's an easy choice for most of us.

Perhaps I'm being extremely daft: If the API is 7 times more expensive, then why is it $1000 vs $30? Or is there a GLM subscription one can use with Pi? Certainly not available in my (arguably outdated) Pi.

I'm not the OP, but it's the latter. I'm currently using the "Lite" GLM subscription with OpenCode, for example. I'm not using it very heavily, but I haven't come close to hitting the limits, whereas I burned through my weekly limits with Claude very regularly.

I am using GLM-5.1 in pi.dev through Ollama Cloud. I am able to get by on the $20 plan. I use it a lot and the reset is hourly for sessions and weekly overall. This is the first week I got close to the limit before reset at about 85% used. I am probably using it about 4 hours a day on average 6 or 7 days per week.

You can use GLM’s coding plan in Pi, just use the anthropic API instead of the OpenAI compatible one they give.

Or tell pi to add support for the coding plan directly. That gave me GLM-5.1 support in no time along with support for showing the remaining quota, etc, too.

It also compresses the context at around 100k tokens.

In case anyone is interested: https://github.com/sebastian/pi-extensions/tree/main/.pi/ext...

I have used GLM 4.7, 5 and 5.1 for about 3 months now via the OpenCode harness and I don't remember it ever being stuck in a loop.

You have to keep it below ~100,000 tokens, else it gets funny in the head.

I only use it for hobby projects though. I paid 3 EUR per month, but that plan is no longer available :( Not sure what I will choose at the end of the month. Maybe OpenCode Go.

EDIT: Ok, now I tried GLM for the first time in the morning CET, and it was... bad. The reasoning took 5 minutes for a very, very small .html file, going around in circles.

Evening CET experience for me is super smooth.

That's unfortunate. 70-80k tokens is roughly the point where I start wrapping up giving the agent the required context, even on small to medium-sized requests.

That would leave almost no tokens for actual work

GLM is the first open source model that actually worked for me, where I found the output ok.

And yes, sonnet/opus is better and what I use daily. But I wouldn’t be that upset if I had to drop down to GLM.

IDK about GLM, but GPT 5.4 Extra High has been great when I've used it in the VS Code Copilot extension. I see no actual reason Opus should consume 3x more quota than it, the way it does.

I think it offers a very good tradeoff of cost vs competency

4.7 is better, but it's also wildly expensive.

You're probably just holding it wrong.

The models test roughly equal on benchmarks, with generally small differences in their scores. So, it’s reasonable to choose the model based on other criteria. In my case, I’d switch to any vendor that had a decent plugin for JetBrains.

Opus 4.6 was incredible but Opus 4.7 is genuinely frustrating to me so far. It's really sharp but can be so lazy. It's constantly telling me that we should save this for tomorrow, that it's time for bed (in the middle of the day), and very often quite sloppy and bold in its action. These adjustments are getting old. The next crop of open models seems ready to practically replace the big ones as sharp orchestrator agents.

I have never seen a model be “lazy” before (I have seen them go for minimal change). I have been using the models through the api with various agents and no custom system prompt.

So I am curious, how do people get these lazy outputs?

Is it by having one of those custom system prompts that basically tells the model to be disrespectful?

Or is it free tier?

Cheap plans?

I have seen some people complain about a new tendency where it can suggest wrapping up the current task even though it isn't done yet. I haven't seen it myself though.

Usually this gets worse if you have a phrase like "wrap it up" earlier in the output, or if you're at a few hundred thousand tokens without compacting.

In both cases the fix is really simple, just compact.

[dead]

Qwen3-Coder produced much better Rust code (that utilized x86-64 vector intrinsics) a few months ago than Claude Opus or Google Gemini could. I was calling it from harnesses such as the Zed editor and the trae CLI.

I was very impressed.

I think Claude, in general, writes very lazy, poor-quality code, but it writes code that works in fewer iterations. This could be one of the reasons behind its popularity: it pushes toward the end faster, at all costs.

Every time codex reviews claude written rust, I can't explain it, but it almost feels like codex wants to scream at whoever wrote it.

Their latest, Qwen3.6 35B-A3B is quite capable, and fast and small enough I don't really feel constrained running it locally. Some of the others that I've run that seem reasonably good, like Gemma 4 31B and Qwen3.5 122B-A10B just feel a bit too slow, or OOM my system too often, or run up on cache limits so spend a lot of time re-processing history. But the latest Qwen3.6 is both quite strong, and lightweight enough that it feels usable on consumer hardware.

Codex is pretty good at Rust with x86 and arm intrinsics too, it replaced a bunch of hand written C/assembly code I was using. I will try Qwen and Kimi on this kind of task too.

Consider that SWE benchmarking is mainly done with Python code. That tells you something.

I tried GLM and Qwen last week for a day. Some issues they could solve, while some tasks that looked relatively easy on the surface they just could not solve after a few tries, tasks that Opus oneshotted this morning with the same prompt. It's a single example of course, but I really wanted to give it a fair try. All it had to do was create a sortable list in Magento admin. But on the other hand, GLM did oneshot a PhpStorm plugin.

Do you use Opus through the API or with subscription? Did you use OpenCode or Code?

Opus through Claude Code, the Chinese models through OpenCode Go, which seems like a great package to test them out.

If you showed me code from GLM 5.1, Opus 4.6, and Kimi K2.6, my ranking for best model would be highly random.

Not to mention that Opus costs orders of magnitude more money. These are VERY impressive, and usable.

FAANGS love to give away money to get people addicted to their platforms, and even they, the richest companies in the world, are throttling or reducing Opus usage for paying members, because even the money we pay them doesn't cover it.

Meanwhile, these are usable on local deployments! (and that's with the limited allowance our AI overlords afford us when it comes to choices for graphics cards too!)

I tried GLM5.1 last week after reading about it here. It was slow as molasses for routine tasks and I had to switch back to Claude. It also ran out of 5H credit limit faster than Claude.

If you view the "thinking" traces you can see why; it will go back and forth on potential solutions, writing full implementations in the thinking block then debating them, constantly circling back to points it raised earlier, and starting every other paragraph with "Actually…" or "But wait!"

I see this with Opus too.

Indeed. And that’s with Anthropic hiding reasoning traces, unlike these other providers.

> "Actually…" or "But wait!"

You’re absolutely right!

Jokes apart, I did notice GLM doing these back and forth loops.

I was watching Qwen3.6-35B-A3B (locally) doing the same dance yesterday. It eventually finished and had a reasonable answer, but it sure went back and forth on a bunch of things I had explicitly said not to do before coming to a conclusion. At least said conclusion was not any of the things I'd said not to do.

That is essentially what the reasoning reinforcement training does. It gets the model to say things that are more likely to result in the correct final answer. Everything it does in between doesn't necessarily need to be a valid argument for the answer. You can think of it as filling the context with whatever is needed to make the right answer come out next. Valid arguments obviously help, but so might expressions of incorrect things that are not obviously untrue to the model until it sees them written out. The What's The Magic Word paper shows how far that could go: if the policy model managed to learn enough magic words, it would be theoretically possible to end up with an LLM that spouts utter gibberish until delivering the correct answer seemingly out of the blue.

That's pretty cool, thanks for the extra context! (pardon the... not even pun I guess)

Also, thanks for pointing me at that specific paper; I spend a lot more of my life closer to classical control theory than ML theory, so it's always neat to see the intersection of them. My unsubstantiated hypothesis is that controls & ML are going to start getting looked at more holistically, and not in the way I normally see it (which is "why worry about classical control theory, just solve the problem with RL"). Control theory is largely about steering dynamic systems along stable trajectories through state space... which is largely what iterative "fill in the next word" LLM models are doing. The intersection, I hope, will be interesting and add significant efficiency.

Z.ai’s cloud offering is poor, try it with a different provider.

could you add some context for why you think it's poor?

Benchmarking is grossly misleading. Claude’s subscription with Code would not score this high on the benchmarks because of how they lobotomized agentic coding.

>but I have seen the local 122b model do smarter more correct things based on docs than opus

Could you please share more about this

Maybe a bit misleading. I have used it in two places.

One is for local OpenCode coding and configuring stuff; the other is for agent browser use, and for both it did better than Opus 4.6 for the thing I was testing at the time. The problem with Opus at the moment I tried it was overthinking and sometimes moving itself in the wrong direction (not that Qwen doesn't overthink sometimes). However, sometimes less is more; maybe turning thinking down on Opus would have helped me. Some people have said it is better to turn it off entirely when you start to implement code, as it already knows what it needs to do and doesn't need more distraction.

Another example is my Ghostty config: I learned from Qwen that it has theme support, while Opus would always just put the theme in the main file.

Many people have turned away from religion (which I can get behind), but have never removed the dogmatic thinking that lies at its root.

As so many things these days: It's a cult.

I've used Claude for many months now. Since February I see a stark decline in the work I do with it.

I've also tried to use it for GPU programming, which it absolutely sucks at, with Sonnet, Opus 4.5 and 4.6.

But if you share that sentiment, it's always a "You're just holding it wrong" or "The next model will surely solve this"

For me it's just a tool, so I shrug.

> I've used Claude for many months now. Since February I see a stark decline in the work I do with it.

I find myself repeating the following pattern: I use an AI model to assist me with work, and after some time, I notice the quality doesn't justify the time investment. I decide to try a similar task with another provider. I try a few more tests, then decide to switch over for full time work, and it feels like it's awesome and doing a good job. A few months later, it feels like the model got worse.

I wonder about this. I see two obvious possibilities (if we ignore bias):

1. The models are purposefully nerfed, before the release of the next model, similar to how Apple allegedly nerfed their older phones when the next model was out.

2. You are relying more and more on the models and using your talent less and less. What you are observing is the ratio of your work vs. the model's leaning more and more toward the model's. When a new model is released, it produces better-quality code than before, so the work improves with it, but your talent keeps deteriorating at a constant rate.

I definitely find your last point is true for me. The more work I am doing with AI the more I am expecting it to do, similar to how you can expect more over time from a junior you are delegating to and training. However the model isn't learning or improving the same way, so your trust is quickly broken.

As you note, the developer's input is still driving the model quite a bit so if the developer is contributing less and less as they trust more, the results would get worse.

> However the model isn't learning or improving the same way, so your trust is quickly broken.

One other failure mode that I've seen in my own work while I've been learning: the things that you put into AGENTS.md/CLAUDE.md/local "memories" can improve performance or degrade performance, depending on the instructions. And unless you're actively quantitatively reviewing and considering when performance is improving or degrading, you probably won't pick up that two sentences that you added to CLAUDE.md two weeks ago are why things seem to have suddenly gotten worse.

> similar to how you can expect more over time from a junior you are delegating to and training

That's the really interesting bit. Both Claude and Codex have learned some of my preferences by me explicitly saying things like "Do not use emojis to indicate task completion in our plan files, stick to ASCII text only". But when you accidentally "teach" them something that has a negative impact on performance, they're not very likely to push back, unlike a junior engineer who will either ignore your dumb instruction or hopefully bring it up.

> As you note, the developer's input is still driving the model quite a bit so if the developer is contributing less and less as they trust more, the results would get worse.

That is definitely a thing too. There have been a few times that I have "let my guard down" so to speak and haven't deeply considered the implications of every commit. Usually this hasn't been a big deal, but there have been a few really ugly architectural decisions that have made it through the gate and had to get cleaned up later. It's largely complacency, like you point out, as well as burnout trying to keep up with reviewing and really contemplating/grokking the large volume of code output that's possible with these tools.

Your version of the last point is a bit softer I think — parent was putting it down to “loss of talent” but yours captures the gaps vs natural human interaction patterns which seems more likely, especially on such short timescales.

I confusingly say both. First I say that the ratio of work coming from the model is increasing, and when I am clarifying I say “your talent keeps deteriorating”. You correctly point out these are distinct, and maybe this distinction is important, although I personally don't think so. The resulting code would be the same either way.

Personally I can see the case for both interpretation to be true at the same time, and maybe that is precisely why I confused them so eagerly in my initial post.

I don’t think the providers intentionally nerf the models to make the new one look better. It’s a matter of them being stingy with infrastructure, either by choice to increase profit and/or sheer lack of resources to keep n+1 models deployed in parallel without deprecating older ones when a new one is released.

I’d prefer providers to simply deprecate stuff faster, but then that would break other people’s existing workflows.

Point 2 is so true, I definitely find myself spending more time reading code vs writing it. LLMs can teach you a lot, but it's never the same as actually sitting down and doing it yourself.

I think it might have to do with how models work, and fundamental limits with them (yes, they're stochastic parrots, yes they confabulate).

Newer (past two years?) models have improved "in detail" - or as pragmatic tools - but they still don't deserve the anthropomorphism we subject them to because they appear to communicate like us (and therefore appear to think and reason, like us).

But the "holes" are painted over in contemporary models - via training, system prompts and various clever (useful!) techniques.

But I think this leads us to have great difficulty spotting the weak spots in a new, or slightly different model - but as we get to know each particular tool - each model - we get better at spotting the holes on that model.

Maybe it's poorly chosen variable names. A tendency to write plausible-looking, plausibly named e2e tests that turn out to not quite test what they appear to test at first glance. Maybe there's missing locking of resources or use of transactions in sequential code that appears sound, but ends up storing invalid data when one or several steps fail...

In happy cases current LLMs function like well-intentioned junior coders enthusiastically delivering features and fixing bugs.

But in the other cases, they are like pathologically lying sociopaths telling you anything you want to hear, just so you keep paying them money.

When you catch them lying, it feels a bit like a betrayal. But the parrot is just tapping the bell, so you'll keep feeding it peanuts.

I agree - the problem is it’s hard to see how people who say they’re using it effectively actually are using it, what they’re outputting, and making any sort of comparison on quality or maintainability or coherence.

In the same way, it’s hard to see how people who say they’re struggling are actually using it.

There’s truth somewhere in between “it’s the answer to everything” and “skill issue”. We know it’s overhyped. We know that it’s still useful to some extent, in many domains.

Well summarized.

We're also seeing that the people up top are using this to cull the herd.

What is it that is dogma-free? If one goes hardcore Pyrrhonism, doubting that there is anything except that something is currently doubting as this statement is processed, that is perfectly sound.

At some point there is a need to have faith in some stable-enough ground to be able to walk on.

Who controls that need for you?

All people think dogmatically. The only difference is what the ontological commitments and metaphysical foundations are. Take out God and people will fit politics, sports teams, tools, whatever in there. It's inescapable.

All people think dogmatically, but religion does not prevent people from acting dogmatically in politics, sports, etc. It just doesn't. It never did.

Under normal circumstances I'd consider this a nit and decline to pick it, but the number of evangelists out there arguing the equivalent of "cure your alcohol addiction with crystal meth!" is too damn high.

Allow me to introduce you to Buddhism

Elaborate. Buddhism is going to have the same epistemological issues as anything, since it's a human consciousness issue.

> since its a human consciousness issue

I'd encourage you to check it out for yourself. It's certainly possible to be a dogmatic Buddhist, but one of the foundational beliefs of Buddhism is that the type of dogmatic attachment you're describing is avoidable. It's not easy, but that's why you meditate.

Which one?

Zen

The Western Zen? In my experience it is downgraded from being a religion to being a system of practice which relieves it of the broader Mahayana cosmology. But I would suggest the dogma is less obvious but still there, often just somewhere else, such as in its own limitations, or in a philosophical container at a higher level such as scientism.

All Zen is about releasing those attachments. Granted it's pretty hard, because if you succeed, you're enlightened.

East, West, Religion, Practice… From a Zen perspective, you're just troubling your mind with binaries and conflict.

Ah and there is the dogma -- the otherness of the enlightened.

The binaries still functionally exist. I see a lot of value in reflective practices. At the same time it seems unlikely to me that the point of existing is to not trouble your mind.

There's a saying in Zen: if you meet the buddha on the road, kill him. The point being, the very exaltation of enlightenment is an impediment.

If Buddhism can be said to have a goal, it is to reduce suffering (including your own), so troubling your own mind is indeed something it can help with. The point of existence would be something interesting to meditate on. If you discover it, let us all know!

This dancing between positions is all very defensible and if the path is currently working for you, more power to you.

Dogma, like the binaries, still functionally exists, whatever the narrative. If you can’t admit that, that might also be something interesting to meditate on.

Say you have eliminated all suffering. How many versions of that world exist? How many of them are true, beautiful, and good? See how, in order to evaluate the success or failure of Buddhism, we have to move beyond “eliminate suffering” to a higher value standard?

Dogmatism is a spectrum and for too many people it's on the animal side of the scale.

I wonder to what degree it depends on how easy you find coding in general. I find for the early steps genAI is great to get the ball rolling, but rapidly it becomes more work to explain what it did wrong and how to fix it (and repeat until it does so) than to just fix the code myself.

Yes, this and also taste. What might be perfectly fine for one developer is an abomination for another who can spot the problems with it.

I think in every domain, the better you are the less useful you find AI.

[dead]