I think GPT writes code the best. How well will it write in version 5.6? It gives me chills.

Recently, I went head-to-head with GPT on nearly 2,000 lines of code, and GPT's solution was superior and faster. I even referenced multiple codebases on GitHub while trying, but they were incomparable to GPT.

So using GPT brings both fear and excitement.

The fear comes from realizing that this level of code is now the average for most people. The excitement comes from knowing that I can now study and learn at this level too.

I'm really looking forward to seeing how much more advanced the code will be with the upgrade to 5.6.

I am on the opposite camp. Open models are starting to perform better. GPT 5.5 keeps on messing things up.

On the contrary, pi + glm + DeepSeek… bliss.

Fable was a different kind of beast though. Rip.

Every time I use opus these days I go shut up... you are not fable.. Hard to imagine how just three days with it changed how I saw LLM use.

I really don't feel this way. Seemed pretty similar to me, noticeably better, but marginally. What am I missing?

It may depend on your specific workload. E.g. for regular webdev work Opus is more than adequate, for heavy duty data analysis, for experimental stuff and for complex systems it was night and day.

I had only a few places where I did spot a difference but that difference was significant and I can imagine where people would be amazed.

It's interesting, I tried a decent amount of "heavy duty data analysis", and found it pretty similar. But a lot of what I did was about it finding and cobbling together the right things from our existing library of domain specific tooling, which opus is already good at. But perhaps it would have impressed me more if it were starting from zero.

What kind of "experimental stuff and complex systems" did you try that it excelled at?

Nothing. It had marginal gains. People just romanticize it cause it's gone.

Yes, I've just come to the end of implementing all the planning I did while Fable was available. And nothing now comes close to creating plans that could be coded and just worked like it did.

On a large C codebase, Claude hallucinates constantly, and GPT 5.5 gets there with a lot of help, but still gets things wrong.

I'm reluctantly starting to feel grateful that I went camping right over the window that Fable was out.

Same.

Yeah, Opus/GPT need multiple rounds of reviews from each other to get to clean auto review. Fable was like, it is done and indeed… crickets in bot comments. ‘No issues’ galore.

I wonder if this will hold as other models with different biases achieve parity.

GPT-5.5 has been really hard to beat imho. I've spent $$$ on Opus, Deepseek v4 Pro and recently started to dogfood GLM-5.2 (which is not bad) but I cannot really trust any of them (almost blind) like I can trust GPT-5.5. It gives me tremendous confidence. I cannot say the same for any of the others I mentioned.

Ditto on GLM 5.2 + DeepSeek V4 Flash combo.

For most important work (complex, cross-domain inquiries etc.), I still rely on Codex GPT 5.5 though.

How are you running glm and deepseek? Local or hosted? If the latter, where do you run it?

OpenCode has a $10/mo sub that includes both of those

how much does your setup cost you? just curious

>> I am on the opposite camp. Open models are starting to perform better. GPT 5.5 keeps on messing things up.

I'm working in a 600k+ LoC codebase that has complex domain-specific logic and lots of moving parts. I find that Codex 5.5 is pretty good at surgical fixes, but does not go out of its way to explore and figure out what those surgical fixes might break. So I only use it to work on parts of the system that are pretty isolated from everything else so that risk of regression is small.

I'm trying not to be the "you're holding it wrong" guy, but ... have you just tried telling it to explore the codebase for things it might break?

Purely subjective, but I tend to prefer reading Opus 4.8 output over GPT 5.5 code, even when the latter can have a higher overall ceiling. The former is just a bit more convenient to review.

> I think GPT writes code the best. How well will it write in version 5.6? It gives me chills.

Heard this exact sentence multiple times a few months ago about Opus 4.6, then 4.7 and 4.8 were considered a disappointment and today people miss "the good old times of 4.6" (referring to a few weeks of February 2026).

Very fascinating to look at all of this unfolding.

Reading this thread makes me feel like I'm taking crazy pills. The folks on this train in my team do not produce anything significant that we can rely on or use. A lot of hollow prototypes that join the prototype graveyard and code that needs extra scrutiny on critical areas ultimately leading to taking longer.

It's a shame, they were smart and productive engineers. Now? I guess everyone is just all-in on the slot machine.

This split in what different people or groups get out of LLMs is pervasive and really interesting. In the beginning I was dismissive of those with bad experiences, with a "you are holding the tool wrong" smugness. But as I read more and more experiences, I see all combos, and I now know my initial knee-jerk conclusion was clearly wrong. There are newbie programmers getting good or bad results, as well as experienced developers getting either flip of the coin. I don't know what to conclude. I really want to know what lines explain these very different outcomes. Is it the types of problems being solved? The harnesses? The programming languages? FWIW, my experience has been that among my cohorts of mid to deeply experienced developers working in the domain of experimental physics, all have leveled up to various degrees after adopting Sonnet and Opus level LLMs using the Claude Code CLI in Python, C++ and web tech, from small-scale scripts up to multi-package novel system development, and in greenfield as well as incremental development and code maintenance.

I have seen plenty of greenfield projects go okay at first but never go the distance. These were mostly product software cases, where they were able to get something very professional looking very fast, but AI ultimately always misses the mark because it takes the median of what exists rather than the specific needs of the customer they're developing for. So they get a ton of features and few that were necessary, then developing it further and correcting it to the needs of the customer just makes a mess and regressions are frequent. This is my experience as well when it comes to being a consumer of software products: everything feels shittier and less reliable, though perhaps that's my emotion and bias coming out.

The last 20% of the software development cycle is always the hardest. Releasing, maintenance, usability, support. You know, having a real product. I don't see AI helping here at all, more the first 80%, which sadly is also the fun part.

When developing things that are novel, with designs specific to our use cases needing high throughput, the results are pretty dismal. AI can kind of get you there, but I've seen no advancement on this front with new models. At the end of each attempt we've always realized we should have done things by hand. Having people with intense knowledge of the system frequently comes from building it and troubleshooting it; I don't think serious engineering orgs have escaped this inevitability.

In cases where we have legacy software, AI has helped with understanding shit code and design, but it has been woefully bad at contributing to legacy software. Here be dragons for sure. It is super strange to me that these tools can seemingly easily diagnose but completely blunder the fix.

I could easily see there being gains, as you say, in fields where data wrangling becomes tedious (though the inherent error rate in AI outputs scares me if you're trying to get deterministic outputs from experiments... I digress).

The part I think this forum tends to forget, and the tech industry at large fails to even care about, is that we're still humans. There are many studies basically pointing out that the way the AI outputs information is bad for us. Instant gratification from anthropomorphised machines with a habit for sycophancy doesn't sound like a recipe for a healthy relationship with what everyone wants to claim is just a tool. AI providers know this is effective, as well as knowing that there is a gambling effect here. They care about making money, not a good product and they happily prey on our human weaknesses. That is what social media is now. They aren't good products anymore, they just promote addiction via engagement.

Sorry for the long response and cynicism, but that is just my anecdotal experience and perspectives. I can give sources to some of the objective claims if you want.

I'm skeptical about how much of a coding advance it will be.

Seems odd that their announcement has zero coding benchmarks, with the closest related thing being terminal bench.

Tracking model performance on Artificial Analysis makes me think these models are constantly optimized/tuned in some way or another. GPT 5.5 was scoring in the mid 60's when it was first released, now it's almost 10 points higher.

Maybe I'll know once I try it? Honestly, for small functions or methods, I don't think there's a huge difference between models. But the larger the code gets, the more noticeable the difference seems to be.

Personally, I think this kind of coding experience varies from person to person

Not the size of the function but its complexity.

sadly with all the labs benchmaxxing I feel like you just have to try the model for a while to really evaluate how good it is, especially for each individual use case

>zero coding benchmarks

"What gets measured gets managed"

They claim extreme performance on ExploitBench, which Mythos was touted as being incredible at. https://x.com/OpenAI/status/2070555278576439306

My guess is that it's same base model as 5.5, but with additional post-training to improve and benchmaxx on a few things like that.

If they really thought it was competitive with Mythos/Fable across the board, then why wouldn't they release a broader set of benchmarks, and why price it day 1 at 1/2 the cost of Fable?

>and why price it day 1 at 1/2 the cost of Fable?

Why would they price it the same as Fable if it doesn't cost the same as Fable?

That's half my point - Anthropic's remarks suggest that Fable is significantly bigger (hence more costly to run) than Opus, so it is priced accordingly, but GPT 5.6 being priced the same as 5.5 is one datapoint that suggests they are the same size.

On the graph, they are still slightly below Mythos. Maybe enough to not be prohibited by the US government?

I have long felt like "out of the box", I really dislike gpt's coding style. It seems really verbose and likely to write way too much error handling and wordy comments and worse at finding existing functionality to reuse rather than writing everything from scratch. This has been relatively easy to mitigate with prompting, but I still find it annoying.

YMMV I guess!

I think you could be right. I do use excessive error-handling code and verbose comments — that's true.

But most of my time is spent on delivery, and the biggest problem with delivery is that if a bug occurs during runtime, the client curses me out. So to me, GPT code feels meticulous.

Open source contributors might be different. Most of them write code after long periods of deliberation. They take their brightest ideas and put them into open source. Those pieces of code are probably the best answers those programmers can give.

But for someone like me, who works primarily on delivery, we mostly plug in proven patterns and focus on getting things done. 'It works' and 'it's beautiful' are different terms, after all. In that sense, I highly value the meticulousness of GPT code — the very thing you called verbose. Because even if it's inefficient, at least it runs, and it catches and wraps around far more of the parts where things break.

Given a month, I could probably write code at GPT's level, at least to some degree. The problem is the difference between one hour and one month. At its core, AI code is still based on training data.

You don't want to handle errors in all the leaves of the system the way AIs have a tendency to, because you very rarely have the right context that deep in the stack to actually handle the error in an intelligent way. So what they end up doing (IMO) is actually hiding problems deep in the stack, in this effort to avoid a visible crash.
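To make the leaf-level point concrete, here is a minimal Python sketch. The function names (`parse_config_swallowing`, `parse_config_propagating`, `load_settings`) are hypothetical and made up for illustration; the only claim is the contrast between hiding a failure deep in the stack and letting it surface where there is enough context to decide what it means.

```python
import json


def parse_config_swallowing(raw: str) -> dict:
    """Leaf-level handling: hides the failure and returns a default."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # The caller can no longer tell "empty config" from "corrupt config".
        return {}


def parse_config_propagating(raw: str) -> dict:
    """Leaf does no handling: malformed input raises to the caller."""
    return json.loads(raw)


def load_settings(raw: str) -> dict:
    """Boundary with enough context to decide what a bad config means."""
    try:
        return parse_config_propagating(raw)
    except json.JSONDecodeError as exc:
        # Hypothetical policy: a corrupt config should stop startup loudly.
        raise RuntimeError(f"config is corrupt, refusing to start: {exc}") from exc
```

With the swallowing version, `parse_config_swallowing("{not json")` quietly returns `{}` and the program keeps running on defaults; with the propagating version, `load_settings("{not json")` fails at one well-defined boundary.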

I think it's very similar to the tendency to write too much from scratch and reuse too little, in both cases what is necessary is a broader view of how the whole system fits together, rather than only the specific method / file / module being written.

You don't dismiss me, so I'd like to respond to your comments within the bounds of my own knowledge, even though I can't compare to a programmer as skilled as you.

I don't think that's entirely wrong. But human code has the same problem, just in the opposite direction — because humans trust too much. The issue arises from the assumption that 'the other side will handle it.'

For example, good API design says you should only send as much data as needed, but in practice, programmers like me can't do that. Three different companies, all on the lowest bid, are trusting each other's domains, so in API design they lay out the entire dataset and tell the frontend to filter it. On the other hand, if you design a lean API layout with just what's truly needed and submit it, the frontend company gets angry. So what do I do then? I document everything precisely. I write down that I designed the API this way, but the other company did it that way, and I create documentation and error codes to shift the responsibility over to them, stating that they should handle the filtering on their side.

So while there are good programming practices and conventions, in reality we're under pressure from low bids and tight deadlines.

AI code doesn't have a full system map, so it's hard for it to decide how far to propagate errors and where to stop, but I think that part can just be pruned by AI anyway.

Usually in error handling, we use Result<T> a lot, right? For libraries or frameworks, Result<T, E> is common. You centralize your error policy and usually design programs with policy and error policies built in. You create an error policy table with about 7 or 8 types like ValidationError, NotFoundError, ExternalApiError, and within that, you only take responsibility for your own scope.
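A rough Python sketch of that idea, under stated assumptions: the `Ok`/`Err` types, the `ERROR_POLICY` table, and the status codes in it are all hypothetical stand-ins, and only three of the "7 or 8" error kinds are shown. It is one possible shape for a centralized error policy, not a reference implementation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Generic, TypeVar, Union

T = TypeVar("T")


class ErrorKind(Enum):
    VALIDATION = "ValidationError"
    NOT_FOUND = "NotFoundError"
    EXTERNAL_API = "ExternalApiError"


@dataclass(frozen=True)
class Ok(Generic[T]):
    value: T


@dataclass(frozen=True)
class Err:
    kind: ErrorKind
    message: str


Result = Union[Ok[T], Err]

# Central policy table: one place decides status code and retry behaviour.
# The concrete numbers are assumptions chosen for the example.
ERROR_POLICY = {
    ErrorKind.VALIDATION: {"status": 400, "retry": False},
    ErrorKind.NOT_FOUND: {"status": 404, "retry": False},
    ErrorKind.EXTERNAL_API: {"status": 502, "retry": True},
}


def find_user(users: dict, user_id: int) -> Result[str]:
    """Only reports errors that belong to this function's own scope."""
    if user_id <= 0:
        return Err(ErrorKind.VALIDATION, "user_id must be positive")
    if user_id not in users:
        return Err(ErrorKind.NOT_FOUND, f"no user {user_id}")
    return Ok(users[user_id])


def to_response(result) -> tuple:
    """Boundary: translate a Result into (status, body) via the policy table."""
    if isinstance(result, Ok):
        return 200, result.value
    return ERROR_POLICY[result.kind]["status"], result.message
```

For example, `to_response(find_user({1: "ada"}, 2))` looks up `NOT_FOUND` in the table instead of each call site inventing its own handling.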

At the design stage, if you have a clear initial vision, maybe it works. But in practice, the PM changes things mid-project. So in the end, your ideal code approach is correct in theory, but for practical delivery survival, the GPT approach ends up being more realistic. The reason is simple: you can't trust the other side at all, so you create evidence in case of contractual risk.

Because our domains are different. Programmers at service companies aim for long-term maintenance, so their domain boundaries are clear. But the companies that come to me are often in a dirty state where such clear domain separation is impossible. That's where I think the difference lies.

So while I understand your point, I suspect we are optimizing under very different constraints.

AI code usually creates fake 'cohesion.' It looks good on the surface. But in reality, it's often just optimized for the moment, weak to change. After reading Code Complete 4 or 5 times, I became obsessed with the idea that I need to balance cohesion and coupling. AI code has strong local cohesion, but when you look at the overall cohesion, it's weak.

True cohesion is usually about 'things that change for the same reason are grouped together.' But the fake cohesion that AI creates is usually this: 'Neatly organize the given requirements for now.'

On the surface, it just repeats obvious hexagonal or clean architecture patterns like Service, Manager, Handler, Validator, Repository. But the problem is that human code does the same thing. Honestly, I don't trust most people on HN who claim they're different. Even the enterprise code I've bought and the real big-company code I've seen don't have perfectly beautiful separation.

And that's natural. Modeling is always unstable. A single word from a PM saying 'we need to add a coupon' can break a beautifully designed domain.

AI often puts UserService and UserValidator into its structures, but in reality, the reasons for change aren't just one. They bundle multiple reasons together. There's just some flawed modeling.

But what matters is something else. It fits the 'current' input well. When you start digging deeper into the prompt, AI ends up turning the code into enterprise patterns based on the depth of meaning it parsed. And then this problem arises.

Human programmers usually don't have uniform code quality. Of course not. You and I are only deep in our own areas of expertise; outside of that, we're terribly shallow. But AI tries to fix other areas based on the deepest part. That results in verbose and cumbersome code. Small, elegant code becomes verbose, flat, and turns into the patterns we've all seen before.

But I don't think that's necessarily a bad thing. Why? Because realistically, I think it's better in the long run. The uniform enterprise patterns that AI produces are ultimately predictable and searchable.

Top-quality code deviates from the average. That makes it hard to predict. But that's not my level. So I think that when genius programmers contribute to the world through libraries and frameworks, people like me, who aren't talented, build things with them. And for that, predictable code is more than enough.

AI code is easier for AI to read and fix later. Human code is harder to predict. That makes it harder for me to maintain. In the garden of open source, people obsess over 'good code' quality, but for me, if I fail, the work stops coming.

The difference between you and me is that you're a better programmer than I am, and I'm just a beginner who's more indifferent to that performance gap. We value different things. And I think your perspective is more 'programmer-like.' I respect it.

Is it possible for you to provide examples? What were you trying to solve? What was your solution and why was GPT's solution superior and faster?

Not trying to be mean but it's likely the case that OP is not evaluating this properly, either due to a lack of skill or a lack of objectivity

> ... why was GPT's solution superior and faster?

Not saying that's the case with OP, but I've found folks sometimes just rationalize it so [0] as they're paying top dollar for it (especially when compared to maybe less capable but affordable models).

[0] https://en.wikipedia.org/wiki/Choice-supportive_bias

I haven't tried the latest Codex but I switched from GPT to Claude because I think Claude writes much better code. GPT's code ends up way more verbose/complex/overengineered than it needs to be.

> I even referenced multiple code bases on GitHub

Well, GPT referenced every GitHub code base, no wonder it won! :)

I prompted Codex 5.5 to one shot something where I wanted the design to have a pluggable decision module. I gave it a few examples of the kinds of inputs and actions I expected. I did not constrain it beyond that high level of what I wanted. The design it came up with was very good. Easily on par with what any senior engineer at big tech would. And cleanly decoupled in a way that would make future refactoring simple. I was damn impressed.

How do you judge what is a good or bad thing to learn from a LLM? So you don't have to unlearn the bad bits later

When I searched for papers on using LLMs, I found that typically, you can have an LLM generate code and then ask it to find GitHub projects similar to that code. Then you can learn by looking at the pull requests and seeing how they structure things. In the old days, if I wanted to understand why memory offsets, padding techniques, or data layout structures were written a certain way, I had to stare at a senior programmer's code all day or wait for them to reply. But LLMs, while they do flatter me, explain things at a level I can actually understand. And LLMs don't get annoyed.

There's a lot of tacit knowledge in programming.

- Why do you cut API boundaries this way?
- Why do you change the order of struct fields?
- Why do you deliberately insert padding?
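As one small illustration of the struct-field question, here is a hedged Python sketch using the standard `ctypes` module, which follows the platform's C layout rules. The struct and field names (`Interleaved`, `Reordered`, `flag_a`, `count`, `flag_b`) are made up for the example; exact sizes depend on the platform's alignment rules.

```python
import ctypes


class Interleaved(ctypes.Structure):
    # small, large, small: padding is inserted so the 4-byte field is aligned
    _fields_ = [
        ("flag_a", ctypes.c_uint8),
        ("count", ctypes.c_uint32),
        ("flag_b", ctypes.c_uint8),
    ]


class Reordered(ctypes.Structure):
    # largest field first, small fields grouped at the end
    _fields_ = [
        ("count", ctypes.c_uint32),
        ("flag_a", ctypes.c_uint8),
        ("flag_b", ctypes.c_uint8),
    ]


# Bytes of actual data, ignoring any padding.
payload_bytes = sum(
    ctypes.sizeof(t) for t in (ctypes.c_uint8, ctypes.c_uint32, ctypes.c_uint8)
)

print("payload:", payload_bytes)
print("interleaved:", ctypes.sizeof(Interleaved), "count at offset", Interleaved.count.offset)
print("reordered:", ctypes.sizeof(Reordered), "count at offset", Reordered.count.offset)
```

On common platforms the interleaved layout typically comes out larger than the reordered one for the same 6 bytes of data, which is the kind of thing a senior would know without being able to point you at a rule.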

Most of it depends on the background and context. Sometimes you add it, sometimes you don't. To understand this tacit knowledge, you need access to senior developers. But their attitude often depends on how promising the student is and what background they come from. With an LLM, on the other hand, you don't have to rely on the respondent's mood, authority, or availability.

Programming is fundamentally a field that requires seniors. In my case, I had no such seniors at all. I learned to code by buying codebases from failed companies and studying them. My first job didn't hire me as an employee—they hired me as the CEO of a subcontracting company (because that was structurally more advantageous for the contract). So I wasn't given the patience to learn programming fundamentals gradually. I had to pay penalties if I failed. Most of the projects I worked on were the kind where failure meant bankruptcy for me. Naturally, there was no one to teach me.

Most of my knowledge comes from reverse-engineering the code I purchased.

People say LLM code contains falsehoods, but commercially sold code has always had falsehoods too. Honestly, if we're just talking ratios, LLM code has fewer falsehoods.

In that sense, I still think it's a matter of context. If LLM code is false, was human code ever really true? LLMs do lie. They generate plenty of incorrect code. But humans do the same thing. If a problem comes up, you just look it up then and there. For me, LLMs and humans aren't all that different.

What do you think of modern open-source codebases presently available to the public? Is closed-source/proprietary code that much better?

Closed, proprietary code is way, way worse.

Good programmers are ashamed to push anything less than good (at least in their own opinion) to popular public repos. Some of those same pedantic programmers have no problem pushing crap in enterprise repos, and feel absolved because they are pushed to focus on deadlines and new features, and refactoring is very rarely planned for. I did and managed a lot of corporate software development in companies big and small, and did my fair bit of M&As and looked at codebases of successful companies. I don't ever recall feeling impressed. And I am regularly impressed by the aesthetic qualities of popular open source packages. I think commercial code is mostly shit, with the exception of regulated, serious industries (power, space, flight, etc.).

Open source is much better. Closed source is mostly considered 'done' as long as it just works.

One is a 'craft,' the other is 'survival for delivery.'

To elaborate a bit more: open source is about 'symbolic capital' — it's about building a reputation that says, 'I can write code at this level.'

Commercial closed source, on the other hand, is about 'I need to make money by writing this.'

Generally, open source projects tend to have less code written over time, especially when the contributors aren't depending on it for their livelihood. But with commercial closed source, it's not uncommon to have to write 60,000 lines of code per month.

On top of that, open source rarely has to deal with requirements changing dramatically mid-development. With closed source, requirements often shift from the original plan, and you end up compromising code quality just to meet those changing specs. As a result, if you're comparing purely in terms of logical completeness, open source tends to be better.

For example, singletons are rarely used in modern open source, but they're still pretty common in commercial code these days.

Codex 5.4/5.5 has been great for me as well compared to Claude Opus.

I've been mostly using it for Godot/GDScript code reviews, rubber ducking, and asking it for better ideas for naming stuff (one of the hardest problems in programming).

I still can't trust it for generating code for entire files/classes/projects, because it's still icky, creating unnecessary variables and functions, using multiple `if`s instead of `and` or `or`, but it's good enough for generating Mac/iOS apps for my personal use in SwiftUI because fuck trying to keep up with Apple's documentation, or even migrating ancient Visual Basic stuff I made as a kid up to SwiftUI :)
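For anyone unsure what the multiple-`if`s complaint looks like, here is a tiny sketch in Python rather than GDScript. The function names and the "can attack" rule are invented for the example; both versions are meant to behave identically, and the second is just what I'd expect a person to write.

```python
def can_attack_nested(has_weapon: bool, in_range: bool, stunned: bool) -> bool:
    # The style being complained about: one `if` per condition.
    if has_weapon:
        if in_range:
            if not stunned:
                return True
    return False


def can_attack_combined(has_weapon: bool, in_range: bool, stunned: bool) -> bool:
    # Same rule expressed once with `and` / `not`.
    return has_weapon and in_range and not stunned
```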

> So using GPT brings both fear and excitement.

Only excitement for me. I've never been more productive, not because I ask AI to make something for me, but it helps me make what I was already going to, but better and quicker.

AI, like any other tool, could help smart people be smarter and dumb people be dumber, kinda like Tolkien's Ring: You could be Sauron or you could be Bilbo or Frodo, or you could be Gollum :)

For me in game dev, Codex has a habit of checking every argument for null and then silently early exiting the method when it is. I have explicit instructions for it not to do this, but it still does. I haven't done any C# outside game dev but I have no idea why people would want their programs to silently fail.

And this is why having null in the type system is better.

Same; I explicitly added an instruction in AGENTS.md to tell it that sometimes it's better to crash if something crucial is missing at runtime, but it keeps insisting on checking for null references and other invalid values.
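The two behaviours being contrasted can be sketched like this in Python, with `None` standing in for null. `Player`, `apply_damage_silent` and `apply_damage_fail_fast` are hypothetical names; this is only meant to show the difference in behaviour, not what any model actually emits.

```python
class Player:
    def __init__(self, health: int) -> None:
        self.health = health


def apply_damage_silent(player, amount: int) -> None:
    # Silent early exit: a missing player is ignored and the bug stays hidden.
    if player is None:
        return
    player.health -= amount


def apply_damage_fail_fast(player, amount: int) -> None:
    # Fail fast: a missing player is a programming error, so report it loudly.
    if player is None:
        raise ValueError("apply_damage called without a player")
    player.health -= amount
```

Calling `apply_damage_silent(None, 5)` does nothing at all, so the missing reference only shows up later as "why didn't the enemy take damage?", whereas the fail-fast version points straight at the broken call.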

It's better if I don't let it generate code and just use it for reviewing my code.

No offense but have you considered the strong possibility that you’re just not good at what you do? I am occasionally pleased but mostly annoyed or disappointed… but never getting anything close to chills. That sounds downright weird.

You're not wrong. But programming isn't something only talented people do.

Really? That doesn't line up with this forum.

As a non-software engineer reading this forum it sounds like everyone is basically von Neumann working on Operator algebras and Lattice theory.

I assumed that is why the view of LLMs is so negative on here. While Claude seems kind of amazing to me I am not a genius working on Lattice theory like most people here.

Another strong possibility is that you might be working on something that’s not very prevalent in the training set.

Even the choice of programming language matters, e.g. Java or Javascript vs some niche one.

No offense but have you considered the strong possibility that you're just holding it wrong? You're entitled to your opinion, but OP is hardly the first person to say something like this and is surrounded by tons of folks saying the exact same thing. Just because it sounds weird to you, doesn't mean it's not true.

Everyone saying it is in the "not as good as they think they are" camp is the very obvious explanation.

Idk, all the great programmers I've come to respect are of the opinion that the code it outputs, while often useful, is not high quality. Likewise, all of the influencers and "thought leaders" I have seen on social media who I did not have a high opinion of previous to 2022, have all become AI influencers and make these kinds of claims. So while it's possible that the great programmers are not capable of using this tool effectively, I doubt that is the case, seeing as the mythical 10x productivity improvements have not materialised.

The tech has raised the floor not the ceiling.

Whether the latter happens remains to be seen.

That sounds accurate.

Or rather, they raised the perceived floor. IDK if we're seeing better output, but at least the illusion of output is stronger.

There are also differences in usage patterns, and differences in the quality of thinking. Not all programming revolves around Western open source

Well, I guess it's just a difference of opinion on who's right

By definition, 50% of developers are below average, so there are indeed "tons of folks" who are not very good at what they do.

That is not how averages work. By definition of mean, perhaps.

That is how a median is defined, not the mean.

Indeed. Most people have more arms than average, which must be 1.9 something.
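The arms example works out like this in a quick Python check. The population of 1,000 people with exactly one person missing an arm is an invented figure purely for the arithmetic.

```python
from statistics import mean, median

# Hypothetical population: 999 people with two arms, 1 person with one arm.
arms = [2] * 999 + [1]

avg = mean(arms)    # pulled just below 2 by the single outlier
mid = median(arms)  # the middle value, unaffected by the outlier

# Share of people strictly above the mean vs. strictly below the median.
above_mean = sum(1 for a in arms if a > avg) / len(arms)
below_median = sum(1 for a in arms if a < mid) / len(arms)

print(f"mean={avg}, median={mid}")
print(f"above mean: {above_mean:.1%}, below median: {below_median:.1%}")
```

So in a skewed distribution nearly everyone can sit above the mean, while by construction no more than half can sit below the median, which is the distinction the replies are making.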

"no offense..."

... then says offensive thing.