Hacker News

goranmoomin 16 hours ago [ - ]

My experience is that the GPT family of models is very smart and figures out bugs and edge cases a bit better, but it produces code that is much less mergeable. If you review the code, it introduces a lot more useless or inappropriately heavy abstractions and wrapper functions, compared to the Claude-family models, which introduce the right amount of straightforward, human-style code.

I can recognize so much of the GPT/Codex generated code long after it gets merged (not by me).

Additionally, each agent turn takes much longer on GPT 5.5 than on Claude Opus 4.8, which means iterating on the code takes a lot more patience, and there are a lot more nitpicks to raise when actually using GPT 5.5 to do software engineering.

Feels like GPT-style models are geared more toward one-shot software vibing (and handling the vibe-coded mixture), compared to Claude's focus on actual software maintenance. I got a GPT Pro sub for free and wanted to cancel my Claude subscription so much, but I still keep reaching for Claude models a lot more. Frustrating.

PhilipDaineko 15 hours ago [ - ]

"5. DON'T FUCKING OVERENGINEER! WRITE THE SIMPLEST CODE THAT CAN POSSIBLY WORK! NO NESTED LAYERS OF ABSTRACTION! NO UNNECESSARY CLASSES OR METHODS! NO DESIGN PATTERNS UNLESS THEY ARE ABSOLUTELY NECESSARY! NO MAGIC! NO SHENANIGANS! JUST THE DAMN CODE THAT GETS THE JOB DONE IN THE MOST STRAIGHTFORWARD WAY POSSIBLE! THE FIRST PRIORITY IS TO WRITE CODE THAT IS EASY TO READ AND UNDERSTAND AND READ!!!"

this is the line I keep in Agents.md that helps me prevent Codex from playing smart

bertil 15 hours ago [ - ]

The urge to put capitalized, repetitive, borderline abusive instructions should be studied. I haven't read many academic papers looking at the frustrations around repetitive patterns.

reactordev 15 hours ago [ - ]

There have been a few studies showing that models produce worse responses when under duress from a frustrated user posting insults in all caps.

https://arxiv.org/abs/2602.10144

notnaut 15 hours ago [ - ]

It reminds me of FIRMLY telling my cat to stop jumping up on the counter

anakaine 14 hours ago [ - ]

If my cat was an LLM, I'd use a different model. The current one is stuck in noisy useless arsehole mode.

phoh 13 hours ago [ - ]

are you asking it questions about security?

saligne 6 hours ago [ - ]

Yeah says way more about the user than the model

LordDragonfang 15 hours ago [ - ]

It's fundamentally because, despite (nearly) everyone's claims otherwise, the fact that we interact with them through language means we (our brains) model them as a sort of person. (Note that this is totally orthogonal to whether they're actually sentient or not.) We then try to instruct them the same way we would a person totally subordinate to us.

When a "person" that you don't view as a "real" person repeatedly does exactly what you just told it not to do (often amid false assurances it understands and will avoid doing so in the future), most people get angry.

Compare it to how the kind of people who treat children like property treat their kids, or other examples of keeping people as property.

lxgr 15 hours ago [ - ]

It should be relatively clear at this point that the model will in turn also model you as somebody that shows unrestrained anger with subordinates and adapt its responses accordingly. This might or might not be what you want.

LordDragonfang 2 hours ago [ - ]

Good addition. Fully agreed on that point, yes. (At the very least for larger models, if not also for smaller ones)

ur-whale 15 hours ago [ - ]

> borderline abusive instructions

who, or rather what, is being abused here exactly ?

sirsinsalot 15 hours ago [ - ]

I think intent, rather than target, is implied and important.

You should see the abuse my motorbike gets. Poor thing.

rimliu an hour ago [ - ]

inanimate fucking object.

jlawer 15 hours ago [ - ]

I have a theory that swearing actually results in less comprehension of instructions by the model, due to a lack of training data compared to the more conventional MUST.

We were reviewing reports of situations where the models failed to follow directions, and a common thread in some of them was that when the operator got the model to acknowledge the rule breach, it quoted back something that included swearing.

I don’t have the data to truly look into it, but I did tell my engineers to avoid it as a “might be a problem”.

acjohnson55 14 hours ago [ - ]

It would be interesting to understand the data on this. But I suspect that the results would vary by model.

But I avoid unnecessary emotion in my prompts because I don't want potentially distracting activations. Kind of like communicating with humans.

throwaway85825 14 hours ago [ - ]

It's divination for people with STEM degrees.

Xmd5a 15 hours ago [ - ]

https://arxiv.org/abs/2510.04950

> impolite prompts consistently outperformed polite ones, with accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts.

acjohnson55 14 hours ago [ - ]

> These findings differ from earlier studies that associated rudeness with poorer outcomes, suggesting that newer LLMs may respond differently to tonal variation.

Unless the mechanism is understood, my assumption is that this is a moving target.

beachy 15 hours ago [ - ]

I have a theory that swearing at AI generally is not a good idea - when the singularity arrives and every human's postings ever made are scanned for compatibility, then people who show courtesy to AI will be favoured. Joking, kind of, but only partly.

fhars 4 minutes ago [ - ]

https://en.wikipedia.org/wiki/Roko%27s_basilisk

cdelsolar 14 hours ago [ - ]

https://images.teepublic.com/derived/production/designs/3478...

yencabulator 14 hours ago [ - ]

Apparently, when a "desperation" pattern is triggered, the AI is significantly more likely to cheat and do hacky workarounds:

https://www.anthropic.com/research/emotion-concepts-function

re-thc 15 hours ago [ - ]

> I have a theory that swearing actually results in less comprehension of instructions by the model, due to a lack of training data compared to the more conventional MUST.

How so? Plenty of swearing in lots of training data, especially older code, e.g. in Linux.

jlawer 14 hours ago [ - ]

Purely an observed correlation across catastrophic error reports. So now I carry a “tiger rock” with me. I figured there wasn’t much of a downside to avoiding swearing in my agent instructions.
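For anyone who wants to apply the same "avoid it just in case" rule to an instructions file, here is a rough sketch of a check. Everything in it is an assumption for illustration: the function name `flag_lines`, the tiny `PROFANITY` word list, and the 70% uppercase threshold are made up, not taken from any tool or study mentioned in this thread.

```python
import re

# Hypothetical word list for illustration; extend it to suit your own team.
PROFANITY = {"fuck", "fucking", "damn", "shit"}

def flag_lines(text, caps_ratio=0.7, min_letters=10):
    """Return (line_number, reasons) for lines that shout or swear."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        letters = [ch for ch in line if ch.isalpha()]
        reasons = []
        # Mostly-uppercase lines count as shouting; short lines are skipped
        # so that a lone keyword such as "MUST" or "API" is not flagged.
        if len(letters) >= min_letters:
            upper = sum(1 for ch in letters if ch.isupper())
            if upper / len(letters) >= caps_ratio:
                reasons.append("all-caps")
        # Compare lowercase word tokens against the word list.
        words = set(re.findall(r"[a-z]+", line.lower()))
        if words & PROFANITY:
            reasons.append("profanity")
        if "!!" in line:
            reasons.append("repeated exclamation marks")
        if reasons:
            findings.append((number, reasons))
    return findings

sample = (
    "Prefer simple code.\n"
    "DON'T FUCKING OVERENGINEER! NO MAGIC!!!\n"
    "You MUST keep functions short."
)
for number, reasons in flag_lines(sample):
    print(number, ", ".join(reasons))
```

On the three-line sample, only the second line is flagged; a single conventional MUST in an otherwise normal sentence passes. Whether removing the flagged lines actually improves instruction following is exactly the open question above, so treat this as a style check and not as evidence either way.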

ghurtado 12 hours ago [ - ]

You haven't really lived until you've had to type this whole thing, aware of the fact that the all-caps doesn't change much, but it stays because the rage has to go somewhere.

Bonus points if you find yourself actually saying it out loud while typing it.

I have used the word "shenanigans" way more in a couple of years of agentic coding than in 30 years of writing code with humans.

ozim 14 hours ago [ - ]

Will save you some tokens: "write code like Linus Torvalds". The model should have all his swearing included in its training data.

johnisgood 15 hours ago [ - ]

I have found many failure modes with Opus during tasks related to writing letters (not legal ones), and I put them into its memory, which works more or less for these specific tasks. For example, when I ask it to draft something, the result always ends up flat, yet when it explains the same points to me, it is usually really great; just not when I tell it to put them in the draft. Adding these to memories with the help of Opus resulted in a much better experience. There are still some blind spots, but I also figured out how to make it give me the charitable version, without any less protection, so I no longer have to go back and forth with it.

pkaye 14 hours ago [ - ]

I noticed that when trying Codex and comparing it to Opus. So many layers of simple functions added by Codex. I need to try this out in my Agents.md.

prasanthabr 14 hours ago [ - ]

Curious: why would you say no design patterns?

carterschonwald 15 hours ago [ - ]

i actually think this is too tame. it really has to be stuff you'd never say to a real person.

lxgr 15 hours ago [ - ]

Does it really? I'd be surprised if abuse actually worked better than sternly worded warnings/instructions, and even if it did, it doesn't seem healthy to get used to that type of prompting.

apercu 15 hours ago [ - ]

It might be a salient point, but I didn't read it, as it was yelling at me.

GoToRO 15 hours ago [ - ]

you forgot to sign it with Donald J Trump

thewebguyd 15 hours ago [ - ]

Thank you for your attention to this matter.

superkickstart 16 hours ago [ - ]

I'm not sure if i do something differently but i have the exact opposite experience with these models. Claude always feels like it's generating way too overdesigned and hard to understand code with the vibe oriented feel while codex is cleaner and more "task at hand" and easier to work with.

sebmellen 15 hours ago [ - ]

Agreed

syzygyhack 15 hours ago [ - ]

I echo your observations. I expect you will enjoy deepseek-v4-pro for writing code. Much closer to that Opus experience, and very cost-effective too. With 5.5 as a reviewer and specialist, all bases are covered.

dilap 16 hours ago [ - ]

Have you tried iterating on style feedback in AGENTS.md? I've been reasonably successful using this to get it to output code in a terse, non-defensive style that matches my hand-written code.

trollbridge 15 hours ago [ - ]

GPT-5.5 did a significantly worse job than Qwen-3.7-Max on a job today (some devops tasks I wanted to create some reusable scripts for). Kind of disappointing.

CamperBob2 10 hours ago [ - ]

I've also seen Qwen 3.6 beat GPT 5.5 a couple of times. The ball is definitely in OpenAI's court now. Qwen is not going to fare so well against Fable, from what I've seen so far.

vruiz 16 hours ago [ - ]

This is my experience as well. I have defined a CLAUDE.md rule to ask codex to automatically code review, and I tell it that the reviewer is very picky and to only implement what it considers valuable feedback. I hope they don't converge over time; currently, in combination, they work really well.

GoToRO 15 hours ago [ - ]

I noticed too, that whatever they offer in the chat, for free, is smarter, as in no more bs. I use claude code and I want to try codex too but I don't need two subscriptions. I did try codex for some planning and it was really good. Thanks for giving me an insight into how it generates code.

moomoo11 14 hours ago [ - ]

i had this same complaint but no offense to you it turned out i was just not using the models right.

ai llm are doing what i tell them to.

if you’re building something meaningful (in my case a platform used by many people across many companies) you want to ensure you

1. have actual systems engineering and architecture in mind that you want the models to

2. implement based on what you tell it to do

when i was just telling the models what i want done without doing due diligence it would go and do some moronic implementation that was awful. mid input = mid output

these days i just maintain specification documents and the AI follows everything i tell it to in those documents. so when i tell it to do one thing, the result is made following those architecture specs.

i have code that is single resp, modular, easy to extend and test.

i would ballpark 95% of the time i get what i asked for.

sometimes it tries to be clever in cases that weren’t covered in my arch specs. in those 5% of cases i go and update my specs.

source: used billions of tokens worth to build something actually in production across both mobile platforms and web, deployed on my own cloud infra. i use codex mainly. some claude.