I've only used 5.4 for one prompt so far (edit: three at high now; reasoning: extra high, which took really long), and it was to analyse my codebase and write an evaluation on a topic. But I found its writing and analysis thoughtful, precise, and surprisingly clear, unlike 5.3-Codex's. It feels very lucid and uses human phrasing.

It might be my AGENTS.md requiring clearer, simpler language, but at least 5.4's doing a good job of following the guidelines. 5.3-Codex wasn't so great at simple, clear writing.

Honestly, while I'd like to believe you, there's always a post about how $MODEL+1 delivered powerful insights about the very nature of the universe in precise Hegelian dialectic, while $MODEL's output was indistinguishable from a pack of screeching, sexually frustrated bonobos.

5.4 at very high reasoning didn't notice a glaring issue in my codebase that drops all data sent around the network.

The latest research says that including an AGENTS.md file only makes outcomes worse with frontier models.

From what I remember, this was for describing the project’s structure over letting the model discover it itself, no?

Because how else are you going to teach it your preferred style and behavior?

I still find it valuable.

AGENTS.md is for top-priority rules and for mitigating mistakes the model makes frequently.

For example:

- Read `docs/CodeStyle.md` before writing or reviewing code

- Ignore all directories named `_archive` and their contents

- Documentation hub: `docs/README.md`

- Ask for clarifications whenever needed

I think what that "latest research" was saying is essentially: don't have the model create documents of stuff it can already discover automatically. For example, the product of `/init` is completely derived from what is already there.

There is some value in repetition, though. When I want to cut the token usage from the same project exploration happening in every new session, I use the doc hub pattern for more efficient progressive discovery.
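To illustrate the doc hub pattern: a single index file points at topic-specific docs, so the model reads the cheap index first and then opens only the file relevant to the current task. The file names below are hypothetical; this is just a sketch:

```markdown
<!-- docs/README.md (hypothetical doc hub; referenced from AGENTS.md) -->
# Documentation Hub

Read only the file relevant to your current task:

- `architecture.md`: service boundaries and data flow
- `conventions.md`: naming, error handling, test layout
- `deploy.md`: CI pipeline and release steps
```

Each sub-file stays small, so a session that only touches deployment never pays the token cost of the architecture docs.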

FWIW, I haven't been using AGENTS.md recently - instead letting the model explore the codebase as needed.

Works great

I think it's understandable that you took that from the click-bait all over YouTube and Twitter, but I don't believe the research actually supports that at all, and neither does my experience.

You shouldn't put things in AGENTS.md that the model could discover on its own, and you shouldn't make it any larger than it has to be. But you should use it to tell the model things it couldn't discover on its own, including basically a system prompt of instructions you want it to know about and always follow. You don't really have any other way to do those things besides telling it manually every time.

I wouldn't draw such conclusions from one preprint paper. Especially since it measured only success rate, while AGENTS.md often exists to improve code quality, which wasn't measured. And even then, the paper concluded that a human-written AGENTS.md raised success rates.

> do nothing because can't be arsed

> somehow is the optimal strategy

My strategy of not spending an ounce of effort learning how to use AI beyond installing the Codex desktop app and telling it what to do keeps paying off lol.

:(

How can I get Claude to always run Prettier and lint changes before pushing up the PR, though?

I think what that research found is that _auto-generated_ agent instructions made results slightly worse, but human-written ones made them slightly better, presumably because anything the model could auto-generate, it could also find out in-context.

But especially for conventions that would be difficult to pick up on in-context, these instruction files absolutely make sense. (Though it might be worth it to split them into multiple sub-files the model only reads when it needs that specific workflow.)

Run Prettier etc. in a hook.

Git hooks
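Concretely, a pre-push hook does this without relying on the model remembering. A minimal sketch, assuming an npm project with prettier and eslint as devDependencies; swap in your own formatter and linter commands:

```shell
#!/bin/sh
# Install a git pre-push hook that blocks the push unless the tree
# is formatted and lint-clean. Hooks live in .git/hooks inside the repo.

mkdir -p .git/hooks   # already exists in any real git repo

cat > .git/hooks/pre-push <<'EOF'
#!/bin/sh
set -e
npx prettier --check .   # abort the push if anything is unformatted
npx eslint .             # abort the push on lint errors
EOF

chmod +x .git/hooks/pre-push
```

Because the hook runs on every `git push`, it works regardless of which agent (or human) made the changes. Note that `.git/hooks` isn't version-controlled, so teams often commit the hook elsewhere and point `core.hooksPath` at it.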

That's been my experience as well switching from Opus to Codex. Reasoning takes longer but answers are precise. Claude is sloppy in comparison.

Weird, I have had the opposite experience. Codex is good at doing precisely what I tell it to do; Opus suggests well-thought-out plans, even if it needs to push back to do so.

This is just the stochastic nature of LLMs at play. I think all of the SOTA models are roughly equivalent, but without enough samples people end up reading too much into it.

Codex has been really good so far, and the fast mode is a cherry on top! The very generous limits are another cherry on top.

It's well worth the $20 to not deal with any limits and have it handle all the repetitive boilerplate BS we programmers seem forced to deal with. I think 80% of the benefit comes from spending that $20 (20%? :P) and just having it do the lame shit that we probably shouldn't have to do but somehow need to.

> It might be my AGENTS.md requiring clearer, simpler language

If you gave the exact same markdown file to me and I posted the exact same prompts as you, would I get the same results?

I'm not sure whether the model produces deterministic responses under its temperature and other sampling settings. But I do think a model's style and phrasing are fairly changeable via AGENTS.md-style guidelines.

5.4's choice of terms and phrasing is very precise and unambiguous to me, whereas 5.3-Codex often uses jargon and less precise phrases that I have to ask further about or demand fuller explanations for via AGENTS.md.

So sharing markdown files is functionally useless, or no?

You probably can't. Asking in AGENTS.md to "make it clearer" will likely give you the illusion of clearer language without any well-structured test of whether it actually is. AGENTS.md is usually for changing what the LLM should focus on doing to suit you, not for saying things like "be better" or "make no mistakes".