The light-switch moment for me was when I realized I can tell Claude to use linters instead of telling it to look for problems itself. The latter generally works, but having it call tools is way more efficient. I didn't even tell it what linters to use; I asked it for suggestions, it gave me about a dozen, I installed them, and it started using them without further instruction.

I had tried coding with ChatGPT a year or so ago and the effort needed to get anything useful out of it greatly exceeded any benefit, so I went into CC with low expectations, but I have been blown away.

As an extension of this idea: for some tasks, rather than asking Claude Code to do a thing, you can often get better results from asking Claude Code to write and run a script to do the thing.

Example: read this log file, extract XYZ from it, and show me a table of the results. Instead of having the agent read the whole log file into its context and try to process it with raw LLM attention, you can have it read a sample and then write a script to process the whole thing. This works particularly well when you want to do something with math, like compute a mean or a median. LLMs are bad at doing math on their own, and good at writing scripts to do math for them.
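
To make that concrete, here is roughly the kind of throwaway script the agent ends up writing for a request like that. The `latency_ms=` log format is invented for the example; the real script obviously depends on what your logs look like.

```python
#!/usr/bin/env python3
"""Extract latencies from a log file and report mean/median.

Assumes a hypothetical log format like:
    2024-05-01T12:00:00Z GET /api/users 200 latency_ms=37
"""
import re
import statistics
import sys

LATENCY_RE = re.compile(r"latency_ms=(\d+(?:\.\d+)?)")

def main(path: str) -> None:
    latencies = []
    with open(path) as f:
        for line in f:
            match = LATENCY_RE.search(line)
            if match:
                latencies.append(float(match.group(1)))
    if not latencies:
        sys.exit(f"no latency_ms entries found in {path}")
    print(f"count:  {len(latencies)}")
    print(f"mean:   {statistics.mean(latencies):.1f} ms")
    print(f"median: {statistics.median(latencies):.1f} ms")

if __name__ == "__main__":
    main(sys.argv[1])
```

The agent reads a handful of sample lines to figure out the format, writes and runs something like this, and only the short summary ends up in its context.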

A lot of interesting techniques become possible when you have an agent that can write quick scripts or CLI tools for you, on the fly, and run them as well.

It's a bit annoying that you have to tell it to do it, though. Humans (or at least programmers) "build the tools to solve the problem" so intuitively and automatically when the problem starts to "feel hard", that it doesn't often occur to the average programmer that LLMs don't think like this.

When you tell an LLM to check the code for errors, the LLM could simply "realize" that the problem is complex enough to warrant building [or finding+configuring] an appropriate tool to solve the problem, and so start doing that... but instead, even for the hardest problems, the LLM will try to brute-force a solution just by "staring at the code really hard."

(To quote a certain cartoon squirrel, "that trick never works!" And to paraphrase the LLM's predictable response, "this time for sure!")

As the other commenter said, these days Claude Code often does actually reach for a script on its own, or for simpler tasks it will do a bash incantation with grep and sed.

That is for tasks where a programmatic script solution is a good idea though. I don't think your example of "check the code for errors" really falls in that category - how would you write a script to do that? "Staring at the code really hard" to catch errors that could never have been caught with any static analysis tool is actually where an LLM really shines! Unless by "check for errors" you just meant "run a static analysis tool", in which case sure, it should run the linter or typechecker or whatever.

Running “the” existing configured linter (or what-have-you) is the easy problem. The interesting question is whether the LLM would decide of its own volition to add a linter to a project that doesn’t have one; and where the invoking user potentially doesn’t even know that linting is a thing, and certainly didn’t ask the LLM to do anything to the project workflow, only to solve the immediate problem of proving that a certain code file is syntactically valid / “not broken” / etc.

After all, when an immediate problem seems like it could come up again, “taking the opportunity” to solve it from now on by introducing workflow automation is what an experienced human engineer would likely do in such a situation (if they aren’t pressed for time).

I've had multiple cases where it would rather write a script to test a thing than actually add a damn unit test for it :)

> Humans (or at least programmers) "build the tools to solve the problem" so intuitively and automatically when the problem starts to "feel hard", that it doesn't often occur to the average programmer that LLMs don't think like this.

Hmm. My experience of "the average programmer" doesn't look like yours and looks more like the LLM :/

I'm constantly flabbergasted by how many devs fumble through digging into logs or extracting information or what have you, simply because it doesn't occur to them that tools can be composed together.

> Humans (or at least programmers) "build the tools to solve the problem" so intuitively and automatically

From my experience, only a few rare devs do this. Most will stick with the (broken/wrong) GUI tools made by others that they already know, out of convenience.

I have the opposite experience.

I used Claude to translate my application and asked it to translate every piece of text in the application to the best of its abilities.

That worked great for one view, but when I asked it to translate the rest of the application in the same fashion, it got lazy and started writing a script to substitute individual words instead of actually translating the sentences.

Cursor likes to create one-off scripts; yesterday it filled a folder with 10 of them before it figured out a bug. All the while I was thinking: will it remember to delete the scripts, or is it going to keep spamming me like that?

> It's a bit annoying that you have to tell it to do it, though.

https://www.youtube.com/watch?v=kBLkX2VaQs4

Cursor already does this for me all the time, though; maybe give it another shot. For refactoring tasks in particular, it uses regex to find interesting locations, and the other day, after maybe 10 rounds of slow "ok now let me update this file... ok now let me update this file...", it suddenly paused, looked at the pattern so far, and then decided to write a Python script to do the refactoring and executed it. For some reason it considered its work done even though the files didn't even pass the linters, but that's polish.
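
For the curious, those one-off refactor scripts tend to have this general shape. The `getUserData` to `fetchUserData` rename and the `src/` layout are made up for the sketch, not what Cursor actually generated:

```python
#!/usr/bin/env python3
"""One-off refactor: rename a function across a TypeScript tree.

The names and paths here are hypothetical; the point is the shape of
the script the agent writes instead of editing files one by one.
"""
import pathlib
import re

OLD, NEW = "getUserData", "fetchUserData"
# Word boundaries so partial matches like getUserDataSync are left alone.
PATTERN = re.compile(rf"\b{re.escape(OLD)}\b")

changed = 0
for path in pathlib.Path("src").rglob("*.ts"):
    text = path.read_text()
    new_text, count = PATTERN.subn(NEW, text)
    if count:
        path.write_text(new_text)
        changed += 1
        print(f"{path}: {count} replacement(s)")
print(f"touched {changed} file(s)")
```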

+1, Cursor and Claude Code do this automatically for me. Give them a big analysis task and they’ll write Python scripts to find the needles in the haystacks that I’m looking through.

Yeah, I had Cursor refactor a large TypeScript file today and it used a script to do it. I was impressed.

Codex is a lot better at this. It will even try this on its own sometimes. It also has much better sandboxing (which means it needs approvals far less often), which makes this much faster.

Same here. I have a SQLite db that I've let it look over and extract data from. I have it build the scripts, then I run them myself, since they would time out otherwise and I don't want Claude sitting and waiting for 30 minutes. So I do all the data investigations with Claude as an expert who can traverse the data much faster than me.
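
The scripts it writes for this kind of thing look something like the sketch below. The `orders` table and its columns are stand-ins for whatever your schema actually is; the point is that the heavy lifting happens in SQL and the model only ever sees the summary output:

```python
#!/usr/bin/env python3
"""Data pull the agent writes but I run by hand (it can take a while)."""
import csv
import sqlite3
import sys

# Hypothetical schema: an "orders" table with user_id, total_cents, created_at.
QUERY = """
SELECT user_id, COUNT(*) AS orders, SUM(total_cents) / 100.0 AS revenue
FROM orders
WHERE created_at >= ?
GROUP BY user_id
ORDER BY revenue DESC
LIMIT 100
"""

def main(db_path: str, since: str, out_path: str) -> None:
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(QUERY, (since,)).fetchall()
    finally:
        con.close()
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["user_id", "orders", "revenue"])
        writer.writerows(rows)
    print(f"wrote {len(rows)} rows to {out_path}")

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2], sys.argv[3])
```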

I've noticed Claude doing this for most tasks without even being asked to. Maybe a recent thing?

Yes. But not always. It's better if you add a line somewhere reminding it.

The lightbulb moment for me was to have it make me a smoke test and tell it to run the test and fix issues (with the code it generated) until it passes, then iterate over all the features in the Todo.md (that I asked it to make). Claude Code will go off and do stuff for, I dunno, hours? while I work on something else.
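
A smoke test in this sense can be tiny. The sketch below assumes a hypothetical HTTP service on localhost:8000 with a couple of placeholder endpoints; the loop is just "run it, fix whatever fails, rerun until it's green":

```python
#!/usr/bin/env python3
"""Minimal smoke test: exits nonzero if any core endpoint misbehaves.

The base URL and endpoints are placeholders; adjust to whatever the
Todo.md says the app should actually do.
"""
import json
import sys
import urllib.error
import urllib.request

BASE = "http://localhost:8000"

def get(path: str):
    try:
        with urllib.request.urlopen(BASE + path, timeout=5) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        return err.code, err.read()

def main() -> int:
    failures = []

    status, _ = get("/health")
    if status != 200:
        failures.append(f"/health returned {status}")

    status, body = get("/api/items")
    if status != 200:
        failures.append(f"/api/items returned {status}")
    else:
        try:
            json.loads(body)
        except ValueError:
            failures.append("/api/items did not return valid JSON")

    for failure in failures:
        print("FAIL:", failure)
    print("OK" if not failures else f"{len(failures)} failure(s)")
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(main())
```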

Hours? Not in my experience. It will do a handful of tasks, then say “Great! I’ve finished a block of tasks” and stop. And honestly, you’re gonna want to check its work periodically. You can’t even trust it to run linters and unit tests reliably. I’ve lost count of how many times it’s skipped pre-commit checks or committed code with failing tests because it just gives up.

I once had the Gemini CLI get into a loop of failures followed by self-flagellation, where it ended by saying something like "I'm sorry I have failed you, you should go and find someone capable of helping you."

I saw someone on X post a screenshot where Gemini got depressed after repeated failures, apologized, and actually uninstalled itself. Honorable seppuku.

Genius, I gotta try this.

I have a Just task that runs linters (ruff and pyright, in my case), formatter, tests and pre-commit hooks, and have Claude run it every time it thinks it's done with a change. It's good enough that when the checks pass, it's usually complete.
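
If you don't want to pull in Just, the same idea fits in a tiny runner script. This Python sketch assumes ruff, pyright, and pytest are on PATH; substitute whatever your project actually uses:

```python
#!/usr/bin/env python3
"""Run all checks in sequence; stop at the first failure.

One command for the agent to run after every change, with a single
pass/fail exit code. The tools listed here are assumptions; swap in yours.
"""
import subprocess
import sys

CHECKS = [
    ["ruff", "format", "."],          # formatter
    ["ruff", "check", "--fix", "."],  # linter, autofixing what it can
    ["pyright"],                      # type checker
    ["pytest", "-q"],                 # tests, kept quiet to save context
]

for cmd in CHECKS:
    print("$", " ".join(cmd))
    if subprocess.run(cmd).returncode != 0:
        sys.exit(f"FAILED: {' '.join(cmd)}")
print("all checks passed")
```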

(I code mostly in Go)

I have a `task build` command that runs linters, tests and builds the project. All the commands have verbosity tuned down to minimum to not waste context on useless crap.

Claude remembers to do it pretty well. I have it in my global CLAUDE.md, so I guess it has more weight? Dunno.

A tip for everyone doing this: pipe the linters' stdout to /dev/null to save on tokens.

Why? The agent needs the error messages from the linters to know what to do.

If you're running linters for formatting etc, just get the agent to run them on autocorrect and it doesn't need to know the status as urgently.

That's just one part of it. I want the LLM to see type checking errors, failing test outputs, etc.

Errors shouldn’t be on stdout ;)

“Errors” printed by your linter aren’t errors, they’re reports

This is the best way to approach it, but if I had a dollar for each time Claude ran `--no-verify` on the git commits it was making, I’d have tens of dollars.

It doesn’t matter if you tell it multiple times in CLAUDE.md not to skip checks; it will eventually just skip them so it can commit. It’s infuriating.

I hope that as CC evolves there is a better way to tell/force the model to do things like that (linters, formatters, unit/e2e tests, etc).

We should have a finish hook: when the AI decides it's done, run the hook, feed its output back to the LLM, and let it decide whether the problem is still there. (A sketch of that is below.)

Students don't get to choose whether to take the test, so why do we give AI the choice?
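
Claude Code has since grown a hooks mechanism that can be pointed at exactly this kind of gate; the sketch below is only the script half of the idea. The `just check` command is a stand-in for whatever your project's check task is, and the exit-code convention is my reading of the hook docs, so verify it before relying on it:

```python
#!/usr/bin/env python3
"""Finish-gate sketch: when the agent thinks it's done, run the checks
and feed any failure output back so it keeps working.

Wiring this up as a Claude Code Stop hook is left to the hook docs;
the exit codes below reflect my understanding at the time of writing.
"""
import subprocess
import sys

CHECK_CMD = ["just", "check"]  # hypothetical; use your own check command

result = subprocess.run(CHECK_CMD, capture_output=True, text=True)
if result.returncode != 0:
    # Surface the tail of the failure output where the agent will see it,
    # and signal "not done yet" (exit code 2 is what blocks a Stop hook).
    sys.stderr.write(result.stdout[-4000:] + result.stderr[-4000:])
    sys.exit(2)
sys.exit(0)
```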

I’ve found the same issue, and with Rust it also sometimes skips tests if it thinks they’re taking too long to compile, saying they’re unnecessary because it knows they’ll pass.

Even AI understands it's Friday. Just push to production and go home for the weekend.

A wrapper script?

How is this better than calling `cargo clippy` or similar commands yourself?

Claude can then proceed to fix the issues for you

Presumably `cargo clippy --fix` was the intention. Not all things are auto-fixable, though, which is what LLMs are reasonable for: the squishy, hard-to-autofix things.