I'm one of the regular code reviewers for Burn (a deep learning framework in Rust). I recently had to close a PR because the submitter's bug fix was clearly written entirely by an AI agent. The "fix" simply muted an error instead of addressing the root cause. This is exactly what AI tends to do when it can't identify the actual problem. The code was unnecessarily verbose and even included tests for muting the error. Based on the person's profile, I suspect their motivation was just to get a commit on their record. This is becoming a troubling trend with AI tools.

That's what I love about LLMs. You can spot it doesn't know the answer, tell it that it's wrong and it'll go, "You're absolutely right. Let me actually fix it"

It scares me how much code is being produced by people without enough experience to spot issues, or by people who have just stopped caring. We're going to be in for a wild ride when all the exploits start flowing.

My favorite LLM moment: I wrote some code, asked the LLM "Find any bugs or problems with this code", and of course what it did was hyperfocus on an out-of-date comment (that I didn't write). Since the problem identified in the comment no longer existed, the LLM just spat out about 100 lines of garbage to refactor the code.

> "You're absolutely right."

I admit a tendency to anthropomorphize the LLM and get irritated by this quirk of language, although it's not bad enough to prevent me from leveraging the LLM to its fullest.

The key when acknowledging fault is to show your sincerity through actual effort. For technical problems, that means demonstrating that you have worked to analyze the issue, take corrective action, and verify the solution.

But of course current LLMs are weak at understanding, so they can't pull that off. I wish the LLM could say, "I don't know", but apparently the current tech can't know that it doesn't know.

And so, as the LLM flails over and over, it shamelessly kisses ass and bullshits you about the work it's doing.

I figure that this quirk of LLMs will be minimized in the near future by tweaking the language to be slightly less obsequious. Improved modeling and acknowledging uncertainty will be a heavier lift.

It'd be nice if GitHub had a feature that automatically updated the issue with this context too, so that if this agent gives up and closes the PR, the next agent doesn't go and do the exact same thing.

> tell it that it's wrong and it'll go, "You're absolutely right. Let me actually fix it"

...and then it still doesn't actually fix it

So, I have recently done my first couple of heavily AI-augmented tasks for hobby projects.

I wrote a TON of LVGL code. The placement wasn't perfect, but after a couple of iterations it fixed almost all of the issues. The result is a little hacked together, but a bit better than my typical first pass at UI code. I think this saved me a factor of 10 in time. Next I am going to see how much of the cleanup and refactoring of the pile of code it can do.

Next I had it write a bunch of low level code to init hardware. It saved me a little time compared to reading the reference manual, and was more pleasant, but it wasn’t perfectly correct. If I did not have domain expertise I would not have been able to complete the task with the LLM.

When you argue that it saved you time by a factor of 10, have you actually measured that properly? I initially also had the feeling that LLMs saved me time, but in the end they didn't. I roughly compared my current performance to past performance by the number of stories completed, and LLMs made me slower even when I thought I was saving time...

From several months of deep work with LLMs, I think they are amazing pattern matchers, but not problem solvers. They suggest a solution pattern based on their trained weights. That can even produce real solutions, e.g., when programming Tetris or the like, but not when working on somewhat unique problems...

I am pretty confident. The last similar LVGL thing I did took me 10-12 hours, and that was with a quicker iteration loop (running locally instead of on the test hardware). Here I spent a little more than an hour, testing on real hardware, and the last 20 minutes was nitpicking.

Writing front-end display code and instantiating components to look right is very much playing to the model’s strength, though. A carefully written sentence plus context would become 40 lines of detail-dense but formulaic code.

(I have also had a lot of luck asking it to make a first pass at typesetting things in TeX, for similar reasons.)

There was a recent study that found that LLM users in general tend to feel more productive with AI while actually being less productive.

Presumably the study this very HN discussion is responding to.

Heh, yep. Guess I sometimes forget to read the content before commenting too.

> If I did not have domain expertise I would not have been able to complete the task with the LLM.

This kind of sums up my experience with LLMs too. They save me a lot of time reading documentation, but I need to review a lot of what they write, or it will just become too brittle and verbose.

I was trying out Copilot recently for something trivial. It made the change as requested, but also added a comment that stated something obvious.

I asked it to remove the comment, which it enthusiastically agreed to, and then... didn't. I couldn't tell if it was the LLM being dense or just a bug in Copilot's implementation.

Some prompts can help:

"Find the root cause of this problem and explain it"

"Explain why the previous fix didn't work."

Often, it's best to undo the action and provide more context/tips.

Often, switching to Gemini 2.5 Pro when Claude is stumped helps a lot.

My favourite recent experience was watching it switch multiple times between calling a library function and rolling its own implementation, each time claiming that it was "simplifying" the code and making it "more reliable".

Sometimes it does... sometimes.

I recently had a nice conversation looking for some reading suggestions from an LLM. The first round of suggestions were superb, some of them I'd already read, some were entirely new and turned out great. Maybe a dozen or so great suggestions. Then it was like squeezing blood from a stone but I did get a few more. After that it was like talking to a babbling idiot. Repeating the same suggestions over and over, failing to listen to instructions, and generally just being useless.

LLMs are great on the first pass but the further you get away from that they degrade into uselessness.

Yeah, when I first heard about "one-shot"ing, it felt more like a trick than a useful heuristic, but with time my experience has come to mimic yours; nowadays I try to one-shot small-ish changes instead of going back and forth.

I've had some luck in these cases prompting "your context seems to be getting too bloated. summarize this conversation into a prompt that I can feed into a new chat with a fresh context. make sure to include <...>".

Sometimes it works well the first time, and sometimes it spits out a summary where you can see what it is confused about, and you can guide it to create a better summary. Sometimes just having that summary in its context gets it over the hump and you can just say "actually I'm going to continue with you; please reference this summary going forward", and sometimes you actually do have to restart the LLM with the new context. And of course sometimes there's nothing that works at all.

I've had really good luck with having GPT generate a very, very detailed todo list, then having Claude use it to check items off. Still far from perfect, but since doing that I haven't run into context issues, because I can just start a new chat and feed it the todo (which also contains project info).

I also get things like this from very experienced engineers working outside their area of expertise. The suggestions are obviously less completely boneheaded, but it's still a case of doing exactly the wrong thing the AI suggested, which required a person to step in and correct it.

I recently reviewed an MR from a coworker. There was a test that was clearly written by AI, except I guess however he prompted it, it gave some rather poor variable names like "thing1", "thing2", etc. in the test cases. Basically, these were multiple permutations of data that all needed to be represented in the result set. So I asked for them to be named distinctively, maybe by what makes them special.

It's clear he just took that feedback and asked the AI to make the change, and it came up with a change that gave them all very long, very unique names that just listed all the distinguishing properties of each test case, but to the extent that they sort of became noise.
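Something like this would have been closer to what I asked for (a hypothetical pytest sketch with invented names and data, not the actual MR code): each permutation gets a name that says what makes it special, without enumerating every property.

```python
import pytest

def summarize_orders(orders):
    """Stand-in for the code under test: simply returns the orders it was given."""
    return list(orders)

# Each key names what makes that permutation special, nothing more.
CASES = {
    "plain": {"price": 100, "discount": 0.0, "express": False},
    "discounted": {"price": 80, "discount": 0.2, "express": False},
    "express": {"price": 100, "discount": 0.0, "express": True},
    "discounted_express": {"price": 80, "discount": 0.2, "express": True},
}

@pytest.mark.parametrize("name,order", list(CASES.items()), ids=list(CASES))
def test_each_permutation_appears_in_result(name, order):
    # Every permutation must show up in the result set.
    result = summarize_orders(CASES.values())
    assert order in result, f"{name} permutation missing from result set"
```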

It's clear writing the PR was very fast for that developer; I'm sure they felt they were X times faster than writing it themselves. But this isn't a good outcome for the tool either. And I'm sure that if they'd reviewed it to the extent I did, a lot of that gained time would have dissipated.

>a deep learning framework in Rust [...] This is becoming a troubling trend with AI tools.

The serpent is devouring its own tail.

OTOH, when they start getting good AI contributions, then... it'll be too late for us all.

Deep learning can be incredibly cool and not just used for AI slop.

> I suspect their motivation was just to get a commit on their record. This is becoming a troubling trend with AI tools.

It has been for a while; AI just makes spam more effective:

https://news.ycombinator.com/item?id=24643894

This is the most frustrating thing LLMs do. They put wide try/catch structures around the code, making it impossible to actually track down the source of a problem. I want my code to fail fast and HARD during development so I can solve every problem immediately.
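A minimal sketch of the difference (hypothetical Python with invented names, not any particular project's code): the first version is what the LLM tends to hand back, the second is the fail-fast version I actually want during development.

```python
import json

# LLM-style "defensive" version: a broad try/except swallows the real error,
# replaces the traceback with a vaguer message, and lets the failure travel
# silently downstream as an empty dict.
def load_config_defensive(path):
    try:
        with open(path) as f:
            return json.load(f)
    except Exception as e:
        print(f"Warning: could not load config: {e}")
        return {}

# Fail-fast version: a missing file or malformed JSON raises immediately,
# with a full traceback pointing at the actual source of the problem.
def load_config(path):
    with open(path) as f:
        return json.load(f)
```

If an error boundary is genuinely needed, it belongs at one deliberately chosen top level, not wrapped around every call.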

Seems like there's a need for GitHub to create a separate flow for AI-created PRs. Project maintainers should be able to stipulate rules like this in English, and an AI "pre-reviewer" would check that the AI has followed all these rules before the PR is created, and chat with the AI submitter to resolve any violations. For exceptional cases, a human submitter would be required.

Granted, the compute required is probably more expensive than github would offer for free, and IDK whether it'd be within budget for many open-source projects.

Also granted, something like this may be useful for human-sourced PRs as well, though perhaps post-submission so that maintainers can see and provide some manual assistance if desired. (And also granted, in some cases maybe maintainers would want to provide manual assistance to AI submissions, but I expect the initial triaging based on whether it's a human or AI would be what makes sense in most cases).

This is my number one complaint with LLM produced code too. The worst thing is when it swallows an error to print its own error message with far less info and no traceback.

In my rules I tell it that try/catches are completely banned unless I explicitly ask for one (an okay tradeoff, since usually my error boundaries are pretty wide and I know where I want them). I know the context length is getting too long when it starts ignoring that.

> The "fix" simply muted an error instead of addressing the root cause.

FWIW, I have seen human developers do this countless times. In fact there are many people in engineering that will argue for these kinds of "fixes" by default. Usually it's in closed-source projects where the shittiness is hidden from the world, but trust me, it's common.

> I suspect their motivation was just to get a commit on their record. This is becoming a troubling trend with AI tools.

There was already a problem (pre-AI) with shitty PRs on GitHub made to try to game a system. Regardless of how they made the change, the underlying problem is a policy one: how to deal with people making shitty changes for ulterior motives. I expect the solution is actually more AI to detect shitty changes from suspicious submitters.

Another solution (that I know nobody's going to go for): stop using GitHub. Back in the "olden times", we just had CVS, mailing lists and patches. You had to perform some effort in order to get to the point of getting the change done and merged, and it was not necessarily obvious afterward that you had contributed. This would probably stop 99% of people who are hoping for a quick change to boost their profile.

I will never forget being in a code review for an upcoming release where there was a method that was... different. Like massively different, with no good reason why it was changed as much as it was for such a small addition.

We asked the person why they made the change, and "silence". They had no reason. It became painfully clear that all they did was copy and paste the method into an LLM and say "add this thing" and it spit out a completely redone method.

So now we had a change that no one in the company actually understood, just because the developer took a shortcut. (This change was rejected and reverted.)

The scariest thing to me is no one actually knowing what code is running anymore, with these models having a tendency to make changes for the sake of making changes (and likely not actually addressing the root issue, but taking a shortcut like you mentioned).

As a side question: I work in AI, but mostly on Python and theory work. How can I best jump into Burn? Rust has been intriguing to me for a long time.

This is a real problem that’s only going to get worse. With the major model providers basically keeping all the data themselves, I frankly don’t like this trend long term.

You should be rejecting the PR because the fix was insufficient, not because it was AI agent written. Bad code is bad code regardless of the source. I think the fixation on how the code was generated is not productive.

No, that's not how code review works. Getting inside the mind of the developer, understanding how they thought about the fix, is critical to the review process.

If an actual developer wrote this code and submitted it willingly, it would constitute either malice (an attempt to sabotage the codebase or inject a trojan) or stupidity (a failure to understand the purpose of the error message). With an LLM we mostly have stupidity. Flagging it as such reveals the source of the stupidity, as LLMs do not actually understand anything.

The problem is that code often takes as long to review as to write, and AI potentially lowers the quality bar for pull requests. So maintainers end up with lots of low-quality PRs that take time to reject.

> You should be rejecting the PR because the fix was insufficient

I mean, they probably could've articulated it your way, but I think that's basically what they did... they point out the insufficient "fix" later, but the root cause of that "fix" was blind trust in AI output, so that's the part of the story they lead with.