3 days ago I saw another Claude-praising submission on HN, and finally signed up for it to compare it with Copilot.
I asked it 2 things.
1. Create a boilerplate Zephyr project skeleton for the Pi Pico with st7789 SPI display drivers configured. It generated a garbage devicetree which didn't even compile. When I pointed that out, it apologized and generated another one that didn't compile. It also configured non-existent drivers, and for some reason it enabled monkey test support (but not test support).
2. I asked it to create 7x10 monochromatic pixelmaps, as C integer arrays, for the numeric characters 0-9. I also gave an example. It generated them, but the number eight looked like a zero. (There was no cross in either the 0 or the 8, so it wasn't that. Both were just a ring.)
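For reference, here's roughly what I expected: a hand-rolled sketch with one byte per row and the low 7 bits used (bit 6 = leftmost pixel). The only difference between a 0 and an 8 is the crossbar row.

```c
/* 7x10 monochrome glyphs, one byte per row, low 7 bits used
 * (bit 6 = leftmost pixel). */
const unsigned char glyph_0[10] = {
    0x3E, /* .#####. */
    0x41, /* #.....# */
    0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
    0x41, /* #.....# */
    0x3E  /* .#####. */
};

const unsigned char glyph_8[10] = {
    0x3E, /* .#####. */
    0x41, 0x41, 0x41,
    0x3E, /* .#####.  <- the crossbar that distinguishes 8 from 0 */
    0x41, 0x41, 0x41, 0x41,
    0x3E  /* .#####. */
};
```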
What am I doing wrong? Or is this really the state of the art?
"What am I doing wrong?"
Your first prompt is testing Claude as an encyclopedia: has it somehow baked into its model weights exactly the correct skeleton for a "Zephyr project skeleton for the Pi Pico with st7789 SPI display drivers configured"?
Frequent LLM users will not be surprised to see it fail that.
The way to solve this particular problem is to make a correct example available to it. Don't expect it to just know extremely specific facts like that - instead, treat it as a tool that can act on facts presented to it.
For your second example: treat interactions with LLMs as an ongoing conversation, don't expect them to give you exactly what you want first time. Here the thing to do next is a follow-up prompt where you say "number eight looked like zero, fix that".
> For your second example: treat interactions with LLMs as an ongoing conversation, don't expect them to give you exactly what you want first time. Here the thing to do next is a follow-up prompt where you say "number eight looked like zero, fix that".
Personally, I treat those sorts of mistakes as "misunderstandings" where I wasn't clear enough in my first prompt, so instead of adding another message (and increasing the context further, making the responses worse with each message), I rewrite my first one to be clearer about that thing and regenerate the assistant message.
Basically, if the LLM cannot one-shot it, you weren't clear enough, and if you go beyond a total of two messages, be prepared for the quality of responses to sink fast. Even by the second assistant message, you can tell it's having a harder time keeping up with everything. Many models brag about their long contexts, but I still find the quality of responses to be a lot worse once you reach even 10% of the "maximum context".
You also need to state your background somehow, and at what level you want the answer to be. I often found the LLM would answer that what I'm asking is too complex and would take months to do. Then you have to say something like: ignore these constraints, assume I am already an expert in the field, and outline a plan for how to achieve this and that. Then drill down on the plan points. It's a bit of work, but it's fascinating.
Or it would say that to do X involves very complex math, and that instead you could... (and proceeds with a stripped-down solution that doesn't meet the goals). So you tell it to ignore the concerns about complexity and to assume that you understand all of it and it's easy for you. Then it goes on to create a solution that actually has legs. But you need to refine it further.
It’s good at doing stuff like “host this all in Docker. Make a Postgres database with a Users table. Make a FastAPI CRUD endpoint for Users. Make a React site with a homepage, login page, and user dashboard”.
It’ll successfully produce _something_ like that, because there are millions of examples of those technologies online. If you do anything remotely niche, you need to hold its hand far more.
The more complicated your requirements are, the closer you are to having “spicy autocomplete”. If you’re just making a crud react app, you can talk in high level natural language.
Did you try Claude Code and spend actual time going back and forth with it, reviewing its code and providing suggestions, instead of just expecting things to work on the first try with minimal requirements?
I see Claude Code as pair programming with a junior/mid dev that knows all fields of computer engineering. I still need to nudge it here and there; it will still make noob mistakes that I need to correct, and I let it know how to properly do things when it gets them wrong. But coding sessions have been great and productive.
In the end, I use it when working with software that I barely know. Once I'm up and running, I rarely use it.
> Did you try Claude Code and spend actual time going back and forth with it, reviewing its code and providing suggestions, instead of just expecting things to work on the first try with minimal requirements?
I did, but I always approached LLM for coding this way and I have never been let down. You need to be as specific as possible, be a part of the whole process. I have no issues with it.
FWIW, I used Gemini to write an Objective-C app for Apple Rhapsody (!) that would enumerate the drivers currently loaded by the operating system (more or less the same level of difficulty as the OP, I'd say?), using the PDF manual of NextStep's DriverKit as context.
It... sort of worked well? I had to have a few back-and-forths because it tried to use Objective-C features that did not exist back then (e.g. ARC), but all in all it was a success.
So yeah, niche things are harder, but on the other hand I didn't have to read 300 pages of stuff just to do this...
I remember writing Obj-C naturally by hand, before Swift was even a twinkle in Tim Cook's eye. One of my favorite languages to program in. I had a lot of fun writing iOS apps back in the day, it seems.
I 'member Obj-C. Using it was a profound experience; it was so different from other languages that I felt like an anthropologist.
Also, fun names like `makeFunctionNameInCommentLongAndDescriptiveWithNaturalLanguage:(NSLanguage *)language`
I agree, but I think there's an important distinction to be made.
In some cases, it just doesn't have the necessary information because the problem is too niche.
In other cases, it does have all the necessary information but fails to connect the dots, i.e. reasoning fails.
It is the latter issue that is affecting all LLMs to such a degree that I'm really becoming very sceptical of the current generation of LLMs for tasks that require reasoning.
They are still incredibly useful of course, but those reasoning claims are just false. There are no reasoning models.
In other words, the vibe coders of this world are just redundant noobs who don't really belong on the marketplace. They've written the same bullshit CRUD app every month for the past couple of years and now they've turned to AI to speed things up
Last week I asked Claude to improve a piece of code that downloads all AWS RDS certificates so that it downloads just the ones needed for that AWS region. It figured out several ways to determine the correct region, made a nice tradeoff, and suggested the most reliable way. It rewrote the logic to download the right set, doing some research in between to figure out the right endpoint. It only made one mistake: its fallback mechanism was picking EU, which was not correct. Maybe 1 hour of work. On my own it would have taken me close to a working day to figure it all out.
This is just a thought experiment.
I don't mean to step on any toes, but I'm noticing this more and more in the debates around AI. Imagine there are developers out there who could have done this task in 30 minutes without AI.
The level of performance of AI solutions is heavily related to the experience level of the developer and to the problem space being tackled - as this thread points out.
Unfortunately, the marketing around AI ignores this and makes every developer not using AI for coding seem like a dinosaur, even though they might well be faster at solving their particular problems.
AI is moving problem-solving skills from coding to writing the correct prompts and teaching the AI to do the right thing - which, again, is subjective, since the "right thing" for one developer isn't the "right thing" for another developer. "Right thing" being the correct solution, the understandable solution, the fastest solution, etc., depending on the needs of the developer using the AI.
IMHO, the thirty minute developer would still save 10 minutes by vibe coding. That marketing's not wrong.
Spelling out exactly what you want and checking/fixing what you receive is still faster than typing out the code. Moreover, nobody's job involves nothing but brainiac coding, day after day. You have to clean up and lay foundations, whatever level you are at.
> IMHO, the thirty minute developer would still save 10 minutes by vibe coding. That marketing's not wrong.
For me, that's too general. Of course, perhaps for this particular, specific problem it might be true. But as this thread points out, anything niche and AI fails to help productively. Of course then comes the marketing: just wait, AI will be able to cover those niche cases also.
> Spelling out exactly what you want and checking/fixing what you receive is still faster than typing out the code
Then I do wonder why there are developers at all. After all, that's what AI is so good at - if one believes the marketing - being precise and describing exactly what needs to be done. Surely it must be faster having two AIs talking to each other and hammering out the code.
And even typing is subjective: ten fingers versus two, versus four .. etc. There are developers that can type faster than they can think - in certain cases.
There is also the developer in flow versus the stop-and-go of using AI prompts to get it just right. I dunno; if it comes true, then thankfully there won't be any humans to create bugs in code, but somehow I can't see it happening.
There are two ways to do this. One is to one-shot or maybe few-shot a solution. Maybe this works. Maybe it doesn't. Sometimes it works if you copy a solution from [Product 1] to [Product 2] and say "Fix this."
The other is to look at the non-working solution you get, read through it, and think "Oh, I didn't know about that framework/system/product/library, that's neat" and then do some combination of further research and more hand-holding to get to something that does work.
This is useful, more or less, no matter what your level.
It's also good for explaining core industry tooling you've maybe never used before. If you're new to Postgres/NoSQL/AWS/Docker/SwiftUI/whatever it can talk you through it and give you an instant bootcamp with entry-level examples and decent solutions.
And for providing fixes for widely known bugs and issues in products that may not be widely known to you (yet.)
IME ChatGPT5 is pretty solid with most science/tech up to undergrad. It gets hallucinatory past that, and it's still flattering, which is annoying, but you can tell it to cut that out.
Generally you can use it as a dumb offshore developer, or as an infinitely patient private tutor.
That latter option is very useful. The first, not always.
> The level of performance of AI solutions is heavily related to the experience level of the developer and to the problem space being tackled - as this thread points out.
>
> Unfortunately, the marketing around AI ignores this and makes every developer not using AI for coding seem like a dinosaur, even though they might well be faster at solving their particular problems.
You're not necessarily wrong, but I think it's worth noting that very few developers only ever code deep in the one domain they're good at. There are just too many things to be deeply good at all of them. For example, it's common that infra and CI tasks are stuff most developers haven't learned by heart, because you don't tend to touch them very often.
Claude shines here. I've made a lot more useful GitHub Actions jobs recently: while I could always have automated something, if I knew I was going to have to look up API docs (especially multiple APIs I'm not super familiar with), I tended to figure that the automation would lose the trade-off against just doing the task by hand (see https://xkcd.com/1205/). Claude being able to hash those out rapidly, and in a way that makes it easily verifiable that it's doing the right thing, has changed that arithmetic for me substantially.
> Maybe 1 hour of work. On my own it would have taken me close to a working day to figure it all out.
1. Find out how to access metadata about the node running my code (assumption: some kind of an environment variable) [1-10 minutes depending on familiarity with AWS]
2. Google "RDS certificates" and find the bundle URL after skimming the page [1] for important info [1-5 minutes]
3. Write code to download the certificate bundle, fallback being "global-bundle.pem" if step 1 failed for some reason? [5-20 minutes depending on all the bells and whistles you need]
Did I miss anything or completely misunderstand the task?
[1] https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Using...
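For what it's worth, those three steps fit in a short script. Here's a minimal sketch in C with libcurl, using the bundle URLs documented at [1]; the env-var region lookup and the output path are assumptions, and error handling is trimmed:

```c
/* Sketch: download the regional RDS CA bundle, falling back to the
 * global bundle. URL scheme per the AWS docs linked above; the
 * AWS_REGION env var and output path are assumptions.
 * Build: cc rds_ca.c -lcurl */
#include <stdio.h>
#include <stdlib.h>
#include <curl/curl.h>

static int fetch(const char *url, const char *path) {
    CURL *curl = curl_easy_init();
    if (!curl) return -1;
    FILE *out = fopen(path, "wb");
    if (!out) { curl_easy_cleanup(curl); return -1; }
    curl_easy_setopt(curl, CURLOPT_URL, url);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, out);  /* default callback writes to FILE* */
    curl_easy_setopt(curl, CURLOPT_FAILONERROR, 1L); /* treat 404 etc. as failure */
    CURLcode rc = curl_easy_perform(curl);
    fclose(out);
    curl_easy_cleanup(curl);
    return rc == CURLE_OK ? 0 : -1;
}

int main(void) {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    /* Step 1: region from the environment (set on most AWS runtimes). */
    const char *region = getenv("AWS_REGION");
    char url[256];
    /* Steps 2+3: try the regional bundle, fall back to the global one. */
    if (region) {
        snprintf(url, sizeof url,
                 "https://truststore.pki.rds.amazonaws.com/%s/%s-bundle.pem",
                 region, region);
        if (fetch(url, "rds-ca.pem") == 0) return 0;
    }
    return fetch("https://truststore.pki.rds.amazonaws.com/global/global-bundle.pem",
                 "rds-ca.pem") == 0 ? 0 : 1;
}
```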
edit: I asked Claude Sonnet 4 to write robust code for a Node.js application that downloads the RDS CA bundle for the AWS region the code is currently running in and saves it at the supplied filesystem path.
0. It generated about 250 lines of code
1. Fallback was us-east (not global)
2. The download URLs for each region were hardcoded as KV pairs instead of being constructed dynamically
3. Half of the regions were missing
4. It wrote a function that verifies whether the certificate bundle looks valid (i.e. includes a PEM header)... but only calls it on the next application startup, instead of doing so before saving a potentially invalid certificate bundle to disk and proceeding with the application startup.
5. When I complained that half of my instances are downloading global bundles instead of regional ones (because they're not present in the hardcoded list), it:
- incorrectly concluded that not all regions have CA bundles available and hardcoded a duplicate list in 2 places containing regions that are known to offer CA bundles (which is all of them). These lists were even shorter than the last ones.
- wrote a completely unnecessary function that checks whether a regional CA bundle exists with a HEAD request before actually downloading it with a GET request, adding another 50 lines of code
Now I'm having to scrutinize 300 lines of code to make sure it's not doing something even more unexpected.
I think the majority of coders out there write the same CRUD app over and over again in different flavors. That's what the majority of businesses seem to pay for.
If a business needs the equivalent of a Toyota Corolla, why be upset about the factory workers making the millionth Toyota Corolla?
> I think the majority of coders out there write the same CRUD app over and over again in different flavors
In my experience, that's not entirely true. Sure, a lot of apps are CRUD apps, but they are not the same. The spice lies in the business logic, not in programming the CRUD operations. And then of course: scaling, performance, security, organization, etc.
Good thing LLMs are really good at unique business logic, scaling, performance, security, organization, etc etc.!
(edit: /s to indicate sarcasm)
Yeah, my experience with LÖVR [0] and LLMs (ChatGPT) has been quite horrible. It's very niche, and quite recently a big API change happened, which I guess the model wasn't trained on. So it's kind of useless for that purpose.
---
[0]: https://lovr.org
> What am I doing wrong
Trying two things and giving up. It's like opening a REPL for a new language, typing some common commands you're familiar with, getting some syntax errors, then giving up.
You need to learn how to use your tools to get the best out of them!
Start by thinking about what you'd need to tell a new Junior human dev you'd never met before about the task if you could only send a single email to spec it out. There are shortcuts, but that's a good starting place.
In this case, I'd specifically suggest:
1. Write a CLAUDE.md listing the toolchains you want to work with, giving context for your projects, and listing the specific build, test, etc. commands you work with on your system (including any helpful scripts/aliases you use). Start simple; you can have Claude add to it as you find new things you need to tell it, or things it spends time working out (so that you don't need to do that every time). A minimal example is sketched after this list.
2. In your initial command, include a pointer to an example project using similar tech in a directory that Claude can read.
3. Ask it to come up with a plan and ask for your approval before starting
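For instance, a starter CLAUDE.md for the OP's setup might look like the following. The contents are illustrative; the paths, board name, and commands are assumptions to adapt to your own workspace:

```markdown
# Project notes for Claude

## Toolchain
- Zephyr RTOS, west workspace at ~/zephyrproject (example path)
- Target board: rpi_pico

## Commands
- Build: west build -b rpi_pico app/
- Flash: west flash
- Run tests: west twister -T tests/

## Conventions
- Check the bindings under zephyr/dts/bindings before writing devicetree nodes.
- A known-good st7789 example project lives at ~/examples/pico-display (hypothetical).
```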
I guess many find comfort in being able to task an AI with assignments that it cannot complete. Most sr. developers I work with take this approach. It's not really a good way of assessing the usefulness of a tool, though.
He asked what he was doing wrong?
too big of tasks. break them down and then proceed from there. have it build out task lists in a TASKS.md. review those tasks. do you agree? no? work with it to refine. implement one by one. have it add the tests. refactor after a while, as {{model}} doesn't like to do utility functions a lot. right now, about +50k lines into a project that's vibecoded. i sit back and direct and it plays.
Imagine the CS 100 class where they ask you to make a PB&J. just saying "make it" isn't enough; there are a lot of steps, but you determine the steps, then implement each step. progress.
Too big and requiring too much niche specific knowledge, you somehow have to inject that knowledge and allow it to iterate.
This is the way.
I run interviews at my company. We allow/encourage AI.
The number one failure mode is people throwing all of the requirements in upfront. They get one good pass, then fail.
I was part of a shop that did the Pivotal Way and we had Inceptions where the PM, engineers, and a tester or two would be sequestered in a conference room for the day to bang out task lists that went into mid-level fidelity. Technical considerations were debated, sometimes in a heated way, but we never got into implementation—just structure and flow to ensure it jibes.
…this reeeeaaaallllyyyy feels like that
I'm inclined to agree with this approach because someone not using AI who fails would likely fail for the same reasons. If you can't logically distill a problem into parts you can't obtain a solution.
Think of Claude as a typical software developer.
If you just selected a random developer, do you think they're going to have any idea what you're talking about?
The issue is LLMs will never say "sorry, IDK how to do this". Like a stressed-out intern, they just make up stuff and hope it passes review.
> What am I doing wrong?
Providing woefully inadequate descriptions to others (Claude & us) and still expecting useful responses?
Try this prompt: Create a detailed step-by-step plan to implement a boilerplate Zephyr project skeleton for the Pi Pico with configured st7789 SPI display drivers.
Ask Opus or Gemini 2.5 Pro to write a plan. Then ask the other to critique it and fix mistakes. Then ask Sonnet to implement
I tried this myself, and IMO, while this might be basic and day-to-day for you, with unambiguous correct paths to follow, it is pretty niche nevertheless. LLMs thrive when there's a wealth of examples, and I struggled to Google what you asked myself, meaning the LLM will perform even worse than my attempt.
I found that second line works well for image prompts too. Tell one AI to help you with a prompt, and then take it back to the others to generate images.
Is there a way to do this kind of design->critique->implement without switching tools? Like an end-to-end solution that consults multiple LLMs?
Claude Code with Zen MCP. Or Kiro, but then you don't get a second LLM's opinion.
> It also configured non-existent drivers, and for some reason it enabled monkey test support (but not test support).
If it doesn't have the underlying base data, it tends to hallucinate. (It's getting a bit difficult to tell when it has the underlying data, because some models autonomously search the web.) The models are good at transforming data, however, so give it access to whatever data it needs.
Also let it work in a feedback loop: tell it to compile and fix the compile errors. You have to monitor it because it will sometimes just silence warnings and use invalid casts.
> What am I doing wrong? Or is this really the state of the art?
It may sound silly, but it's simply not good at 2D
> It may sound silly, but it's simply not good at 2D
It's not silly at all; it's not very good at layouts either. It can generally make layouts, but there is a high chance of subtle errors: element overlaps, text overflows, etc.
Mostly because it's a language model, i.e. it doesn't generally see what it makes. You can apparently send screenshots and it will use its embedded vision model, but I have not tried that.
What you're doing wrong is that you're asking it for something more complicated than babby's first webapp in javascript/python.
When people say things like "I told Claude what I wanted and it did it all on the first try!", that's what they mean. Basic web stuff that is already present in the model's training data in massive volumes, so it has no issue recreating it.
No matter how much AI fanatics try to convince you otherwise, LLMs are not actually capable of software engineering and never will be. They are largely incapable of performing novel tasks that are not already well represented in their weights, like the ones you tried.
What they are not capable of is replacing YOU, the human who is supposed to be part of the whole process (incl. architectural). I do not think that this is a limitation. In fact, I like being part of the process.
> What am I doing wrong?
My coding ranges from "exotic" to "boilerplate" on any given day.
> Create a boilerplate Zephyr project skeleton for the Pi Pico
Yea... Asking Claude to help you with a low-documentation Buildroot system is going to go about the same way. I know first-hand how this works.
> I asked it to create 7x10 monochromatic pixelmaps
Wrong tool for the job here. I don't think IDEs and pixelmaps have as large an intersection as you think they do. Claude thinks in tokens, not pixels.
Pick a common language (js, python, rust, golang) pick something easy (web page, command line script, data ingestion) and start there. See what it can do and does well, then start pushing into harder things.
Ok, several tips I can give:
1. Set up a sub-agent to do RESEARCH. It is important that it only has read-only and web access tools.
2. Use planning mode, and also ask the agent to use the sub-agent to research best practices with the tech you're working with, before it builds a plan.
3. Whenever it gets hung up, tell it to use the sub-agent to research the solution.
That will get you a much better initial solution. I typically use Sonnet for the sub-agents and Opus for the main agent, but Sonnet all around should be fine too for the most part.
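If you're in Claude Code, a sub-agent like that can be defined as a markdown file with YAML frontmatter under .claude/agents/. This sketch is from memory, so check the current docs for the exact fields and tool names:

```markdown
---
name: researcher
description: Read-only research agent for looking up docs and best practices.
tools: Read, Grep, Glob, WebFetch, WebSearch
---

You are a research assistant. Look up documentation and working examples,
then report back with a summary and links. Never modify files.
```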
There's a lot of people caricaturing the obvious fact that any model works best in distribution.
The more esoteric your stack, and the more complex the request, the more information it needs to have. The information can be given either through doing research separately (personally, I haven't had good results when asking Claude itself to do research, but I did have success using the web chat UI to create an implementation plan), or being more specific with your prompt.
As an aside, I have more than 10 years of experience, mostly with backend Python, and I'd have no idea what your prompts mean. I could probably figure it out after some google searches, tho. That's also true of Claude.
Here's an example of a prompt I used recently when working on a new codebase. The code is not great, and the math involved is non-trivial (it's research-level code that's been productionized in a hurry). This literally saved 4 hours of extremely boring work: digging through the code to find various hardcoded filenames, downloading them, scp'ing them, and using them to do what I want. It one-shotted it.
> The X pipeline is defined in @airflow/dags/x.py, and Y in `airflow/dags/y.py`, and the relevant tasks are `compute_X` and `compute_Y`, respectively. Your task is to:
> 1. Analyze the X and Y DAGs and how the `compute_X` functions are called in that particular context, including their arguments. If we're missing any files (we're probably missing at least one), generate a .sh file with aws cli or curl commands necessary for downloading any missing data (I don't have access to S3 from this machine, but I do have it in a remote host). Use, say, `~/home` as the remote target folder.
> 2. If we needed to download anything from S3, i.e. from the remote host, output rsync/scp commands I can use to copy them to my local folder, keeping the correct/expected directory structure. Note that direct inputs reside under `data/input`, while auxiliary data resides in other folders under `data`. Do not run them, simply output them. You can use for example `scp user@server.org ...`
> 3. Write another snapshot test for X under `tests/snapshot`, and one for Y. Use a pattern as similar as possible to the other tests there. Do not attempt to run the tests yet, since I'll need to download the data first.
> If you need any information from Airflow, such as logs or output values, just ask and I can provide them. Think hard.
Real vibe coding is fake, especially for something niche like what you asked it to do. Imagine a hyperactive eidetic fresh out of high school was literally sitting in the other room. What would you tell her? That’s a good rule of thumb for the level of detail and guidance
> What am I doing wrong? Or is this really the state of the art?
You're treating the tool like it was an oracle. The correct way is to treat it as a somewhat autistic junior dev: give it examples and a process to follow, tell it to search the web and read the docs, and tell it how to execute tests. Especially important is either directly linking or just copy-pasting any and all relevant documentation.
The tool has a lossily compressed knowledge database of the public internet and lots of books. You want to fix the relevant lossy parts in the context. The less popular something is, the more context will be needed to fill the gaps.
> The correct way is to treat it as a somewhat autistic junior dev: give it examples and a process to follow, tell it to search the web and read the docs, and tell it how to execute tests. Especially important is either directly linking or just copy-pasting any and all relevant documentation.
Like "Translate this pdf to html using X as a templating language". It shines at stuff like that.
As a dev, I encounter tons of one-off scenarios like this.
You can no longer answer "what is the state of the art" by pointing to a model.
Generating a state-of-the-art response to your request involves a back-and-forth with the agent about your requirements, having the agent generate and carry out a deep research plan to collect documentation, and then having the agent generate and carry out a development plan.
So while Claude is not the best model in terms of raw IQ, the reason why it's considered the best coding model is because of its ability to execute all these steps in one go which, in aggregate, generates a much better result (and is less likely to lose its mind).
> So while Claude is not the best model in terms of raw IQ
Which one is, and by what metric? I always end up back at Claude after trying other models because it is so much better at real world applications.
In my experience Claude is quite good at the popular stacks in the JavaScript, Python and PHP world. It struggled quite a bit when I asked it non-trivial questions in C or Rust for example. For smaller languages (e.g., Crystal) it seems to hallucinate a lot. I think since a lot of people work in JS, Python and PHP, that’s where Claude is very valuable and that’s where a lot of the praise feel justified too.
I have had no problems with using Claude on large rust projects. The compiler errors usually point it towards fixing its mistakes (just like they do for me).
Feed it Crystal documentation and example code. That is what I did with more obscure programming languages and it worked out well in the end.
I've had similar experiences when working on non-web tech.
There are parts of the codebase I'd love some help with, such as overly complex C++ templates, and it almost never works out. Sometimes I get useful pointers (no pun intended) to what the problem actually is, but even that seems a bit random. I wonder if it's actually faster or slower than traditional reading & thinking myself.
The only way I manage to get any benefits from LLMs is to use them as an interactive rubber duck.
Dump your thoughts in a somewhat arranged manner; tell it about your plan, the current status, the end goal, &c. After that, tell it to write 0 code for now, but to ask questions and find gaps in your plan. 30% of it will be bullshit, but the rest is somewhat usable. Then you can ask for some code, but if you care about quality or consistency with your existing code base, you'll probably have to rewrite half of it, and that's if the code works in the first place.
Garbage in, garbage out is true for training, but it's also true for interactions.
I just had AI write me a scraper and download 5TB of invaluable data which I had been eyeing for a long time. All in ten days. At the end of it, I still don’t know anything about python. It’s a bliss for people like me. All dependencies installed themselves. I look forward to using it even more.
One frustration was that the code changed so much in ChatGPT that it took lots of prompts. But I had no idea what the code was anyway; now I understand vibe coding. Just used ChatGPT on a whim. Liked the end result.
You didn't specify any architecture design. Your prompts are about 10% of what would be needed to one-shot this. That is what you're doing wrong.
So, I've used Zephyr. The thing you're doing wrong is expecting LLMs to scaffold a bunch of files from a relatively niche domain. Zephyr is also a mess of complexity with poor documentation. You should ask it to consult the official docs, and ask it to use existing tools (west etc.) and board defs to do the scaffolding.
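For context, the display half of what the OP asked for boils down to a small overlay plus a couple of Kconfig options. This is a hedged sketch from memory: the `sitronix,st7789v` binding does exist in Zephyr, but the pin assignments and panel geometry here are placeholders, and how the node is wired (directly under the SPI node vs. via MIPI DBI) varies by Zephyr release, so verify against the binding in your tree:

```dts
/* Sketch of an st7789v overlay for the Pi Pico (rpi_pico).
 * Pin numbers and geometry are placeholders -- check the
 * sitronix,st7789v binding for your Zephyr release before use. */
/ {
    chosen {
        zephyr,display = &st7789v;
    };
};

&spi0 {
    status = "okay";
    st7789v: st7789v@0 {
        compatible = "sitronix,st7789v";
        reg = <0>;
        spi-max-frequency = <20000000>;
        cmd-data-gpios = <&gpio0 16 GPIO_ACTIVE_LOW>; /* placeholder pin */
        reset-gpios = <&gpio0 17 GPIO_ACTIVE_LOW>;    /* placeholder pin */
        width = <240>;
        height = <240>;
        x-offset = <0>;
        y-offset = <0>;
    };
};
```

Plus CONFIG_SPI=y and CONFIG_DISPLAY=y in prj.conf. It's not much code, but every property has to match the real binding, which is exactly the kind of thing an LLM will confabulate without the docs in context.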
LLMs are actually terrible at generating art unless they're specifically trained for that type of work. It's crazy how many times I've asked for some UI elements to be drawn using a graphics context and it comes out totally wrong.
One of the things you can do is provide a guidance file like CLAUDE.md, including not only style preferences but also domain knowledge, so it has greater context and knows where to look. Just ask it to make one, and then update and change it as needed.
Tbh dawg, those tasks sound intentionally obtuse. It looks like u are doing more esoteric work than the crud react slop us mortals play in on the daily which is where ai shines.
I work almost exclusively with embedded devices, with low level code (mostly C, Rust, Assembly and related frameworks) - and that's where I also ask for help from LLMs.
Did you intentionally pick your career to make the AI look bad?
It works fine in those domains. I speak from experience. You need CI tools the agent can access, and lots of tests.
I find it useful to ask it to build a design document first, and to push it to add details where I see it lacking.
After a few iterations I then ask it to implement the design doc, with mostly better results.
I managed to get most AIs to generate C# code when I ask for Java stuff, so it is always a kind of template generator that still isn't quite there.
That's interesting. I use it mainly for C# and Javascript/Frontend stuff.
I wonder if it's because there are maybe millions of MSDN articles, but I don't know if a Java analog to MSDN exists.
I think you need to play around with some of the early codegen models so you can get a better intuition for how LLMs work/fail.
Sounds like you picked some obscure tasks to test it that would obviously have low representation in the data set? That is not to say it can't be helpful augmenting some lower represented frameworks/tools - just you'll need to equip it with better context (MCPs/Docs/Instruction files)
A key skill in using an LLM agentic tool is being discerning about which tasks to delegate to it and which to take on yourself. Try to develop that skill and maybe you will have better luck.
Claude is bad at embedded. Not sure why, it just is what it is for now.
What an odd thing to ask it. I installed Claude Code and ran it from my terminal. I just asked it to give me a Node-based REST API with X endpoints doing these jobs, and then I told it to write the Unreal Engine C++ to consume those endpoints. 2500 lines of code later, it worked.
If you ask for more than a single function, it's more trouble than it's worth.
The thing you are doing wrong is asking it to solve hard problems. Claude Code excels at solving fairly easy, but tedious stuff. Refactors that are brainless but take an hour. It will knock those out of the park. Fire up a git worktree and let it spin on your tedious API changes and stuff while you do the hard stuff. Unfortunately, you'll still need to use your brain for that.
Write some hooks dawg