> I am managing projects in languages I am not fluent in—TypeScript, Rust and Go—and seem to be doing pretty well.

This framing reminds me of the classic problem in media literacy: people can tell when a journalistic source is poor on subjects where they're experts, but tend to assume the same source is at least passably good on subjects they're less familiar with.

I’ve had the same experience as the author when doing web development with LLMs: it seems to be doing a pretty good job, at least compared to the mess I would make. But I’m not actually qualified to make that determination, and I think a nontrivial amount of AI value is derived from engineers believing they are qualified to make it.

Yup — this doesn't match my experience using Rust with Claude. I've spent 2.5 years writing Rust professionally, and I'm pretty good at it. Claude will hallucinate things about Rust code because it’s a statistical model, not a static analysis tool. When it’s able to create code that compiles, the code is invariably inefficient and ugly.

But if you want it to generate chunks of usable and eloquent Python from scratch, it’s pretty decent.

And, FWIW, I’m not fluent in Python.

> Claude will hallucinate things about Rust code because it’s a statistical model, not a static analysis tool.

I think that's the point of the article.

Whether it's a dynamic language or a compiled one, it's going to hallucinate either way. If you're vibe coding in a compiled language, the errors are caught earlier, so you can vibe code them away before they blow up at run time.

Static analysis tools like rustc and clippy are powerful, but there are large classes of errors that escape those analyses, such as off-by-one errors.
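
A minimal sketch of what I mean (hypothetical function, not from any real project): this compiles without a single complaint from rustc, because there's no type, borrow, or lifetime error for the tooling to hang a diagnostic on.

```rust
/// Intended to sum every reading, but the `take(len - 1)` silently
/// drops the last one. The compiler is perfectly happy with this.
fn total(readings: &[f64]) -> f64 {
    readings.iter().take(readings.len() - 1).sum()
}

fn main() {
    // Prints 3 instead of the expected 6.
    println!("{}", total(&[1.0, 2.0, 3.0]));
}
```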

> If you're vibe coding in a compiled language, the errors are caught earlier, so you can vibe code them away before they blow up at run time

You can say that again.

I was looking through the many comments for this particular one, and you hit the nail on the head.

The irony is that it took the entire GenAI -> LLM -> vibe coding cycle to settle the argument that typed languages are better for human coding and software engineering.

Sure, but in my experience the advantage is smaller than one would imagine. LLMs are really good at pattern matching, and as long as they have the API and the relevant source code in their context they won't make many (or any) of the errors that humans are prone to.

Hah... yeah, no, its Python isn't great. It's definitely workable, and better than what I see from 9/10 junior engineers, but it tends to be pretty verbose and over-engineered.

My repos all have pre-commit hooks which run the linters/formatters/type-checkers. Both Claude and Gemini will sometimes write code that won't get past mypy, and they'll then struggle to get it typed correctly before eventually bypassing the pre-commit check with `git commit -n`.

I've had to add some fairly specific instructions to CLAUDE.md/GEMINI.md to get them to cut this out.

Claude is better about following the rules. Gemini just flat out ignores instructions. I've also found Gemini is more likely to get stuck in a loop and give up.

That said, I'm saying this after about 100 hours of experience with these LLMs. I'm sure they'll get better with their output and I'll get better with my input.

I can confirm input matters a lot. I'm a couple of hundred hours ahead of you and my prompting has come along a lot. I recommend test cycles, prompts to reflect on product-implementation fit (e.g., "is this what you've been asked to do?"), and lots of interactivity. Despite what I've written elsewhere in these comments, the best work is a good oneshot followed by small iterations and attentive steering.

[deleted]

To be fair, depending on what libraries you’re using, Python typing isn’t exactly easy even for a human; I spend more time battling with type checkers and stubs than I would like.

With access to good MCP tools, I've had really good experience using claude code to write rust: https://news.ycombinator.com/item?id=44702820

What MCP tools are you using?

Honestly, it's mostly just some random LSP adapter I forked and fixed a few bugs on. It's not even that comprehensive, but it goes a long way and seems like the most essential piece. Then I have some notes in the long-term context about how to use a combination of the gh CLI and cargo docs to read documentation and dependency source code/examples.

A few things beyond your question, for anyone curious:

I've also poked around with a custom MCP server that attempts to teach the LLM how to use ast-grep, but that didn't really work as hoped. It helps sometimes, but smaller LLMs stumble over ast-grep's YAML indentation, so my next shot on that project will be to rely on GritQL, which is more like a template language for AST-aware code transformations.

Lastly, there are probably a lot of little things in my long-term context that help get into a successful flow. I wouldn't be surprised if a key difference between getting good results and getting bad results with these agentic LLM tools is how people react to failures. If a failure makes you immediately throw up your hands and give up, you're not doing it right. If instead you press the little '#' (in Claude Code) and enter some instructions into the long-term context memory, you'll get results. It's about persistence and really learning to understand these things as tools.

LLMs are famously bad at producing Rust code. I'm not sure how much of it is the smaller amount of Rust code in the training data, or just the fact that Rust has a very large number of pitfalls, and a large standard library with many edge cases and things you'd imagine should exist but don't for a variety of reasons. Rust also has a much wider variety in the way things can be structured, compared to something like Go, where there is often only one way of doing a particular thing.

Honestly, I don't think these are problems that Rust has. What I see LLMs struggle with in Rust is more to do with understanding the language semantics at a fundamental level - exactly the things that the compiler statically verifies. For example, they will see things they think are "use-after-free" or "use-after-move", neither of which is a thing in (safe) Rust, because they don't understand that the language does not have these problems.

Largely I think LLMs struggle with Rust because it is one of very few languages that actually does something new. The semantic gap is just way bigger than the difference between, say, Go and TypeScript. I imagine they would struggle just as much with Haskell, OCaml, Prolog, and other interesting languages.

Obviously you can write a use-after-free in Rust. The fact that it won't compile doesn't really matter when you're feeding the text to a non-compiler program like an LLM. I trust you don't mean to get carried away and suggest that they're somehow grammatically impossible.
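
For example, here's a use-after-free in spirit, a reference outliving the value it points to. It's trivially easy to type; the compiler is just the thing that refuses to let it through:

```rust
fn main() {
    let r;
    {
        let s = String::from("hello");
        r = &s; // borrow `s`
    } // `s` is dropped here while still borrowed
    println!("{r}"); // error[E0597]: `s` does not live long enough
}
```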

I feel like I have had just as much luck with LLMs writing Rust as I have had with Java, Kotlin, and Swift. Which is better than C++ and worse than Python. I think that mostly comes down to the relative abundance of training data for these types of codebases.

But that is all independent of how the LLMs are used, especially in an agentic coding environment. Strongly/statically typed languages with good compiler messages have a very fast feedback loop via parsing and type checking, and agentic coding systems that are properly guided (with rulesets like CLAUDE.md files) can iterate much more quickly because of it.

I find that even with relatively obscure languages (like OCaml and Scala), the time and effort it takes to get good outcomes is dramatically reduced, albeit with a higher cost due to the fact that they don't usually get it right on the first try.

I have had very good results using Claude to write Rust. My prompting is often something like

'I have a database table Foo, here is the DDL: <sql>; create CRUD endpoints at /v0/foo, and use the same coding conventions used for Bar.'

I find it copies existing code style pretty well.
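
For a concrete picture (entirely hypothetical: an axum + serde stack is assumed, and the Foo columns are invented), the scaffold that kind of prompt comes back with looks roughly like:

```rust
use axum::{routing::get, Json, Router};
use serde::{Deserialize, Serialize};

// Hypothetical shape inferred from the DDL in the prompt.
#[derive(Serialize)]
struct Foo {
    id: i64,
    name: String,
}

#[derive(Deserialize)]
struct NewFoo {
    name: String,
}

async fn list_foo() -> Json<Vec<Foo>> {
    // The real version queries the database, mirroring the Bar handlers.
    todo!()
}

async fn create_foo(Json(_new): Json<NewFoo>) -> Json<Foo> {
    todo!()
}

#[tokio::main]
async fn main() {
    let app = Router::new().route("/v0/foo", get(list_foo).post(create_foo));
    let listener = tokio::net::TcpListener::bind("127.0.0.1:3000").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}
```

In practice the handler bodies and error handling get filled in by copying whatever conventions the existing Bar endpoints use.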

> When it’s able to create code that compiles, the code is invariably inefficient and ugly.

Why not have static analysis tools on the other side of those generations that constrain how the LLM can write the code?

> Why not have static analysis tools on the other side of those generations that constrain how the LLM can write the code?

We do have that; we call them programmers. Without such tools you don't get much useful output at all. But beyond that, static analysis tools aren't powerful enough to detect the kinds of problems and issues these language models create.

I'd be interested to know the answer to this as well. Considering the wealth of AI IDE integrations, it's very eyebrow-raising that there are zero instances of this. It seems like somewhat low-hanging fruit to rule out tokens that are clearly syntactically or semantically invalid.

I’d like to constrain the output of the LLM by accessing the probabilities for the next token, pick the next token that has the highest probability and also is valid in the type system, and use that. Originally OpenAI did give you the probabilities for the next token, but apparently that made it easy to steal the weights, so they turned that feature off.
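
Conceptually (this is just a sketch, not any real API), the selection step itself is simple; all the difficulty hides inside the validity check:

```rust
/// Sketch of type-constrained greedy decoding: among the tokens the model
/// proposes, keep only those an external checker (parser/type system)
/// accepts, then take the most probable one. `is_valid` is hypothetical;
/// in practice it would re-run an incremental parser or type checker.
fn pick_next_token(probs: &[f32], is_valid: impl Fn(usize) -> bool) -> Option<usize> {
    probs
        .iter()
        .enumerate()
        .filter(|(token_id, _)| is_valid(*token_id))
        .max_by(|(_, a), (_, b)| a.partial_cmp(b).unwrap())
        .map(|(token_id, _)| token_id)
}

fn main() {
    // Toy vocabulary of four tokens: token 2 is the most probable overall
    // but fails the (pretend) type check, so token 0 wins.
    let probs = [0.30, 0.15, 0.45, 0.10];
    let next = pick_next_token(&probs, |id| id != 2);
    println!("{next:?}"); // Some(0)
}
```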

It's been tried already and doesn't work. Very often a model needs to emit tokens that aren't valid yet but will become so later.

This can be done: I gave mine a justfile and early in the project very attentively steered it towards building out quality checks. CLAUDE.md also contains instructions to run those after each iteration.

What I'd like to see is the CLI's integration with VS Code etc. extended to understand the things the IDE has given us for free for years.

> When it’s able to create code that compiles, the code is invariably inefficient and ugly.

At the end of the day this is a trivial problem. When Claude Code finishes a commit, just spin up another Claude Code instance and say "run a git diff, find and fix inefficient and ugly code, and make sure it still compiles."

After decades of writing software, I feel like I have a pretty good sense for "this can't possibly be idiomatic" in a new language. If I sniff something is off, I start Googling for reference code, large projects in that language, etc.

You can also just ask the LLM: are you sure this is idiomatic?

Of course it may lie to you...

> You can also just ask the LLM: are you sure this is idiomatic?

I found the reverse flow to be better. Never argue. Start asking questions first. "What is the idiomatic way of doing x in y?" or "Describe idiomatic y when working on x" or similar.

Then gather some of the useful bits out of the "pedantic" generations and add them to your constraints, model.md, task.md, or whatever your setup uses.

You can also use this for a feedback loop. "Here's a task and some code, here are some idiomatic concepts in y, please provide feedback on adherence to these standards".

> If I sniff something is off, I start Googling for reference code, large projects in that language, etc.

This works so long as you know how to ask the question. But it's been my experience that an LLM directed on a task will do something, and I don't even know how to frame its behavior in language in a way that would make sense to search for.

(My experience here is with frontend in particular: I'm not much of a JS/TS/HTML/CSS person, and LLMs produce outputs that look really good to me. But I don't know how to even begin to verify that they are in fact good or idiomatic, since there's more often than not multiple layers of intermediating abstractions that I'm not already familiar with.)

I'm not much of a JS/TS/HTML/CSS person either. But if I think something looks off and it's something I care about, then I'll lose a day boning up on that thing.

To your point that you're not sure what to search for, I do the same thing I always do: I start searching for reference documentation, reading it, and augmenting that with whatever prominent code bases/projects I can find.

This motivates the question: if you're doing all this work to verify the LLM, is the LLM really saving you any time?

After just a few weeks in this brave new world my answer is: it depends, and I'm not really sure.

I think over time as both the LLMs get better and I get better at working with them, I'll start trusting them more.

One thing that would help with that would be for them to become a lot less random and less sensitive to their prompts.

> and I don't even know how to frame its behavior in language in a way that would make sense to search for.

Have you tried recursion? Something like: "Using idiomatic terminology from the foo language ecosystem, explain what function x is doing."

If all goes well it will hand you the correct terminology to frame your earlier question. Then you can do what the adjacent comment describes and ask it what the idiomatic way of doing p in q is.

I think you’re missing the point. The point is that I’m not qualified to evaluate the LLM’s output in this context. Having it self-report doesn’t change that fact, it’s just playing hide the pickle by moving the evaluation around.

Not at all - my point was that it can effectively tutor you sufficiently for you to figure out if the code it wrote earlier was passable or not. These things are unbelievably good at knowledge retrieval and synthesis. Gemini makes lots of boneheaded mistakes when it comes to the finer points of C++ but it has an uncanny ability to produce documentation and snippets in the immediate vicinity of what I'm after.

Sure, that approach could fail in the face of it having solidly internalized an absolutely backwards conception of an entire area. But that seems exceedingly unlikely to me.

It will also be incredibly time consuming if you're starting from zero on the topic in question. But then if you're trying to write related code you were already committed to that uphill battle, right?

I think the concept of "readability" is a good one: it's a program within Google where your code gets reviewed by an expert in that language (though not necessarily in your application/domain); once you're up to the level of writing idiomatic code and fully understanding the language, you get readability yourself.

When reviewing LLM code, you should have this readability in the given language yourself - or the code should not be important.

Gell-Mann Amnesia [0]

[0] https://en.m.wikipedia.org/wiki/Gell-Mann_amnesia_effect

Thank you! I couldn’t remember the term.

> I couldn’t remember the term.

That's lethologica! Or maybe in this specific case lethonomia. [0]

[0] https://en.m.wikipedia.org/wiki/Tip_of_the_tongue

[deleted]

Which is why I only use it on stuff I can properly judge.

[deleted]