> I have never heard anybody successfully using LLMs say this before. Most of what I've learned from talking to people about their workflows is counterintuitive and subtle.
Because, for all our posturing about being skeptical and data-driven, we all believe in magic.
Those "counterintuitive non-trivial workflows"? They work about as well as just prompting "implement X" with no rules, agents.md, careful lists, etc.
Because 1) literally no one actually measures whether the magical incantations work and 2) it's impossible to make such measurements due to non-determinism.
The problem with your argument here is that you're effectively saying that developers (like myself) who put effort into figuring out good workflows for coding with LLMs are deceiving themselves and wasting their time.
Either I've wasted significant chunks of the past ~3 years of my life or you're missing something here. Up to you to decide which you believe.
I agree that it's hard to take solid measurements due to non-determinism. The same goes for managing people, and yet somehow many good engineering managers can judge if their team is performing well and figure out what levers they can pull to help them perform better.
That's not a problem, that is the argument. People are bad at measuring their own productivity. Just because you feel more productive with an LLM does not mean you are. We need more studies and less anecdata.
I'm afraid all you're going to get from me is anecdata, but I find a lot of it very compelling.
I talk to extremely experienced programmers whose opinions I have valued for many years before the current LLM boom who are now flying with LLMs - I trust their aggregate judgement.
Meanwhile my own https://tools.simonwillison.net/colophon collection has grown to over 120 tools in just a year and a half, most of which I wouldn't have built at all - and that's a relatively small portion of what I've been getting done with LLMs elsewhere.
Hard to measure productivity on a "wouldn't exist" to "does exist" scale.
Every time you post about this stuff you get at least as much pushback as you get affirmation, and yet when you discuss anything related to peer responses, you never seem to mention or include any of that negative feedback, only the positive...
I don't get it, what are you asking me to do here?
You want me to say "this stuff is really useful, here's why I think that. But lots of people on the internet have disagreed with me, here's links to their comments"?
> my own https://tools.simonwillison.net/colophon collection has grown to over 120
What in the wooberjabbery is this even.
A list of single-commit, LLM-generated stuff. Vibe-coded shovelware like animated-rainbow-border [1] or unix-timestamp [2].
Calling these "tools" seems to be overstating it.
1: https://gist.github.com/simonw/2e56ee84e7321592f79ceaed2e81b...
2: https://gist.github.com/simonw/8c04788c5e4db11f6324ef5962127...
Cool right? It's my playground for vibe coded apps, except I started it nearly a year before the term "vibe coding" was introduced.
I wrote more about it here: https://simonwillison.net/2024/Oct/21/claude-artifacts/ - and a lot of them have explanations in posts under my tools tag: https://simonwillison.net/tags/tools/
It might also be the largest collection of published chat transcripts for this kind of usage from a single person - though that's not hard since most people don't publish their prompts.
Building little things like this is a really effective way of gaining experience using prompts to get useful code out of LLMs.
> Cool right?
Hundreds of single-commit, AI-generated trash apps along the lines of "make the css background blue".
On display.
Like it's something.
You can't be serious.
[flagged]
I've been using LLM-assistance for my larger open source projects - https://github.com/simonw/datasette https://github.com/simonw/llm and https://github.com/simonw/sqlite-utils - for a couple of years now.
Also literally hundreds of smaller plugins and libraries and CLI tools, see https://github.com/simonw?tab=repositories (now at 880 repos, though a few dozen of those are scrapers and shouldn't count) and https://pypi.org/user/simonw/ (340 published packages).
Unlike my tools.simonwillison.net stuff, the vast majority of those projects are covered by automated tests and usually have comprehensive documentation too.
What do you mean by my script?
The whole debate about LLMs and productivity consistently brings the "don't confuse movement with progress" warning to my mind.
But it was already a warning before LLMs because, as you wrote, people are bad at measuring productivity (among many things).
Another problem with it is that you could have said the same thing about virtually any advancement in programming over the last 30 years.
There have been so many "advances" in software development in the last decades - powerful type systems, null safety, sane error handling, Erlang-style fault tolerance, property testing, model checking, etc. - and yet people continue to write garbage code in unsafe languages with underpowered IDEs.
I think many in the industry have absolutely no clue what they're doing and are bad at evaluating productivity, often prioritising short-term delivery over long-term maintenance.
LLMs can absolutely be useful but I'm very concerned that some people just use them to churn out code instead of thinking more carefully about what and how to build things. I wish we had at least the same amount of discussions about those things I mentioned above as we have about whether Opus, Sonnet, GPT5 or Gemini is the best model.
> I wish we had at least the same amount of discussions about those things I mentioned above as we have about whether Opus, Sonnet, GPT5 or Gemini is the best model.
I mean, we do. I think programmers are more interested in long-term maintainable software than its users are. Generally that makes sense: a user doesn't really care how much effort it takes to add features or fix bugs; those are things that programmers care about. Moreover, the cost of mistakes in most software is so low that most people don't seem interested in paying extra for more reliable software. The few areas of software that require high reliability are the ones that are regulated, or are sold by companies that offer SLAs or other such reliability agreements.
My observation over the years is that maintainability and reliability are much more important to programmers who comment in online forums than they are to users. It usually comes from the pride programmers take in their work, but my observation is that there's little market demand for it.
Users definitely care about things like reliability when they're using actually important software (which probably excludes a lot of startup junk). They may not be able to point to what causes issues, but they obviously do complain when things are buggy as hell.
> I think programmers are more interested in long term maintainable software than its users are.
Please talk to your users
> who put effort into figuring out good workflows for coding with LLMs are deceiving themselves, and are effectively wasting their time.
It's quite possible you are. Do you have any hard data justifying the claims of "this works better", or is it just a soft fuzzy feeling?
> The same goes for managing people, and yet somehow many good engineering managers can judge if their team is performing well
It's actually really easy to judge if a team is performing well.
What is hard is finding what actually makes the team perform well. And that is just as much magic as "if you just write the correct prompt everything will just work"
---
wait. why are we fighting again? :) https://dmitriid.com/everything-around-llms-is-still-magical...
I'm not the OP and I'm not saying you are wrong, but I am going to point out that the data doesn't necessarily back up significant productivity improvements with LLMs.
In this video (https://www.youtube.com/watch?v=EO3_qN_Ynsk) they present a slide by the company DX that surveyed 38,880 developers across 184 organizations and found them claiming an average time savings of 4 hours per developer per week. So all of these LLM workflows are only making the average developer about 10% more productive (4 saved hours out of a 40-hour week), with a bunch of developers getting less. Few developers are attaining productivity higher than that.
In this video by Stanford researchers actively studying productivity using GitHub commit data from private and public repositories (https://www.youtube.com/watch?v=tbDDYKRFjhk), they make a few very important points:
1. They've found zero correlation between how productive respondents claim to be and how productive they actually measure as, meaning people are poor judges of their own productivity. This would refute the previous point I made, but only if you assume people are on average wildly more productive than they claim.
2. They have measured an actual increase in rework and refactoring commits in the repositories as AI tools became more widely used in those organizations. So even though teams can ship things faster, they are observing an increased number of pull requests that exist to fix those earlier pushes.
3. They have measured pretty good productivity gains on greenfield, low-complexity systems, but once you get towards higher-complexity systems or brownfield systems they measure much lower gains, and even negative productivity with AI tools.
This goes hand in hand with this research paper: https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o... in which experienced devs working on significant long-term projects lost productivity when using AI tools, while being convinced the AI tools were making them more productive.
Yes, all of these studies have flaws and nitpicks we could go over that I'm not interested in rehashing. However, there's a lot more data showing AI giving a very marginal productivity boost compared to what people claim than vice versa. I'm legitimately interested in other studies that can show significant productivity gains in brownfield projects.
So far I've found that the people who are hating on AI are stuck maintaining highly coupled code that they've invested a significant amount of mental energy internalizing. AI is bad on that type of code, and since they've invested so much energy in understanding the code, it ends up taking longer for them to load context and guide the AI than to just do the work themselves. Their code base is tightly coupled hot garbage, and rather than accept that the tools aren't working because of their own lack of architectural rigor, they just shit on the tools. This is part of the reason that the study of open source maintainers using Cursor didn't consistently produce improvement (also, Cursor is pretty mid).
https://www.youtube.com/watch?v=tbDDYKRFjhk&t=4s is one of the largest studies I've seen so far and it shows that when the codebase is small or engineered for AI use, >20% productivity improvements are normal.
On top of this, a lot of the "learning to work with LLMs" is breaking down tasks into small pieces with clear instructions and acceptance criteria. That's just part of working efficiently, but maybe people don't want to be bothered to do it.
Working efficiently as a team, perhaps, but during solo development this is unnecessary beyond what's needed to document the code.
Even this opens up a whole field of weird subtle workflow tricks people have, because people run parallel asynchronous agents that step on each other in git. Solo developers run teams now!
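To make "agents stepping on each other in git" concrete: one mitigation people describe is giving each agent its own git worktree, so concurrent checkouts and commits can't collide. A minimal sketch in Python - "my-agent" here is a hypothetical stand-in for whatever coding-agent CLI you actually run, not a real tool:

    # Isolate each parallel agent in its own git worktree so their
    # checkouts and commits don't clobber each other.
    import subprocess

    def run_agent_in_worktree(branch: str, task: str) -> subprocess.Popen:
        path = f"../agent-{branch}"
        # Create a fresh working copy on its own new branch for this agent.
        subprocess.run(["git", "worktree", "add", "-b", branch, path], check=True)
        # "my-agent" is hypothetical; substitute your actual agent CLI.
        return subprocess.Popen(["my-agent", "--task", task], cwd=path)

    agents = [
        run_agent_in_worktree("fix-auth-bug", "fix the login redirect bug"),
        run_agent_in_worktree("add-csv-export", "add CSV export to the report view"),
    ]
    for proc in agents:
        proc.wait()

You still have to review and merge each branch back yourself afterwards, which is exactly the "running a team" feeling.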
Really wild to hear someone say out loud "there's no learning curve to using this stuff".
The "learning curve" is reading "experts opinion" on the ever-changing set of magical rituals that may or may not work but trust us it works.
No, you do not need to trust anyone, you can just verify what works and what doesn't, it's very easy.
Indeed. And it's extremely easy to verify my original comment: https://news.ycombinator.com/item?id=44849887