I've noticed a huge gap between AI use on greenfield projects and brownfield projects. On the first day of a greenfield project I can accomplish a week of work. By the second day I can accomplish a few days' worth. By the end of the first week I'm down to a 20% productivity gain.

I think AI is just allowing everyone to speed-run the innovator's dilemma. Anyone can create a small version of anything, while big orgs will struggle to move as quickly as before.

The interesting bit is going to be whether we see AI being used to mature those small systems into big complex ones that account for the edge cases, meet all the requirements, scale as needed, etc. That's hard for humans to do, particularly while still moving. I've not seen any of this from AI yet outside of either a) very directed small changes to large complex systems, or b) plugins/extensions/etc. along a well-defined set of rails.

Enterprise IT dinosaur here, seconding this perspective and the author’s.

When I needed to bash out a quick HashiCorp Packer build file without prior experience beyond a bit of Vault and Terraform, local AI was a godsend at getting me 80% of the way there in seconds. I could read it, edit it, test it, and move much faster than Packer’s own thin “getting started” guide allowed. The net result was going from zero prior knowledge to a hardened OS image and a repeatable pipeline in under a week.

On the flip side, asking a chatbot about my GPOs? Or trusting it to change network firewalls and segmentation rules? Letting it run wild in the existing house of cards at the core of most enterprises? Absolutely hell no the fuck not. The longer something exists, the more likely a chatbot is to fuck it up by simple virtue of how they’re trained (pattern matching and prediction) versus how infrastructure ages (the older it is or the more often it changes, the less likely it is to be predictable), and I don’t see that changing with LLMs.

LLMs really are a game changer for my personal sales pitch of being a single dinosaur army for IT in small to medium-sized enterprises.

Yeah, I use it to get some basic info about topics I know little about (as Google search is getting worse by the day...), which I then check.

Honestly, the absolute revolution for me would be if someone managed to make an LLM say "sorry, I don't know enough about the topic". One time I made a typo in a project name I wanted some info on, and it outright invented commands and usages out of thin air (which also differed from the project I was looking for, so it didn't "correct the typo")...

> Honestly, the absolute revolution for me would be if someone managed to make an LLM say "sorry, I don't know enough about the topic"

https://arxiv.org/abs/2509.04664

According to that OpenAI paper, models hallucinate in part because they are optimized on benchmarks that involve guessing. If you make a model that refuses to answer when unsure, you will not get SOTA performance on existing benchmarks and everyone will discount your work. If you create a new benchmark that penalizes guessing, everyone will think you are just creating benchmarks that advantage yourself.

That is such a cop-out. If there were a really good benchmark for getting rid of hallucinations, it would be included in every eval comparison graph.

The real reason is that every benchmark I've seen shows Anthropic with lower hallucination rates.

...or they hallucinate because of floating point issues in parallel execution environments:

https://thinkingmachines.ai/blog/defeating-nondeterminism-in...

Holy perverse incentives, Batman

> LLMs really are a game changer for my personal sales pitch of being a single dinosaur army for IT in small to medium-sized enterprises.

This is essentially what I'm doing too, though presumably in a different country. I'm finding it incredibly difficult to successfully speak to people. How are you making headway? I'm very curious how you're leveraging AI in messaging to clients/prospective clients without it just coming across as "I farm out work to an AI and yolo".

Edit - if you don't mind sharing, of course.

Oh, I should be more clear: I only use local models to help me accelerate non-production work. I never use it for communications, because A) I want to improve my skills there, and B) I don't want my first (or any) outreach to another human to be a bot.

Mostly it's been an excellent way to translate vocabulary between products or technologies for me. When I'm working on something new (e.g., Hashicorp Packer) and lack the specific vocabulary, I may query Qwen or Ministral with what I want to do ("Build a Windows 11 image that executes scripts after startup but before sysprep"), then use its output as a starting point for what I actually want to accomplish. I've also tinkered with it at home for writing API integrations or parsing JSON with RegEx for Home Assistant uses, and found it very useful in low-risk environments.
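
For the low-risk home tinkering, a sketch of the sort of thing I mean, pulling one sensor reading and parsing the JSON in Python rather than with regex (the URL, entity id, and token are placeholders, and I'm assuming Home Assistant's standard REST /api/states endpoint):

```python
# Placeholder sketch: fetch one Home Assistant sensor state over the REST API
# and parse the JSON response. URL, entity id, and token are not real values.
import json
import urllib.request

HA_URL = "http://homeassistant.local:8123"
ENTITY = "sensor.living_room_temperature"   # hypothetical entity id
TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"

req = urllib.request.Request(
    f"{HA_URL}/api/states/{ENTITY}",
    headers={"Authorization": f"Bearer {TOKEN}"},
)
with urllib.request.urlopen(req, timeout=10) as resp:
    payload = json.load(resp)

# Print the current state plus its unit, if the sensor reports one.
print(payload["state"], payload["attributes"].get("unit_of_measurement", ""))
```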

Thus far, they don't consistently spit out functional code. I still have to do a back-and-forth to troubleshoot the output and make it secure and functional within my environments, and that's fine - it's how I learn, after all. When it comes to, say, SQL (which I understand conceptually, but not necessarily specifically), it's a slightly bigger crutch until I can start running on my own two feet.

Still cheaper than a proper consultant or SME, though, and for most enterprise workloads that's good (and cheap) enough once I've sanity checked it with a colleague or in a local dev/sandbox environment.

I interpreted his statement as LLMs being valuable for the actual marketing itself.

Which local AI do you use? I am local-curious, but don’t know which models to try, as people mention them by model name much less than their cloud counterparts.

I'm frequently rotating and experimenting specifically because I don't want to be dependent upon a single model when everything changes week-to-week; focusing on foundations, not processes. Right now, I've got a Ministral 3 14B reasoning model and Qwen3 8B model on my Macbook Pro; I think my RTX 3090 rig uses a slightly larger parameter/less quantized Ministral model by default, and juggles old Gemini/OpenAI "open weights" models as they're released.

I let Claude configure and set up entire systems now. It requires some manual auditing and steering once in a while, but managing bare-bones servers without any management software has become pretty feasible and cheap. I managed to configure a 50+ Debian server cluster simultaneously with just ssh and Claude. Yes, it's cowboy 3.0. But so are our products/sites.
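
To be clear about what "just ssh" means here: it's roughly this kind of fan-out of an idempotent command over a host list. The hostnames and the command below are placeholders, not my actual setup:

```python
# Sketch of fanning one command out to a fleet over plain ssh, no management
# software. Hostnames and the command are placeholders for illustration.
import subprocess
from concurrent.futures import ThreadPoolExecutor

HOSTS = [f"node{i:02d}.example.internal" for i in range(1, 51)]
COMMAND = "sudo apt-get update -qq && sudo apt-get install -y -qq unattended-upgrades"

def run_on(host: str) -> tuple[str, int]:
    # BatchMode avoids interactive password prompts hanging the run.
    proc = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", host, COMMAND],
        capture_output=True, text=True, timeout=600,
    )
    return host, proc.returncode

with ThreadPoolExecutor(max_workers=10) as pool:
    for host, code in pool.map(run_on, HOSTS):
        status = "ok" if code == 0 else f"failed ({code})"
        print(f"{host}: {status}")
```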

When you use phrases like "managed to configure" to describe your production systems, it does not inspire confidence in long-term sustainability of those systems.

I've had Claude Code use ssh with root to deploy code, change configurations, and debug bad configs / SELinux policy / etc. Debugging servers is not that different from debugging code. You just need to give it a way to test.
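
"A way to test" can be as simple as a smoke-test script the agent runs after each change, something like this sketch (the hosts, ports, and service names are hypothetical):

```python
# Hypothetical smoke test an agent can run after each change: check that a few
# ports answer and a service unit is active over plain ssh. Names are made up.
import socket
import subprocess
import sys

CHECKS = [
    ("web-01.example.internal", 443),
    ("db-01.example.internal", 5432),
]

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def service_active(host: str, unit: str) -> bool:
    # Relies on plain ssh + systemctl, matching the no-management-software setup.
    result = subprocess.run(
        ["ssh", host, "systemctl", "is-active", "--quiet", unit],
        capture_output=True,
    )
    return result.returncode == 0

failures = [f"{h}:{p} closed" for h, p in CHECKS if not port_open(h, p)]
if not service_active("web-01.example.internal", "nginx"):
    failures.append("web-01 nginx inactive")

for failure in failures:
    print(failure)
sys.exit(1 if failures else 0)
```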

Isn't this true of any greenfield project, with or without generative models? The first few days are amazingly productive, and then features and fixes get slower and slower. And you get to see how good an engineer you really are, as your initial architecture starts straining under the demands of changing real-world requirements and you hope it holds together long enough to ship something.

"I could make that in a weekend"

"The first 80% of a project takes 80% of the time, the remaining 20% takes the other 80% of the time"

> Isn't this true of any greenfield project?

That is a good point and true to some extent. But IME with AI, both the initial speedup and the eventual slowdown are accelerated vs. a human.

I've been thinking that one reason is that while AI coding generates code far faster (on a greenfield project I estimate about 50x), it also generates tech debt at an astonishing rate.

It used to be that tech debt started to catch up with teams after a few years, but with AI-coded software it's only a few months in before tech debt is so massive that it slows progress down.

I also find that I can keep the tech debt in check by using the bot only as a junior engineer: I specify the architecture and the design precisely, down to object and function definitions, and only let the bot write individual functions, one at a time.
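
Concretely, the skeleton I hand over looks something like this sketch; the domain and names are invented for illustration, and the bot is only asked for one function body at a time:

```python
# Sketch of the "bot as junior engineer" workflow: the human writes the
# dataclasses, signatures, and docstring specs; the bot fills in one function
# body at a time. The invoicing domain and names here are invented.
from dataclasses import dataclass

@dataclass(frozen=True)
class LineItem:
    sku: str
    quantity: int
    unit_price: float

def subtotal(items: list[LineItem]) -> float:
    """Spec handed to the bot: sum quantity * unit_price over items."""
    # Body written by the bot, then reviewed by the human.
    return sum(item.quantity * item.unit_price for item in items)

def total_with_tax(items: list[LineItem], tax_rate: float) -> float:
    """Spec handed to the bot: subtotal(items) plus tax at tax_rate (0.2 = 20%).

    Must not mutate items; raise ValueError if tax_rate is negative.
    """
    raise NotImplementedError  # the next function the bot will be asked to write
```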

That is much slower, but also much more sustainable. I'd estimate my productivity gains are "only" 2x to 3x (instead of ~50x) but tech debt accumulates no faster than a purely human-coded project.

This is based on various projects only about one year into it, so time will tell how it evolves longer term.

In your experience, can you take the tech-debt-riddled code and ask Claude to come up with an entirely new version that fixes the tech debt/design issues you've identified? Presumably there's a set of tests you'd keep the same, but you could leverage the power of AI in greenfield scenarios to just do a rewrite (while letting it see the old code). I don't know how well this would work; I haven't got to the heavy tech debt stage in any of my projects, as I do mostly prototyping. I'd be interested in others' thoughts.

I built an inventory tracking system as an exercise in "vibe coding" recently. I built a decent spec in conversation with Claude, then asked it to build it. It was kind of amazing - in 2 hours Claude churned out a credible looking app.

It looked really good, but as I got into the details the weirdness really started coming out. There's huge functions which interleave many concepts, and there's database queries everywhere. Huge amounts of duplication. It makes it very hard to change anything without breaking something else.

You can of course focus on getting the AI to simplify and condense. But that requires a good understanding of the codebase. Definitely no longer vibe-coded.

My enthusiasm for the technology has really gone in a wave. From "WOW" when it churned out 10k lines of credible looking code, to "Ohhhh" when I started getting into the weeds of the implementation and realising just how much of a mess it was. It's clearly very powerful for quick and dirty prototypes (and it seems to be particularly good at building decent CRUD frontends), but in software and user interaction the devil is in the details. And the details are a mess.

At the moment, good code structure for humans is good code structure for AIs and bad code structure for humans is still bad code structure for AIs too. At least to a first approximation.

I qualify that because hey, someone comes back and reads this 5 years later, I have no idea what you will be facing then. But at the moment this is still true.

The problem is, people see the AIs coding, I dunno, what, a hundred times faster minimum in terms of churning out lines? And it just blows out their mental estimation models, and they substitute an "infinity" for the capability of the models, either today or in the future. But they are not infinitely capable. They are finitely capable. As such they will still face many of the same challenges humans do... no matter how good they get in the future. Getting better will move the threshold, but it can never remove it.

There is no model coming that will be able to consume an arbitrarily large amount of code goop and integrate with it instantly. That's not a limitation of Artificial Intelligences, that's a limitation of finite intelligences. A model that makes what we humans would call subjectively better code is going to produce a code base that can do more and go farther than a model that just hyper-focuses on the short-term and slops something out that works today. That's a continuum, not a binary, so there will always be room for a better model that makes better code. We will never overwhelm bad code with infinite intelligence because we can't have the latter.

Today, in 2026, providing the guidance for better code is a human role. I'm not promising it will be forever, but it is today. If you're not doing that, you will pay the price of a bad code base. I say that without emotion, just as "tech debt" is not always necessarily bad. It's just a tradeoff you need to decide about, but I guarantee a lot of people are making poor ones today without realizing it, and will be paying for it for years to come no matter how good the future AIs may be. (If the rumors and guesses are true that Windows is nearly in collapse from AI code... how much larger an object lesson do you need? If that is their problem they're probably in even bigger trouble than they realize.)

I also don't guarantee that "good code for humans" and "good code for AIs" will remain as aligned as they are now, though it is my opinion we ought to strive for that to be the case. It hasn't been talked about as much lately, but it's still good for us to be able to figure out why a system did what it did and even if it costs us some percentage of efficiency, having the AIs write human-legible code into the indefinite future is probably still a valuable thing to do so we can examine things if necessary. (Personally I suspect that while there will be some efficiency gain for letting the AIs make their own programming languages that I doubt it'll ever be more than some more-or-less fixed percentage gain rather than some step-change in capability that we're missing out on... and if it is, maybe we should miss out on that step-change. As the moltbots prove that whatever fiction we may have told ourselves about keeping AIs in boxes is total garbage in a world where people will proactively let AIs out of the box for entertainment purposes.)

Perhaps it depends on the nature of the tech debt. A lot of the software we create has consequences beyond a particular codebase.

Published APIs cannot be changed without causing friction on the client's end, which may not be under our control. Even if the API is properly versioned, users will be unhappy if they are asked to adopt a completely changed version of the API on a regular basis.

Data that was created according to a previous version of the data model continues to exist in various places and may not be easy to migrate.

User interfaces cannot be radically changed too frequently without confusing the hell out of human users.

> ask Claude to come up with an entirely new version that fixes the tech debt/design issues you've identified?

I haven't tried that yet, so not sure.

Once upon a time I was at a company where the PRD specified that the product needed a toggle to enable a certain feature temporarily. Engineering implemented it literally, and it worked perfectly. But it was vital to be able to disable the feature, which should've been obvious to anyone. Since the PRD didn't mention that, it was not implemented.

In that case, it was done as a protest. But AI is kind of like that, although out of sheer dumbness.

The story is meant to say that with AI it is imperative to be extremely prescriptive about everything, or things will go haywire. So doing a full rewrite will probably work well only if you manage to have very tight test coverage for absolutely everything. Which is pretty hard.

Take Claude Code itself. It's got access to an endless amount of tokens and many (hopefully smart) engineers working on it and they can't build a fucking TUI with it.

So, my answer would be no. Tech debt shows up even if every single change made the right decisions, and this type of holistic view of projects is something AIs absolutely suck at. They can't keep all that context in their heads, so they are forever stuck in a local maximum. That has been my experience, at least. Maybe it'll get better... any day now!

> Isn't this true of any greenfield project?

Sometimes the start of a greenfield project has a lot of questions along the lines of "what graph plotting library are we going to use? we don't want two competing libraries in the same codebase so we should check it meets all our future needs"

LLMs can select a library and produce a basic implementation while a human is still reading reddit posts arguing about the distinction between 'graphs' and 'charts'.
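
For example, a plausible first pass an LLM might hand back, assuming it picked matplotlib (the function and file names here are just illustrative):

```python
# Illustrative first-pass plotting helper, assuming matplotlib was the library
# the LLM selected. Function and file names are made up for the example.
import matplotlib.pyplot as plt

def plot_series(xs: list[float], ys: list[float], title: str, out_path: str) -> None:
    """Render a simple line chart and save it to disk."""
    fig, ax = plt.subplots()
    ax.plot(xs, ys)
    ax.set_title(title)
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    fig.savefig(out_path)
    plt.close(fig)

plot_series([0, 1, 2, 3], [0.0, 1.0, 4.0, 9.0], "quick look", "chart.png")
```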

From personal experience I'd like to add that the last 5% takes 95% of the time, at least if you are working on a makeover of an old legacy system.

I find AI great for just greasing the wheels, like if I’m overthinking on a problem or just feel too tired to start on something I know needs doing.

The solutions also help me combat my natural tendency to over-engineer.

It’s also fun getting ChatGPT to quiz me on topics.

I have experienced much of the opposite. With an established code base to copy patterns from, AI can generate code that needs a lot less iteration to clean up than on greenfield projects.

I solve this problem by pointing Claude at existing code bases when I start a project, and tell it to use that approach.

That's a fair observation, there's probably a sweet spot. The difference I've found is that I can reliably keep the model on track with patterns through prompting and documentation if the code doesn't have existing examples, whereas I can't document every single nuance of a big codebase and why it matters.

Yeah, my observation is that for my usual work I can maybe get a 20% productivity boost, probably closer to 10% tbh, and for the whole team the overall productivity gain feels like nothing, as seniors use their small productivity gains to fix the tons of issues in PRs (or in prod when we miss something).

But last week I had two days with no real work to do, so I created CLI tools to help with organisation and cleanup. I think AI boosted my productivity by at least 200%, if not 500%.

It seems to be fantastic up to about 5k loc and then it starts to need a lot more guidance, careful supervision, skepticism, and aggressive context management. If you’re careful, it only goes completely off the rails once in a while and the damage is only a lost hour or two.

Overall it’s still a 4x productivity gain, though, so I’m not complaining for $20 a month. It’s especially good at managing the complicated aspects of C so I can focus on the bigger picture rather than the symbol contortions.

Yes, I see the same thing. My working thesis is that if I keep the codebase modular with clear separations, so that I hold the entire context while Claude Code only needs to focus on one module at a time, I can keep up the speed and quality. But if I try to give it tasks that span the entire codebase, it will have issues no matter how I manage context and give directions. And again, this is not surprising; humans do the same, they need to break the task apart into smaller pieces. Have you found the same?

Yes. Spot on. The good thing is that it makes better code if modularity is strict as well.

I’m finding that I am breaking projects down into clear separations of concerns and designing inviolate API walls between modules, where before I might have reached into the code with less clearly defined internal vs external functions.

Exercising solid boundaries and being maniacal about the API surface is also really liberating personally, less cognitive load, less stress, easier tests, easier debugging.
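
As a toy sketch of what I mean by an API wall: one module exposes a deliberately tiny public surface and keeps everything else private (the module and names are hypothetical):

```python
# inventory/api.py -- hypothetical module illustrating a strict API wall:
# a tiny explicit public surface, with internal state never touched from outside.
__all__ = ["reserve_stock", "release_stock"]

_reservations: dict[str, int] = {}  # internal state, private to this module

def reserve_stock(sku: str, quantity: int) -> bool:
    """Public entry point: record a stock reservation if the request is valid."""
    if quantity <= 0:
        return False
    _reservations[sku] = _reservations.get(sku, 0) + quantity
    return True

def release_stock(sku: str, quantity: int) -> None:
    """Public entry point: release part or all of a previous reservation."""
    current = _reservations.get(sku, 0)
    _reservations[sku] = max(0, current - quantity)
```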

Of course none of this is new, but now we can do it and get -more- done in a day than if we don’t. Building in technical debt no longer raises productivity, it lowers it.

If you are a competent engineer, AI can drastically improve both code quality and productivity, but you have to be capable of cognitively framing the project in advance (which can also be accelerated with AI). You need to work as an architect more than a coder.

It’s fantastic to be able to prototype small to medium complexity projects, figure out what architectures work and don’t, then build on a stable foundation.

That’s what I’ve been doing lately, and it really helps get a clean architecture at the end.

I’ve done this in pure Python for a long time. Single file prototype that can mostly function from the command line. The process helps me understand all the sub problems and how they relate to each other. Best example is when you realize behaviors X, Y, and Z have so much in common that it makes sense to have a single component that takes a parameter to specify which behavior to perform. It’s possible that already practicing this is why I feel slightly “meh” compared to others regarding GenAI.
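
A stripped-down version of that pattern, with a single component and a parameter selecting the behavior (the modes and names are made up for illustration):

```python
# Single-file prototype sketch: one component, one parameter choosing among the
# overlapping behaviors, runnable from the command line. Modes are invented.
import argparse

def transform(values: list[float], mode: str) -> list[float]:
    """One component; 'mode' selects which of the related behaviors runs."""
    if mode == "normalize":
        peak = max(values, default=1.0) or 1.0
        return [v / peak for v in values]
    if mode == "cumulative":
        total, out = 0.0, []
        for v in values:
            total += v
            out.append(total)
        return out
    if mode == "delta":
        return [b - a for a, b in zip(values, values[1:])]
    raise ValueError(f"unknown mode: {mode}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--mode", choices=["normalize", "cumulative", "delta"], default="normalize")
    parser.add_argument("values", nargs="+", type=float)
    args = parser.parse_args()
    print(transform(args.values, args.mode))
```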

Similar experience. I love using Gemini to set up my home server, it can debug issues and generate simple docker compose files faster than I could have done myself. But at work on the 10 year old Rails app, I find it so much easier to just write all the code myself than to work out what prompt would work and then review/modify the results.

This makes me think how AI turns SW development upside down. In traditional development we write code, which is the answer to our problems. With AI we write questions and get the answers. Neither is easy: finding the correct questions can be a lot of work, whereas if you have some existing code you already have the answers, but you may not have the questions (= "specs") written down anywhere, at least not very well, typically.

At least in my experience the AI agents work best when you give them a description of a concrete code change like "Write a function which does this here" rather than vague product ideas like "The user wants this problem solved". But coming up with the prompts for an exact code change is often harder than writing the code.

This is where engineering practices help. Based on 1.5 years of data from my team, I see about a 30% performance increase on a mature system (about a 9-year-old code base), maybe more. The interesting part: LLMs are leverage; the better engineer you are, the more you benefit from them.

All of this speedrunning hits a wall at the context window. As long as the project fits into 200k tokens, you’re flying. The moment it outgrows that, productivity doesn’t drop by 20%, it drops to zero. You start spending hours explaining to the agent what you changed in another file that it has already forgotten. Large organizations win in the long run precisely because they rely on processes that don’t depend on the memory of a single brain, even an electronic one.

This reads as if written by someone who has never used these tools before. No one ever tries to "fit" the entire project into a single context window. Successfully using coding LLMs involves context management (some of which is now done by the models themselves) so that you can isolate the issues you're currently working on and get enough context to work effectively. Working on enormous codebases over the past two months, I have never had to remind the model what it changed in another file, because 1) it has access to git and can easily see what has changed, and 2) I work with the model to break down projects into pieces that can be worked on sequentially. And keep in mind, this is the worst this technology will ever be: it will only get larger context windows and better memory from here.

What are the SOTA methods for context management, assuming the agent runs its tool calls without any break? Do you flush GPU tokens / adjust KV caches when you need to compress context by summarizing or logging some part?

Everyone I know who is using AI effectively has solved for the context window problem in their process. You use design, planning and task documents to bootstrap fresh contexts as the agents move through the task. Using these approaches you can have the agents address bigger and bigger problems. And you can get them to split the work into easily reviewable chunks, which is where the bottleneck is these days.

Plus the highest end models now don’t go so brain dead at compaction. I suspect that passing context well through compaction will be part of the next wave of model improvements.

I call it the day50 problem; I coined that about a year ago. I've been building tools to address it since then. I quit the day job 7 months ago and have been doing it full time since.

https://github.com/day50-dev/

I have been meaning to put up a blog ...

Essentially there's a delta between what the human does and the computer produces. In a classic compiler setting this is a known, stable quantity throughout the life-cycle of development.

However, in the world of AI coding this distance increases.

There are various barriers, with labels like "code debt", where the line can cross. There are three mitigations now: start the lines closer together (the PRD is the current en vogue method), push out the frontier of how many shits someone gives (this is the TDD agent method), or try to bend the curve so it doesn't fly out so much (this is the coworker/colleague method).

Unfortunately I'm just a one-man show, so the fact that I was ahead and have working models to explain this brings no rewards because, you know, good software is hard...

I've explained this in person at SF events so many times (probably 40-50) that someone reading this might have actually heard it from me...

If that's the case, hi, here it is again.

I find that setting up proper structure while everything still fits in a single Claude Code context window, as well as splitting as much as possible into libraries, works pretty well for staving off that moment.

My observations match this. I can get fresh things done very quickly, but when I start getting into the weeds I eventually get too frustrated with babysitting the LLM to keep using it.

A friendly counterpoint: my test of new models and agentic frameworks is to copy one of my half a zillion old open source git repos for some usually very old project or experiment - I then see how effective the new infra is for refactoring and cleaning up my ancient code. After testing I usually update the old projects and I get the same warm fuzzy feeling as I do from spring cleaning my home.

I also like to generate greenfield codebases from scratch.

> Anyone can create a small version of anything

Yup. My biggest issue with designing software is usually designing the system architecture/infra. I am very opposed to just shoving everything onto AWS and calling it a day: you don't learn anything from that, cloud performance stinks for many things, and I don't want to get random 30k bills because I accidentally left some instance of something running.

AI sucks at determining what kind of infrastructure would be great for scenario X, since the cloud is the go-to solution for the lazy dev. I tried to get it to recommend a way to self-host stuff, but that's just a general security hazard.
