Models aren’t reliable, and it’s a bottleneck.
My solution was to write code to force the model down a deterministic path.
It’s open source here: https://codeleash.dev
It’s working! A ~200k LOC Python/TypeScript codebase, built from scratch as I’ve grown out the framework. I probably wrote 500-1000 lines of that, so ~99.5% was written by Claude Code. I commit 10k-30k LOC per week, code-reviewed and industrial-strength quality (mainly thanks to rigid TDD).
I review every line of code, but the TDD enforcement and self-reflection have now put both the process and the continual improvement of that process more or less on autopilot.
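To make "TDD enforcement" concrete, here is a minimal sketch of what such a gate could look like. It is not taken from the framework: the `TddGate` class, the `Phase` names, and the `tests/` path convention are all hypothetical, chosen only to illustrate forcing an agent through a red-then-green cycle.

```python
from enum import Enum


class Phase(Enum):
    RED = "red"      # a failing test must be written next
    GREEN = "green"  # production code may be changed to make it pass


class TddGate:
    """Hypothetical gate that rejects edits which skip the failing-test step."""

    def __init__(self) -> None:
        self.phase = Phase.RED

    def may_edit(self, path: str) -> bool:
        # Assumption: test files live under tests/, everything else is production code.
        is_test = path.startswith("tests/")
        # In RED only test files may change; in GREEN only production files.
        return is_test if self.phase is Phase.RED else not is_test

    def record_test_run(self, passed: bool) -> None:
        if self.phase is Phase.RED and not passed:
            # The new test was seen failing, so implementation is unlocked.
            self.phase = Phase.GREEN
        elif self.phase is Phase.GREEN and passed:
            # The suite is green again, so the next change needs a new test.
            self.phase = Phase.RED
```

A wrapper around the agent's file-edit tool would call `may_edit` before each write and `record_test_run` after each test run, so the model cannot touch production code until a test has been observed to fail.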
It’s a software factory - I don’t build software any more, I walk around the machine with a clipboard optimizing and fixing constraints. My job is to input the specs and prompts and give the factory its best chance of producing a high quality result, then QA that for release.
I keep my operational burden minimal by using managed platforms - more info in the framework.
One caveat: I am a solo dev; my cofounder isn’t writing code. So I can’t speak to how it is to be in a team of engineers with this stuff.
My most productive day last week was a net of -10k lines (yes, minus ten thousand).
No AI used.
Congratulations, honestly, but I would not do that for a job.
Metaphorically speaking, you’re out there sprinting on the road while people who’ve made agentic coding work for them are sipping coffee in a limo.
People who haven’t made agentic coding work (but do it anyway) are sipping coffee in the back of a limo that has no brakes. No thanks to that.
You have a 200K LOC repository and you haven’t written 99.5% of it?
It was generated for me in accordance with the architecture and constraints I defined for the agent, and I’ve reviewed every line.
TDD really is that good.
How many pages of architecture / constraints did you write? I guess I’m curious what type of text input renders 200K lines of code output. It must be a similar level of tokens in just docs / prompting. Have you verified all of that? Was that AI generated?
Would be very interested to see whether it’s not just… regular LLM snowballing a paragraph into 12 pages of “technical design documents” and 10K lines of code. Not sure what kind of niche you’re in or what the business logic is, but it sounds to me like you’ve built a machine that… generates code you don’t need to look at??
There was a 200-word architecture doc that lasted about 3 weeks before it drifted, so it got deleted. I no longer keep architecture docs - tests and code are enough for the agent to answer questions when we have them.
Probably wrote 2000+ words of prompts per day to the agent, Monday to Friday, for like 9 months. Dozens to hundreds of prompts a day back and forth with anywhere from 1-7 concurrent agents at a time.
This is not something anyone would ever one-shot. There are thousands of commits. My commit log looks like a normal squash-merge-to-main-and-deploy workflow.