LLMs are at their best when you have an expectation for their output. I generally know the shape of the correct response, and that allows me to evaluate its output on its "vibes" rather than line by line. If there's no expectation, I have to take everything at face value, and then I'm at the mercy of the machine.

Exactly. If I generate a large chunk of software, I'm going to have expectations about what it will do, how it will do it, etc. You don't just accept the statement that "it's done" as fact; you start looking for evidence.

A scientific approach here is to try to falsify the statement. You start asking questions and running tests and experiments to prove the notion that it is done wrong. At some point you run out of such tests, and it's probably done for some useful notion of done-ness.
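To make that concrete with a hypothetical sketch in Rust: property-based tests are a cheap way to hunt for counterexamples rather than confirm the demoed happy path. Everything below is made up for illustration - `parse_kv` stands in for whatever the model claims is finished, and the test assumes the proptest crate as a dev-dependency.

```rust
use proptest::prelude::*;
use std::collections::BTreeMap;

/// Hypothetical stand-in for agent-written code: parses "k=v;k=v" pairs.
fn parse_kv(s: &str) -> Option<BTreeMap<String, String>> {
    s.split(';')
        .filter(|part| !part.is_empty())
        .map(|part| {
            part.split_once('=')
                .map(|(k, v)| (k.to_string(), v.to_string()))
        })
        .collect()
}

proptest! {
    // "It's done" implies, at minimum, that encode -> parse round-trips.
    // One counterexample falsifies the claim; thousands of passing cases
    // merely fail to falsify it.
    #[test]
    fn encode_parse_roundtrips(
        pairs in proptest::collection::btree_map("[a-z]{1,8}", "[a-z]{0,8}", 0..8)
    ) {
        let encoded = pairs
            .iter()
            .map(|(k, v)| format!("{k}={v}"))
            .collect::<Vec<_>>()
            .join(";");
        prop_assert_eq!(parse_kv(&encoded), Some(pairs));
    }
}
```

If a counterexample exists, proptest will shrink it to a minimal failing input; each such failure is direct evidence against "it's done."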

I've built some larger components and things with AI. It's never a one-shot kind of deal. But the good news is that you can use more AI to do a lot of the evaluation work. And if you align your agents right, the process almost runs itself. Mostly I just nudge it along. "Did you think about X? What about Y? Let's test Z"

> Mostly I just nudge it along. "Did you think about X? What about Y? Let's test Z"

Exactly - you need to keep your sceptic's glasses on constantly, and you need to be exacting about the structure you want things to follow. Having and enforcing "taste" is important, and you need to be willing to spend time on that phase because the quality of the payoff depends entirely on it.

I recently planned a major refactor. The discussion with Claude went on for almost two days. The actual implementation was done in 10 minutes. It has probably made some mistakes that I will have to check for during review, but given the level of detail that plan document had, it is certainly 90-95% there. After pouring in that much opinion, it is a fairly good representation of what I would have written, while still being faster than me doing everything by hand.

So you have to know the answer and also be an expert in the problem domain?

In my experience you need exactly what you said, and I would add that he probably would have spent half a day doing the refactoring himself, and he would have been sure he did it right.

I can speak to building large-scale systems from scratch with these tools. I've been working since late last year on a project that started as barely a tech demo, and over the course of development I've gone from leveraging Copilot autocomplete at the start to full-on vibecoding 100% of the new additions.

I have reasonable eng chops, I'd like to think - I've been a senior IC for a while across a reasonably diverse set of challenging systems problems, and I've built some pretty large-scale pieces of software the old "artisanal" way.

This particular project is a productization of some ideas I had for leveraging a virtual machine to execute high-divergence parallel logic on GPUs, in an effort to move complex things like unit behaviour in games (the classical symbolic kind, not NN-based behaviour) onto the GPU. The project is going well but still quite a ways from release. It's at about 300k lines of code now across 9 or so Rust repositories, plus a smattering of TypeScript on the frontend.

I have had stumbles, but overall I feel I have put together some good strategies and principles for pushing large projects along with these tools in an effective way.

The biggest takeaway for me is that the "feel" is different. Software construction by hand felt like building legos where you put the pieces together yourself. A lot of my focus would be on building and solidifying core components so I could rely on them when I stepped up to build higher-level components. Projects would get mired quickly if you didn't solidify your base.

With agentic development, one of the early challenges I ran into was something I'll call "oversight inception". It's when, at some early point in the process, a somewhat low-importance decision is made - an implementation decision, say a decision to align a test with the implementation rather than the implementation with the test.

Then, as you build more on top of this, that small decision somehow ends up getting reified into a core architectural policy that then cascades up.

You realize that when you're building a big project, your focus on any particular component is backstopped by a general understanding of local development directionality with respect to the larger project. The agent has no such sense of directionality.

So small chinks in the design end up getting magnified as the dev process proceeds, and later, on review, you find major architectural pieces have just been overlooked, all flowing from some small incidental implementation choice made long before.

This is one among a number of issues, but it's a big one. Once I saw it happening, I tried mitigating it by developing a set of golden "goal" documents that describe directionality at the project level: what you are working towards and what design components need to exist.

This doesn't eliminate the "oversight inception" issue, but it does catch instances of it earlier.
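For illustration only - the actual documents aren't shown here, and every name below is hypothetical - a goal document can be as small as this and still give an agent something to reconverge against:

```
GOAL: simulation core (north-star document)

Non-negotiable direction:
- Unit behaviour executes in the VM; gameplay rules never live in shaders.
- Tests define contracts; implementations bend to tests, never the reverse.
- Host<->GPU data layout is owned by one crate; no ad-hoc mirror structs.

Every change must answer:
1. Which goal above does it serve?
2. Does it move a contract? If so, stop and re-plan before coding.
```

The format matters less than having a fixed artefact you can point the agent back at when a local decision starts to harden into policy.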

When I started applying the goal documentation aggressively to re-align the project implementation direction, I found velocity dropped a lot.

And as I progress, I'm balancing this out - allowing the system to diverge a little, but forcing reconvergence towards the goals at some specific cadence. I haven't found the right cadence yet, but I'm getting there.

This new style of development feels more like moulding clay into pottery than assembling lego. You sort of "get it into shape". It's a very interesting new set of process assumptions.

I agree, but I would add that they can be very useful even if you do not have clear expectations, as long as you have some solid way to verify their claims. Often, in doing this verification, I've come up with new ideas.