> sounds like a lot of work and expense for something that is meant to make programming easier and cheaper.

Not if you are an AI gold rush shovel salesman.

From the article:

> I've run Claude Code workshops for over 100 engineers in the last six months

Yeah, my colleague recently said "hey, I've burnt through $200 in Claude in 3 days". And he was prompting by hand, 8 hrs/day at most. Imagine what would happen if the AI were doing the prompting.

I really like this analogy: AI is (or should be) like an exoskeleton, something that helps people do things. If you put your car in drive, step out, and go to sleep, by the next day it will have gotten farther. The question is whether it's still on the road.

Burnt through 4 Max x20 in a week here. Throughput isn't the bottleneck anymore. Review quality is. The 1-in-5 error rate in this thread matches my experience. More agents overnight just means more review tomorrow morning.

What moved the needle: capturing architectural context (ADRs, structured system prompts, skill files) that agents reference before making changes. Each session builds on prior decisions. The agent improves because the context compounds. Better context beat more parallelism every time.
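
To make "architectural context" concrete: an ADR in a setup like this is just a short markdown file the agent reads before touching code. The one below is invented for illustration, not from the article:

```markdown
# ADR 0002: Use JWT for session tokens

Status: accepted
Date: 2024-11-03

## Context
Services are stateless and horizontally scaled; sticky sessions are out.

## Decision
Issue short-lived JWTs signed with the gateway's key; refresh via /auth/refresh.

## Consequences
- No server-side session store to operate.
- Token revocation requires a denylist (tracked in a follow-up ADR).
```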

This matches what I've found running persistent agents. The compounding context is the whole game.

The pattern that works: treat your agent's workspace like infrastructure, not a scratch pad. ADRs, skill files, structured memory of past decisions - all of it becomes the equivalent of institutional knowledge that a senior engineer carries in their head. Except it survives session restarts.
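
Roughly, the workspace I'm describing looks like this (the layout and names are mine, not a standard):

```
agent-workspace/
├── CLAUDE.md                 # structured system prompt, loaded every session
├── adr/                      # one decision record per file
│   ├── 0001-postgres.md
│   └── 0002-jwt-sessions.md
├── skills/
│   └── release-checklist.md  # procedures the agent can be pointed at
└── memory/
    └── decisions.jsonl       # structured log that past sessions append to
```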

The article's TDD framing gets at something important too. The acceptance criteria aren't just verification - they're context. When you write "after 5 failed attempts, login blocked for 60 seconds" before the agent touches code, you've constrained the solution space dramatically. The agent isn't guessing what you want anymore.
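
As a sketch of what that looks like in practice, here's the quoted criterion as an executable test, with a minimal made-up implementation (`LoginLimiter` is my name for it, not the article's):

```python
# Acceptance criterion as code: "after 5 failed attempts,
# login blocked for 60 seconds". Hypothetical, for illustration.
import time


class LoginLimiter:
    """Minimal in-memory sketch of the behavior the spec pins down."""
    MAX_FAILURES = 5
    BLOCK_SECONDS = 60

    def __init__(self):
        self._failures = {}    # user -> consecutive failure count
        self._blocked_at = {}  # user -> time the block started

    def record_failure(self, user, now=None):
        now = time.monotonic() if now is None else now
        self._failures[user] = self._failures.get(user, 0) + 1
        if self._failures[user] >= self.MAX_FAILURES:
            self._blocked_at[user] = now

    def is_blocked(self, user, now=None):
        now = time.monotonic() if now is None else now
        started = self._blocked_at.get(user)
        if started is None:
            return False
        if now - started >= self.BLOCK_SECONDS:
            del self._blocked_at[user]  # block expired, reset the counter
            self._failures[user] = 0
            return False
        return True


def test_block_after_five_failures():
    limiter = LoginLimiter()
    for _ in range(5):
        limiter.record_failure("alice", now=0)
    assert limiter.is_blocked("alice", now=1)       # inside the 60s window
    assert not limiter.is_blocked("alice", now=61)  # window elapsed
```

Write that before the agent sees the codebase and the solution space collapses to implementations that pass it.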

Where I think the article undersells the problem: spec misunderstandings compound too. If your architectural context has a wrong assumption baked in, every agent session inherits that assumption. You need periodic human review of the context itself, not just the outputs. The ADRs need auditing the same way code does.

I've been exploring ways to track decisions and making some interesting findings, at the homelab scale at least: https://github.com/safety-quotient-lab/psychology-agent

The cognitive architecture for the LLM, so to speak, can make a huge difference - triggers and skills go a long way when combined with shell scripts that dual-write.
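
By "dual-write" I mean recording each decision to both a structured store and a human-readable log in one step. A minimal sketch of the idea, in Python rather than shell for readability, with file names that are my assumptions and not from the repo above:

```python
# Dual-write sketch: one entry point appends each decision both as
# structured JSONL (for a later agent session to parse) and as a line
# in a markdown log (for the human auditing the context itself).
import datetime
import json
import pathlib

DECISIONS_JSONL = pathlib.Path("memory/decisions.jsonl")  # assumed path
DECISIONS_MD = pathlib.Path("memory/DECISIONS.md")        # assumed path


def record_decision(title: str, rationale: str) -> None:
    DECISIONS_JSONL.parent.mkdir(parents=True, exist_ok=True)
    entry = {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "title": title,
        "rationale": rationale,
    }
    # Structured copy: machine-readable, append-only.
    with DECISIONS_JSONL.open("a") as f:
        f.write(json.dumps(entry) + "\n")
    # Human copy: what you skim when reviewing the context.
    with DECISIONS_MD.open("a") as f:
        f.write(f"- {entry['ts']} **{title}**: {rationale}\n")


if __name__ == "__main__":
    record_decision("Use JWT sessions", "Stateless auth across services")
```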

This comment reads very strongly like it was written by an LLM.

Your sibling even more so.

[deleted]

Agreed. The spec file is context. Writing acceptance criteria before you prompt gives the agent the context it needs to not go off in the wrong direction. Human leverage has just moved up the stack, and the plan/spec is now the most important step.

Parallelism on top of bad context just gets you more wrong answers faster

Sorry, but isn't the bottleneck then simply having relevant things to do? How big a qualified backlog do you have that your pipeline doesn't run dry?

Reminds me of when I was looking for Obsidian note management workflows and every single person who posted about theirs used it to take notes on... note taking workflows.

Bingo.