> Vibe coding would be catastrophic here. Not because the AI can't write the code - it usually can - but because the failure mode is invisible. A hallucinated edge case in a tax calculation doesn't throw an error. It just produces a slightly wrong number that gets posted to a real accounting platform and nobody notices until the accountant does their review.
How is that different from handwritten code? Sounds like stuff you deal with architecturally (auditability, review, rollback) and with tests.
It’s shocking to me that people even ask this type of question. How do you not see the difference between a machine that will hallucinate something random if it doesn’t know the answer vs a human that will logic through things and find the correct answer.
Because I've seen the results? Failure modes of LLMs are unintuitive, and their ability to grasp the big picture is limited (mostly by context, I'd say), but I find CC follows instructions better than 80% of the people I've worked with. And consider the mental stamina it would take to grok that much context even when you know the system, versus what these systems can do in minutes.
As for the hallucinations - you're there to keep the system grounded. Well, the compiler is, then the tests, then you. It works surprisingly well if you monitor the process and don't let the LLM wander off when it gets confused.
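To make that grounding chain concrete, here's a minimal sketch (the VAT rate, rounding rule, and golden values are all made up for illustration): every fact the LLM might hallucinate gets pinned by an executable check that a human verified independently.

```python
from decimal import Decimal, ROUND_HALF_UP

def vat(amount: Decimal, rate: Decimal = Decimal("0.20")) -> Decimal:
    """The LLM-written helper under review."""
    return (amount * rate).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

# Golden values worked out by hand from the (hypothetical) rules, so the
# suite fails loudly if the model hallucinated the rate or the rounding:
assert vat(Decimal("100.00")) == Decimal("20.00")
assert vat(Decimal("0.01")) == Decimal("0.00")  # 0.002 rounds down
assert vat(Decimal("0.03")) == Decimal("0.01")  # 0.006 rounds up
```

The compiler (or here, the interpreter) catches nonsense syntax; checks like these catch nonsense numbers; you catch the rest.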
Because humans also make stupid random mistakes, and if your test suite and defensive practices don't catch it, the only difference is the rate of errors.
It may be that you've done the risk management, and deemed the risk acceptable (accepting the risk, in risk management terms) with human developers and that vibecoding changes the maths.
But that is still an admission that your test suite has gaping holes. If that's been allowed to happen consciously, recorded in your risk register, and you all understand the consequences, that can be entirely fine.
But then the problem isn't vibe coding; it's a risk management choice you made to paper over test suite holes with an assumed level of human diligence.
> How do you not see the difference between a machine that will hallucinate something random if it doesn’t know the answer vs a human...
Your claim here is that humans can't hallucinate something random. Clearly they can and do.
> ... that will logic through things and find the correct answer.
But humans do not find the correct answer 100% of the time.
The way that we address human fallibility is to create a system that does not accept the input of a single human as "truth". Even these systems only achieve "very high probability" but not 100% correctness. We can employ these same systems with AI.
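A cheap, mechanical version of that idea, sketched with two invented compound-interest functions: require independently written implementations to agree before trusting either, so an error has to appear in both, in the same way, to slip through.

```python
import random

# Two independently written implementations of the same spec
# (both functions are illustrative stand-ins).
def interest_v1(principal: float, rate: float, years: int) -> float:
    return principal * (1 + rate) ** years

def interest_v2(principal: float, rate: float, years: int) -> float:
    result = principal
    for _ in range(years):
        result *= 1 + rate
    return result

# Cross-check on many random inputs; a disagreement flags a bug in
# at least one of the two authors' work.
random.seed(0)
for _ in range(1000):
    p = random.uniform(0, 1_000_000)
    r = random.uniform(0, 0.2)
    y = random.randint(0, 40)
    a, b = interest_v1(p, r, y), interest_v2(p, r, y)
    assert abs(a - b) <= 1e-9 * max(1.0, abs(a)), (p, r, y)
```

The same trick works whether the two authors are two humans, two models, or one of each.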
Almost all current software engineering practices and projects rely on humans doing ongoing "informal" verification. The engineers' knowledge is an integral part of it, and using LLMs exposes this "vulnerability" (if you want to call it that). Making LLMs usable would require such a degree of formalization (of which integration and end-to-end tests are a part) that entire software categories would become unviable. Nobody would pay for an accounting suite that cost 10-20x more.
Which, interestingly, is the meat of this article. The key points aren't that "vibe coding is bad" but that the design and experience of these tools is actively blinding and seductive in a way that impairs the ability to judge effectiveness.
Basically, instead of developers developing, they've been half-elevated to the management class, where they manage really dumb but really fast interns (LLMs).
But they don't get the management pay, and they are 100% responsible for the LLMs under them, whereas real managers get paid more and can lay blame on and fire the people under them.
Humans who fail to do so find the list of tasks they’re allowed to do suddenly curtailed. I’m sure there is a degree of this with LLMs but the fanboys haven’t started admitting it yet.
> It’s shocking to me that people even ask this type of question. How do you not see the difference between a machine that will hallucinate something random if it doesn’t know the answer vs a human that will logic through things and find the correct answer.
I would like to work with the humans you describe who, implicitly from your description, don't hallucinate something random when they don't know the answer.
I mean, I only recently finished dealing with around 18 months of an entire customer service department full of people who couldn't comprehend that they'd put a non-existent postal address and the wrong person on the bills they were sending, that this was therefore their own fault the bills weren't getting paid, and that other people in their own team had already admitted this, apologised to me, and promised they'd fixed it, while actually still continuing to send letters to the same non-existent address.
Don't get me wrong, I'm not saying AI is magic (at best it's just one more pair of eyes no matter how many models you use), but humans are also not magic.
Humans are accountable to each other. Humans can be shamed in a code review and reprimanded and threatened with consequences for sloppy work. Most humans, once reprimanded, will not make the same kind of mistake twice.
> Humans can be shamed in a code review and reprimanded and threatened with consequences for sloppy work.
I had to not merely threaten to involve the Ombudsman, but actually involve the Ombudsman.
That was after I had already escalated several times and gotten as far as raising it with the Data Protection Officer of their parent company.
> Most humans, once reprimanded, will not make the same kind of mistake twice.
To quote myself:
> other people in their own team had already admitted this, apologised to me, promised they'd fixed it, while actually still continuing to send letters to the same non-existent address.
> How do you not see the difference between a machine that will hallucinate something random if it doesn’t know the answer vs a human that will logic through things and find the correct answer.
I see this argument over and over again when it comes to LLMs and vibe coding. Having worked in software for 20 years, I find it a laughable one. I am 100% certain that humans are at least as capable as LLMs of generating spaghetti code, bugs, and nonsensical errors.
It's shocking to me that people make this claim as if humans, especially in some legacy accounting system, would somehow be much better at (1) recognizing their mistakes, and (2) even when they don't, not fat-fingering their implementation. The criticisms of agents are valid, but the incredulity that they will ever be used in production or high-risk systems is, to me, just as incredible. Of course they will; where is Opus 4.6 compared to Sonnet 4? We've hit an inflection point where replacing hand coding with an agent and interacting only via prompt is not only doable, highly skilled people are already routinely doing it. Companies are already _requiring_ that people do it. We will soon hit another inflection point where the incredulity at using agents even in the highest-stakes applications will age really, really poorly. Let's see!
Your point is the speculative one, though. We know humans can and have built incredibly complex and reliable systems. We do not have the same level of proof for LLMs.
Claims like yours should wait at least 2-3 years, if not 5.
That is also speculative. Well, let's just wait and see :) but the writing is on the wall. If your criticism is about where we're at _now_, and whether or not _today_ you should be vibe coding in highly complex systems, I would say: why not? As long as you hold that code to the same standard as human-written code, what is the problem? If you say "well, reviews don't catch everything", OK, but the same is true for humans. Yes, large teams of people (and maybe smaller teams of highly skilled people) have built wonderfully complex systems far out of reach of today's coding agents. But your median programmer is not going to be able to do that either.
Your comment is shocking to me. AI coding works. I have seen it with my own eyes last week and today.
I can therefore only assume that you have not coded with the latest models. If your experience is with GPT-4o or earlier, or you have only used the mini or light models, then I can totally understand where you're coming from. Those models can do a lot, but they aren't good enough to run on their own.
The latest models absolutely are; I have seen it with my own eyes. AI moves fast.
I think the point he is trying to make is that you can't outsource your thinking to an automated process and also trust it to make the right decisions at the same time.
In places where a number, a fraction, or a non-binary outcome is involved, there is an aspect of growing the code base over time with human knowledge and failure.
You could argue that speed of writing code isn't everything; often being correct and stable is more important. For example, a banking app doesn't have to be written and shipped fast, but it has to be done right. ECG machines, money, and meatspace safety automation all come under this.
Replace LLM with employee in your argument - what changes? Unless everyone at your workplace owns the system they are working on - and that is a very high bar; maybe 50% of the devs I've worked with are capable of owning a piece of non-trivial code, especially if they didn't write it.
Reality is, you don't solve these problems by relying on everyone to be perfect - everyone slips up. To achieve results consistently you need processes and systems to assure quality.
Safety-critical systems should be even better equipped to adopt this, because they already have the systems to promote correct outputs.
The problem is that those systems weren't built for LLMs specifically, so the unexpected failure cases and the volume might not be a perfect fit - but then you work on adapting the quality control system.
>> Replace LLM with employee in your argument - what changes?
I mentioned this part in my comment. You cannot trust an automated process to do a thing and expect the same process to verify that it did it right. This applies to any automated process, not just code.
This is not the same as manufacturing, where you make the same part thousands of times. In code, the automated process makes a specific customised thing only once, and it has to be right.
>>The problem is those systems weren't built for LLMs specifically so the unexpected failure cases ...
We are not talking about failures. There is a space between success and failure that the LLM can slip into easily.
That's not what I get out of the comment you are replying to.
In the case being discussed here, one of code matching the tax code, perfection is likely possible; perfection is defined by the tax code. The SME on this should be writing the tests that demonstrate adherence to the tax code. Once they do that, it doesn't matter whether they, the AI, or a one-shot consultant writes the code, as far as correctness goes.
If the resulting AI code has subtle bugs that pass the tests, the SME likely didn't understand the corner cases of this part of the tax code as well as they thought, and quite possibly would have run into the same bugs themselves.
That's what I get out of what you are replying to.
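For concreteness, a sketch of "the SME's tests define correctness"; the brackets and rates below are invented, not any real tax code.

```python
# Hypothetical two-bracket schedule standing in for the real statute:
# 10% on income up to 10,000, 25% on the remainder.
def compute_tax(income: float) -> float:
    if income <= 10_000:
        return income * 0.10
    return 10_000 * 0.10 + (income - 10_000) * 0.25

# SME-authored cases, including the corner the (hypothetical) statute
# actually defines; any implementation, human or AI, must pass them.
CASES = [
    (0, 0.0),
    (10_000, 1_000.0),   # exactly at the threshold
    (10_001, 1_000.25),  # first unit of the higher bracket
    (50_000, 11_000.0),
]
for income, expected in CASES:
    assert abs(compute_tax(income) - expected) < 1e-9, income
```

If the SME never wrote the threshold case, a bug there could slip past regardless of who (or what) wrote `compute_tax` - which is the point about the corner cases above.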
With handwritten code, the humans know what they don’t know. If you want some constants or some formula, you don’t invent or guess it, you ask the domain expert.
Let's put it this way: the human author is capable of doing so. The LLM is not. You can cultivate the human to learn to think in this way. You can for a brief period coerce an LLM to do so.
Humans make such mistakes slowly. It's much harder to catch the "drift" introduced by an LLM because it happens so quickly and silently. By the time you notice something is wrong, it has already become the foundation for more code. You are then looking at a full rewrite.
The rate of the mistakes versus the rate of consumers and testers finding them was a ratio we could deal with, and we don't have the facilities to deal with the new ratio.
Over time, AI code will likely necessitate more elaborate canary systems that increase the cost per feature quite considerably, particularly for small and mid-sized orgs where those costs are difficult to amortize.
If the failure mode is invisible, that is a huge risk with human developers too.
Where vibecoding is a risk, it generally is a risk because it exposes a systemic risk that was always there but has so far been successfully hidden, and reveals failing risk management.
I agree, and it's strange that this failure mode continually gets lumped onto AI. The whole point of longer-term software engineering was to make it so that the context within a particular person's head should not impact the ability of a new employee to contribute to a codebase. It turns out everything we do to make sure that is the case for a human also works for an agent.
As far as I can tell, the only reason AI agents currently fail is that they don't have access to the undocumented context inside people's heads, and if we can just properly put that in text somewhere, there will be no problems.
The failure mode is getting lumped onto AI because AI is a lot more likely to fail.
We've done this with Neural Networks v1, Expert Systems, Neural Networks v2, SVMs, etc., etc.; it's only a matter of time before we figure it out with deep neural networks. We're clearly getting closer with every cycle, but there's no telling how many cycles we have left, because there is no sound theoretical framework.
At the same time, we have spent a large part of the existence of civilisation figuring out organisational structures and methods to create resilient processes using unreliable humans, and it turns out a lot of those methods also work on agents. People just often seem miffed that they have to apply them on computers too.
It doesn't seem obvious that it's a problem for LLM coders to write their own tests (if we assume that their coding/testing abilities are up to snuff), given human coders do so routinely.
This thread is talking about vibe coding, not LLM-assisted human coding.
The defining feature of vibe coding is that the human prompter doesn't know or care what the actual code looks like. They don't even try to understand it.
You might instruct the LLM to add test cases, and even tell it what behavior to test. And it will very likely add something that passes, but you have to take the LLM's word that it properly tests what you want it to.
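A toy illustration of that gap (the function and both tests are invented): every assertion below passes, but only by reading the test bodies can you tell that the first one pins nothing down.

```python
def apply_discount(price: float, pct: float) -> float:
    # The contract detail a model could get wrong: is "10% off"
    # pct=0.1 or pct=10?
    return price * (1 - pct)

# A test an LLM might generate: it exercises the code but asserts
# almost nothing, so it would still pass if the percent handling
# were wrong.
assert apply_discount(100.0, 0.1) is not None

# The test the prompter actually needed written:
assert abs(apply_discount(100.0, 0.1) - 90.0) < 1e-9  # pct is a fraction
```

A vibe coder who never reads the test file can't distinguish the two; both show up as green.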
The issue I have with using LLMs is reviewing the test code. Often the LLM will make a 30- or 40-line change to the application code, which I can easily review and comprehend. Then I have to look at the 400 lines of generated test code. While it may be easy to understand, there's a lot of it. Go through this cycle several times a day and I'm not convinced I'm doing a good review of the test code due to mental fatigue; who knows what I may be missing in the tests six hours into the work day?
> This thread is talking about vibe coding, not LLM-assisted human coding.
I was writing about vibe-coding. It seems these guys are vibe-coding (https://factory.strongdm.ai/) and their LLM coders write the tests.
I've seen this in action, though with dubious results: the coding (sub)agent writes tests, runs them (they fail), writes the implementation, runs the tests (repeating these last two steps until the tests pass), then says it's done. Next, the reviewer agent looks at everything and says "this is bad and stupid and won't work, fix all of these things", and the coding agent tries again with the reviewer's feedback in mind.
Models are getting good enough that this seems to "compound correctness", per the post I linked. It is reasonable to think this is going somewhere. The hard parts seem to be specification and creativity.
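The loop being described has roughly this shape; the "agents" below are trivial stubs, and in a real setup each call would be an LLM request.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    approved: bool
    feedback: str = ""

# Stub "agents"; in practice each of these is an LLM call.
def coder_implement(task: str, feedback: str = "") -> str:
    # Pretend the coder writes tests, iterates until they pass,
    # and returns the finished code.
    return f"code for {task!r}" + (" (revised)" if feedback else "")

def reviewer_review(code: str) -> Verdict:
    # Toy reviewer: reject anything that hasn't been revised once.
    if "revised" in code:
        return Verdict(approved=True)
    return Verdict(approved=False, feedback="fix these edge cases")

def run_task(task: str, max_rounds: int = 5) -> str:
    code = coder_implement(task)
    for _ in range(max_rounds):
        verdict = reviewer_review(code)
        if verdict.approved:
            return code
        code = coder_implement(task, verdict.feedback)
    raise RuntimeError("reviewer never approved; escalate to a human")
```

The bounded round count is the interesting design choice: without it, a confused coder and a picky reviewer can loop forever, and the "escalate to a human" branch is where specification and creativity re-enter.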
Maybe it’s just the people I’m around but assuming you write good tests is a big assumption. It’s very easy to just test what you know works. It’s the human version of context collapse, becoming myopic around just what you’re doing in the moment, so I’d expect LLMs to suffer from it as well.
> the human version of context collapse, becoming myopic around just what you’re doing in the moment
The setups I've seen use subagents to handle coding and review, separately from each other and from the "parent" agent that is tasked with implementing the thing. The parent agent just hands a task off to a coding agent whose only purpose is to do the task; the review agent reviews and goes back and forth with the coding agent until the review agent is satisfied. Coding agents don't seem likely to suffer from this particular failure mode.
I have zero issues with things going sideways on even the most complicated task. I don't understand why people struggle so much; it's easy to get it to do the right thing without having to hand-hold. You just need to be better at what you're asking for.
Not necessarily. Double entry bookkeeping catches errors in cases where an amount posted to one account does not have an equally offsetting post in another account or accounts (i.e., it catches errors when the books do not balance). It would not on its own catch errors where the original posted amount is incorrect due to a mistaken assumption, or if the offset balances but is allocated incorrectly.
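A tiny sketch of that limit, with invented accounts and amounts: the invariant double entry enforces passes for both a correct posting and a mis-allocated one.

```python
# The invariant double entry enforces: every journal entry sums to zero.
def balances(entry) -> bool:
    return sum(amount for _account, amount in entry) == 0

# Correct posting: 119 cash received = 100 revenue + 19 tax payable.
good = [("cash", 119), ("revenue", -100), ("tax_payable", -19)]

# Mis-allocated posting: still sums to zero, but tax is under-accrued.
bad = [("cash", 119), ("revenue", -104), ("tax_payable", -15)]

assert balances(good)
assert balances(bad)  # the books balance; the mistake is invisible here
```

Catching the second kind of error takes a check against something outside the entry itself (the invoice, the tax rate), which is exactly the review the accountant does.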
> The way that we address human fallibility is to create a system that does not accept the input of a single human as "truth".
I think you just rejected all user requirements and design specs.
Not sure how things work at your company, but I've never seen a design spec that doesn't have input from many humans in some form or another.
We're agreeing, I think.
I know right, had this discussion last week and it's difficult to argue when people are blinded by the "magic" and hype of the slot machine.
I'd say more importantly, vs. a human who, on failing to find an acceptable answer, says so.
bored at the weekend, are you sama?
I unfortunately am not the droid you are looking for, I don't know a sama
One major difference is the code has an owner who might consider what needs a test or ask questions if they don't understand.
To argue that all work is fungible because perfection cannot be achieved is actually a pretty out-there take.
Replace your thought experiment with "Is one-shot consultant code different from expert code?" Yes. They are different.
Code review is good and needed for human code, right? But if it's "vibe coded", suddenly it's not important? The differences are clear.
> With handwritten code, the humans know what they don’t know.
I find this often not to be the case at all.
True, but IMO irrelevant. "What could have been" (capabilities) is just another "if only..."
Or maybe this is a SaaS opportunity for someone.
>the failure mode is invisible
Only if you are missing tests for what counts for you. And that's true for both dev-written code, and for vibed code.
Who writes the tests?
It doesn't seem obvious that it's a problem for LLM coders to write their own tests (if we assume that their coding/testing abilities are up to snuff), given human coders do so routinely.
This thread is talking about vibe coding, not LLM-assisted human coding.
The defining feature of vibe coding is that the human prompter doesn't know or care what the actual code looks like. They don't even try to understand it.
You might instruct the LLM to add test cases, and even tell it what behavior to test. And it will very likely add something that passes, but you have to take the LLM's word that it properly tests what you want it to.
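To make that concrete, here's a made-up sketch (the tax function, rates, and bracket are all hypothetical) of a test that passes while never touching the case that matters:

```python
# Hypothetical tax helper as an LLM might write it.
# Intended rule for this example: income strictly ABOVE 50_000 is taxed at 22%.
def tax(income: float) -> float:
    # Subtle bug: ">=" instead of ">" puts exactly 50_000 in the wrong bracket.
    return income * (0.22 if income >= 50_000 else 0.12)

# The kind of test an LLM plausibly generates: it passes, and it even
# "tests the behavior you asked for" -- but it never touches the bracket
# boundary, so the bug stays invisible.
def test_tax():
    assert tax(30_000) == 30_000 * 0.12
    assert tax(80_000) == 80_000 * 0.22

test_tax()

# The assertion that would actually catch it checks the boundary itself:
boundary_ok = tax(50_000) == 50_000 * 0.12
print("boundary case correct:", boundary_ok)  # prints False with the buggy ">="
```

No error is thrown anywhere; the only symptom is a slightly wrong number, which is exactly the invisible failure mode being discussed.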
The issue I have with using LLMs is the test code review. Often the LLM will make a 30- or 40-line change to the application code. I can easily review and comprehend this. Then I have to look at the 400 lines of generated test code. While it may be easy to understand, there's a lot of it. Go through this cycle several times a day and I'm not convinced I'm doing a good review of the test code due to mental fatigue. Who knows what I may be missing in the tests six hours into the work day?
> This thread is talking about vibe coding, not LLM-assisted human coding.
I was writing about vibe-coding. It seems these guys are vibe-coding (https://factory.strongdm.ai/) and their LLM coders write the tests.
I've seen this in action, though to dubious results: the coding (sub)agent writes tests, runs them (they fail), writes the implementation, runs tests (repeat this step and last until tests pass), then says it's done. Next, the reviewer agent looks at everything and says "this is bad and stupid and won't work, fix all of these things", and the coding agent tries again with the reviewer's feedback in mind.
Models are getting good enough that this seems to "compound correctness", per the post I linked. It is reasonable to think this is going somewhere. The hard parts seem to be specification and creativity.
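The loop I'm describing looks roughly like this. A toy sketch, with hypothetical stand-in functions where a real setup would call an LLM:

```python
# Toy sketch of the coder/reviewer loop. The agent functions are
# hypothetical stand-ins; a real system would make model calls.

def run_tests(code, tests):
    # Stand-in: pretend tests pass once the code covers the tested behavior.
    return all(t in code for t in tests)

def coding_agent(task, feedback=None):
    # Stand-in: "writes" code, incorporating any reviewer feedback.
    return task if feedback is None else task + " + " + feedback

def reviewer_agent(code):
    # Stand-in: demands error handling once, then approves (returns None).
    return None if "error handling" in code else "error handling"

def build(task, tests, max_rounds=5):
    feedback = None
    for _ in range(max_rounds):
        code = coding_agent(task, feedback)
        if not run_tests(code, tests):
            continue                      # loop back until tests pass
        feedback = reviewer_agent(code)
        if feedback is None:
            return code                   # reviewer satisfied, done
    raise RuntimeError("no approved implementation within budget")

print(build("parse invoice", tests=["parse invoice"]))
# → parse invoice + error handling
```

The key property is that "done" requires two independent gates (tests pass AND reviewer approves), which is where the compounding comes from.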
Maybe it's just the people I'm around, but assuming you write good tests is a big assumption. It's very easy to just test what you know works. It's the human version of context collapse, becoming myopic around just what you're doing in the moment, so I'd expect LLMs to suffer from it as well.
> the human version of context collapse, becoming myopic around just what you’re doing in the moment
The setups I've seen use subagents to handle coding and review, separately from each other and from the "parent" agent which is tasked with implementing the thing. The parent agent just hands a task off to a coding agent whose only purpose is to do the task, the review agent reviews and goes back and forth with the coding agent until the review agent is satisfied. Coding agents don't seem likely to suffer from this particular failure mode.
the right person is the tax accountant
I have zero issues with things going sideways on even the most complicated task. I don't understand why people struggle so much. It's easy to get it to do the right thing without having to hand-hold; you just need to be better at what you're asking for.
Sounds like you need to add more tests to your code. The AI is pretty good at that.
> A hallucinated edge case in a tax calculation doesn't throw an error.
Would double entry book keeping not catch this?
Not necessarily. Double entry bookkeeping catches errors in cases where an amount posted to one account does not have an equally offsetting post in another account or accounts (i.e., it catches errors when the books do not balance). It would not on its own catch errors where the original posted amount is incorrect due to a mistaken assumption, or if the offset balances but is allocated incorrectly.
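To illustrate with made-up numbers (the accounts and tax rates here are purely hypothetical): a posting computed with the wrong tax rate still balances perfectly, so a debits-equal-credits check sails right past it.

```python
# Double-entry's invariant only checks that debits equal credits,
# not that the posted amounts were computed correctly.

def balanced(entries):
    """entries: list of (account, debit, credit) tuples."""
    return sum(d for _, d, _ in entries) == sum(c for _, _, c in entries)

# "Correct" posting: a $1,000 sale with 8% tax collected.
correct = [
    ("cash",        1080, 0),
    ("revenue",     0,    1000),
    ("tax_payable", 0,    80),
]

# Miscalculated posting: tax computed at 5% by a buggy edge case.
# The books still balance, so the error is invisible to this check.
wrong = [
    ("cash",        1050, 0),
    ("revenue",     0,    1000),
    ("tax_payable", 0,    50),
]

print(balanced(correct), balanced(wrong))  # → True True
```

Catching the second case requires a check against an independent source (the tax rules themselves, or another ledger), not the balancing invariant.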
Well, the other ledgers are usually based off other data sources, so there is cross-checking, no?
Do you by any chance work on open source accounting?