Love the idea & spitballing ways to generalize it to coding.

Thought experiment: as you write code, an LLM generates tests for it & the IDE runs those tests as you type, showing which ones are passing & failing, updating in real time. Imagine 10-100 tests that take <1ms to run, being rerun with every keystroke, and the result being shown in a non-intrusive way.

The tests could appear in a separate panel next to your code, with pass/fail status shown in the gutter of that panel. As simple as red and green dots for tests that passed or failed in the last run.

The presence or absence and content of certain tests, plus their pass/fail state, tells you what the code you’re writing does from an outside perspective. Not seeing the LLM write a test you think you’ll need? Either your test generator prompt is wrong, or the code you’re writing doesn’t do what you think it does!

Making it realtime helps you shape the code.

Or if you want to do traditional TDD, the tooling could be reversed: you write the tests, and as soon as you stop typing, the LLM writes the code to make them pass.

Humans writing the tests first and the LLM writing the code is much better than the reverse. That is because tests are simply the “truth” and “intention” of the code, expressed as a contract.

When you give up the work of deciding what the expected inputs and outputs of the code/program are, you are no longer in the driver’s seat.

> When you give up the work of deciding what the expected inputs and outputs of the code/program are, you are no longer in the driver’s seat.

You don’t need to write tests for that, you need to write acceptance criteria.

What are tests but repeatable assertions of said acceptance criteria?

> You don’t need to write tests for that, you need to write acceptance criteria.

Sir, those are called tests.

I see you have little experience with Scrum...

Acceptance criteria are human-readable text that the person specifying the software has to write to fill up a field in Scrum tools, and they do not at all guide the work of the developers.

They're usually derived from the description by an algorithm (that the person writing them has to run in their mind), and any deviation from that algorithm should make the person edit the description instead, to make the deviation go away.

> Acceptance criteria are human-readable text that the person specifying the software has to write (...)

You're not familiar with automated testing or BDD, are you?

> (...) to fill up a field in Scrum tools (...)

It seems you are confusing test management software used to track manual tests with actual acceptance tests.

This sort of confusion would have been OK 20 years ago, but it has since gone the way of the dodo.

As someone quite familiar with human-run project management and Scrum, I believe parent was posting quite facetiously.

As in, a developer would write something in e.g. gherkin, and AI would automatically create the matching unit tests and the production code?

That would be interesting. Of course, gherkin tends to just be transpiled into generated code that is customized for the particular test, so I'm not sure how AI can really abstract it away too much.

All of this reduces to a simple fact at the end of the discussion.

You need some way of precisely telling AI what to do. As it turns out, there is only so much you can do with text. Come to think of it, you can write a whole book about a scenery, and yet 100 people will imagine it quite differently. And the actual photograph would still be totally different from what all 100 of them imagined.

As it turns out, if you wish to describe something accurately enough, you have to write mathematical statements, in other words statements that reduce to true/false answers. We could skip to the end of the discussion here and say you are better off either writing code directly or writing test cases.

This is just people revisiting logic programming all over again.

> You need some way of precisely telling AI what to do.

I think this is the detail you are not getting quite right. The truth of the matter is that you don't need precision to get acceptable results, at least not in 100% of the cases. As with everything in software engineering, there is indeed "good enough".

Also worth noting, LLMs allow anyone to improve upon "good enough".

> As it turns out if you wish to describe something accurately enough, you have to write mathematical statements, in other words statements that reduce to true/false answers.

Not really. Nothing prevents you from referring to high-level sets of requirements. For example, if you tell an LLM "enforce Google's style guide", you don't have to concern yourself with how many spaces are in a tab. LLMs have been migrating towards instruction files and prompt files for a while, too.

Yes, you are right, but only in the sense that a human decides whether the AI-generated code is right.

But if you want near-100% automation, you need a precise way to specify what you want, or else there is no reliable way to interpret what you mean. And without that, lots of regression/breakage has to be endured every time a release is made.

I’m talking higher level than that. Think about the acceptance criteria you would put in a user story. I’m specifically responding to this:

> When you give up the work of deciding what the expected inputs and outputs of the code/program are, you are no longer in the driver’s seat.

You don’t need to personally write code that mechanically iterates over every possible state to remain in the driver’s seat. You need to describe the acceptance criteria.

> When you give up the work of deciding what the expected inputs and outputs of the code/program are, you are no longer in the driver’s seat.

You're describing the happy path of BDD-style testing frameworks.

I know about BDD frameworks. I’m talking higher level than that.

> I know about BDD frameworks. I’m talking higher level than that.

What level do you think there is above "Given I'm logged in as a Regular User When I go to the front page Then I see the Profile button"?

The line you wrote does not describe a feature. Typically you have many of those cases and they collectively describe one feature. I’m talking about describing the feature. Do you seriously think there is no higher level than given/when/thens?

Could you give an example? It's not that I don't believe there are higher levels - I just don't want to guess what you might be hinting at.

> The line you wrote does not describe a feature.

I'm describing a scenario as implemented in a gherkin feature file. A feature is tracked by one or more scenarios.

https://cucumber.io/docs/gherkin/reference/

> Do you seriously think there is no higher level than given/when/thens?

You tell me which higher level you have in mind.

I'm curious what it could possibly be too. I guess he's trying to say the comments you might make at the top of a feature file to describe a feature would be his goal, but I'm not aware of a structured way to do that.

The problem is that tests are for the unhappy path just as much as the happy path, and unhappy paths tend to get particular and detailed, which means even in gherkin it can get cumbersome.

If AI is to handle production code, the unhappy paths need to at least be certain, even if repetitive.

I think your perspective is heavily influenced by the imperative paradigm, where you actually write the state transitions. Compare that to functional programming, where you only describe the relation between the initial and final state. Or logic programming, where you describe the properties of the final state and where to find the elements with those properties in the initial state.

Those do not involve writing state transitions. You are merely describing the acceptance criteria. Imperative is the norm because that's how computers work, but there are other abstractions that map more closely to how people think, or to how the problem is already solved.
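
For illustration only (not from the thread), here's a minimal TypeScript sketch of that contrast; `sortImperative` and `isAcceptableResult` are made-up names. The imperative version spells out every state transition, while the declarative check only states the properties the final state must satisfy, i.e. the acceptance criteria.

```typescript
// Imperative: spell out each state transition that leads to the final state.
function sortImperative(xs: number[]): number[] {
  const out = [...xs];
  for (let i = 0; i < out.length; i++) {
    for (let j = 0; j < out.length - i - 1; j++) {
      if (out[j] > out[j + 1]) {
        [out[j], out[j + 1]] = [out[j + 1], out[j]];
      }
    }
  }
  return out;
}

// Declarative: only describe the properties the final state must have.
// How the program gets there is left unspecified.
function isAcceptableResult(input: number[], output: number[]): boolean {
  const ordered = output.every((v, i) => i === 0 || output[i - 1] <= v);
  const samePermutation =
    [...input].sort((a, b) => a - b).join(",") ===
    [...output].sort((a, b) => a - b).join(",");
  return ordered && samePermutation;
}

console.log(isAcceptableResult([3, 1, 2], sortImperative([3, 1, 2]))); // true
```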

I didn’t mention state transitions. When I said “mechanically iterate over every possible state”, I was referring to writing tests that cover every type of input and output.

Acceptance criteria might be something like “the user can enter their email address”.

Tests might cover what happens when the user enters an email address, what happens when the user tries to enter the empty string, what happens when the user tries to enter a non-email address, what happens when the user tries to enter more than one email address…

In order to be in the driver’s seat, you only need to define the acceptance criteria. You don’t need to write all the tests.
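
For concreteness, a rough sketch of that fan-out in TypeScript (vitest and the toy `validateEmailInput` stand-in are assumptions, not anything from the thread): the single criterion expands into several repeatable assertions.

```typescript
import { describe, it, expect } from "vitest";

// Stand-in implementation so the sketch is self-contained; the real code
// (human- or LLM-written) only has to satisfy the tests below.
function validateEmailInput(value: string): boolean {
  return /^[^\s@,]+@[^\s@,]+\.[^\s@,]+$/.test(value.trim());
}

describe("acceptance criterion: the user can enter their email address", () => {
  it("accepts a well-formed email address", () => {
    expect(validateEmailInput("user@example.com")).toBe(true);
  });

  it("rejects the empty string", () => {
    expect(validateEmailInput("")).toBe(false);
  });

  it("rejects a value that is not an email address", () => {
    expect(validateEmailInput("not an email")).toBe(false);
  });

  it("rejects more than one email address", () => {
    expect(validateEmailInput("a@example.com, b@example.com")).toBe(false);
  });
});
```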

> "the user can enter their email address”

That only defines one of the things the user can enter. Should they be allowed to enter their postal address? Maybe. Should they be allowed to enter their friend's email address? Maybe.

Your acceptance criteria are too light on details.

Acceptance criteria describe the thing being accepted; they describe a property of the final state.

There is no prescriptive manner in which to deliver the solution, unless it was built into the acceptance criteria.

You are not talking about the same thing as the parent.

> That would be interesting. Of course, gherkin tends to just be transpiled into generated code that is customized for the particular test, so I'm not sure how AI can really abstract it away too much.

I don't think that's how gherkin is used. Take Cucumber, for example. Cucumber only uses its feature files to specify which steps a test should execute, whereas the steps themselves are pretty vanilla JavaScript code.

In theory, nowadays all you need is a skeleton of your test project, including feature files specifying the scenarios you want to run, and then to prompt an LLM to fill in the steps those scenarios require.

You can also use an LLM to generate feature files, but if the goal is to specify requirements and have a test suite enforce them, the scenarios are implicitly the starting point.
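
A small sketch of that split, using @cucumber/cucumber's JS/TS bindings (the feature text, the tiny in-memory "app", and the step bodies are all illustrative assumptions):

```typescript
// steps/profile.steps.ts -- step definitions for a feature file such as:
//
//   Feature: Profile access
//     Scenario: Regular user sees the Profile button
//       Given I'm logged in as a Regular User
//       When I go to the front page
//       Then I see the Profile button
//
import assert from "node:assert";
import { Given, When, Then } from "@cucumber/cucumber";

// A deliberately tiny in-memory "app" so the sketch is self-contained;
// a real suite would drive the actual application (e.g. via a browser driver).
const app = {
  role: "",
  frontPageButtons(): string[] {
    return this.role === "regular" ? ["Profile", "Logout"] : ["Login"];
  },
};

let visibleButtons: string[] = [];

Given("I'm logged in as a Regular User", function () {
  app.role = "regular";
});

When("I go to the front page", function () {
  visibleButtons = app.frontPageButtons();
});

Then("I see the Profile button", function () {
  assert.ok(visibleButtons.includes("Profile"));
});
```

The feature file stays readable by whoever wrote the requirement; the step bodies are the part you could plausibly ask an LLM to fill in.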

> Humans writing the tests first and the LLM writing the code is much better than the reverse.

Isn't that logic programming/Prolog?

You basically write the sequence of conditions (i.e. tests, in our lingo) that have to be true, and the compiler (now the AI) generates the code for you.

Perhaps there needs to be a fresh look at how logic programming can be done in the modern era to make this more seamless.

Yes, this is fundamental to actually designing software. Still, it would be perfectly reasonable to ask "please write a test which gives y output for x input".

I disagree. You can simply code in a way that makes all the tests pass, and then you have a bigger problem than before: reviewing the code that is being generated.

There's no way this would work for any serious C++ codebase. Compile times alone make this impossible

I'm also not sure how LLM could guess what the tests should be without having written all of the code, e.g. imagine writing code for a new data structure

> There's no way this would work for any serious C++ codebase. Compile times alone make this impossible

There's nothing in C++ that prevents this. If build times are your bogeyman, you'd be pleased to know that all mainstream build systems support incremental builds.

The original example was (paraphrasing) "rerunning 10-100 tests that take 1ms after each keystroke".

Even with incremental builds, that surely does not sound plausible? I only mentioned C++ because that's my main working language, but this wouldn't sound reasonable for Rust either, no?

> The original example was (paraphrasing) "rerunning 10-100 tests that take 1ms after each keystroke".

Yeah, OP's point is completely unrealistic and doesn't reflect real-world experience. This sort of test watcher is mundane in any project involving JavaScript, and even those tests don't re-run at each keystroke. Watch mode triggers tests when it detects changes, and it waits for the current test execution to finish before re-running.

This feature consists of running a small command-line app that is designed to run a command whenever specific files within a project tree are touched. There is zero requirement to only watch JavaScript files or to only trigger npm build when a file changes.

To be very clear, this means that right now anyone at all, including you and me, can install a watcher, configure it to run make test/cutest/etc. when any file in your project is touched, and call it a day. This is a 5-minute job.

By the way, nowadays even Microsoft's dotnet tool supports watch mode, which means there's out-of-the-box support for "rerunning 10-100 tests that take 1ms after each keystroke".
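
For the record, a minimal watcher along those lines can be a few lines of Node/TypeScript. This is a sketch, not the one true setup: chokidar as the watcher, "npm test" as the command, and the "src"/"test" paths are all arbitrary choices you'd swap for make test, ctest, etc.

```typescript
// watch-tests.ts -- rerun the test command whenever a watched file changes.
import { spawn } from "node:child_process";
import { watch } from "chokidar";

let running = false;
let pending = false;

function runTests(): void {
  if (running) {
    pending = true; // queue at most one follow-up run while tests execute
    return;
  }
  running = true;
  const child = spawn("npm", ["test"], { stdio: "inherit", shell: true });
  child.on("exit", () => {
    running = false;
    if (pending) {
      pending = false;
      runTests();
    }
  });
}

// Watch the source and test trees (paths are assumptions) and rerun on any change.
watch(["src", "test"], { ignoreInitial: true }).on("all", () => runTests());
```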

Some languages make this harder than others. Languages that require an expensive compilation step will certainly make it hard, while e.g. interpreted languages that allow dynamic reloading of code can potentially make it easy, by allowing preloading of the tests and reloading of just the modified code.

If you also don't necessarily expect to run the entire test suite, but just a subset of tests that are, say, labelled as testing a specific function and have no expensive setup, it'd potentially be viable.

You can also avoid running it on every keypress with some extra work:

- Keypresses that don't change the token sequence (e.g. because you're editing a comment) don't require re-running any tests.
- Keypresses that result in a syntactically invalid file don't require re-running any tests, just marking the error.

I think it'd be an interesting experiment to have editing, rather than file saves, trigger a test-suite watcher. My own editor synchronises the file state to a server process that other processes can observe, so if I wanted to I could wire a watcher up to re-tokenize an edited line and trigger the test suite when the state changes, instead of just on save (the caveat being that I'd need to deal with the file state not being on the file system). It already re-tokenizes the line for syntax highlighting anyway.

It doesn't sound reasonable for any language, tbh: tests don't run that fast, and running after each keystroke instead of on save or after a debouncing delay is just wasteful. If you amortize/ignore run times and load, and ignore the annoyance of tests blinking red/green at every keystroke, then I suppose it would be alright.

Then do you need tests to validate that your tests are correct? Otherwise the LLM might just generate passing code even if the test is bad, or write code that games the system because it's easier to hardcode an output value than to do the actual work.

There probably is a setup where this works well, but the LLM and humans need to be able to move across the respective boundaries fluidly...

Writing clear requirements and letting the AI take care of the bulk of both sides seems more streamlined and productive.

The harder part is “test invalidation”. For instance, if a feature no longer makes sense, the human/test validator must painstakingly go through and delete obsolete specs. An idea I’d like to try is to separate the concerns: only QA agents can delete specs; engineer agents must conform to the suite and make a strong case to the QA agent for deletion.

> Thought experiment: as you write code, an LLM generates tests for it & the IDE runs those tests as you type, showing which ones are passing & failing, updating in real time. Imagine 10-100 tests that take <1ms to run, being rerun with every keystroke, and the result being shown in a non-intrusive way.

I think this is a bad approach. Tests enforce invariants, and they are exactly the type of code we don't want LLMs to touch willy-nilly.

You want your tests to only change if you explicitly want them to, and even then only the tests should change.

Once you adopt that constraint, you'll quickly realize that every single detail of your thought experiment is already a mundane workflow in any developer's day-to-day activities.

Consider the fact that watch mode is a staple of any JavaScript testing framework, and those even found their way into .NET a couple of years ago.

So, your thought experiment is something professional software developers have been doing for what? A decade now?

I think tests should be rewritten as much as needed. But to counter the invariant part, maybe let the user zoom back and forth through past revisions and pull whatever they want into the current version, in case something important is deleted? And then allow “pinning” of some stuff so it can’t be changed? Would that address your concerns?

> I think tests should be rewritten as much as needed.

Yes, I agree. The nuance is that they need to be rewritten independently and without touching the code. You can't change both and expect to get a working system.

I'm speaking from personal experience, by the way. Today's LLMs don't enforce correctness out of the box, and agent mode has only one goal: getting things to work. I've had agent mode flip invariants in tests when trying to fix unit tests it broke, and I'm talking about egregious changes such as flipping a requirement like "normal users should not have access to the admin panel" to "normal users should have access to the admin panel". The worst part is that if agent mode is left unsupervised, it will even adjust the CSS to make sure normal users have a seamless experience going through the admin panel.

Agreed that's a concern.

There could be some visual language for how recently the LLM-generated tests (or, in TDD mode, the code) changed... then you'd be able to see that a test failed and was changed recently. Would that help?

> Imagine 10-100 tests that take <1ms to run, being rerun with every keystroke, and the result being shown in a non-intrusive way.

Even if this were possible, it seems like an absolutely colossal waste of energy - both the computer's and my own. Why would I want incomplete tests generated after every keystroke? Why would I test an incomplete if statement or some such?

> Imagine 10-100 tests that take <1ms to run, being rerun with every keystroke, and the result being shown in a non-intrusive way.

Doesn’t seem like high ROI to run the full suite of tests on each keystroke. Most keystrokes yield an incomplete program, so you want to be smarter about when you run the tests to get a reasonably good trade-off.

[deleted]

You could prune this drastically by just tokenizing the file with a lexer suitable for the language, turn them into a canonical state (e.g. replace the contents of any comment tokens with identical text), and check if the token state has changed. If you have a restartable lexer, you can even re-tokenize only from the current line until the state converges again or you encounter a syntax error.
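
A toy TypeScript sketch of that idea (the whitespace/`//`-comment tokenizer here is just a stand-in; a real setup would reuse the language's own lexer):

```typescript
// Canonicalize the source: blank out // comment contents and split into tokens,
// so edits inside comments or to whitespace never change the canonical state.
function canonicalTokens(source: string): string[] {
  return source
    .split("\n")
    .map((line) => line.replace(/\/\/.*$/, "//"))
    .join("\n")
    .split(/\s+/)
    .filter((t) => t.length > 0);
}

let lastState: string | undefined;

// Called on every keystroke: only report a change worth re-running tests
// for when the canonical token sequence actually differs from last time.
function tokenStateChanged(source: string): boolean {
  const state = canonicalTokens(source).join("\u0000");
  const changed = state !== lastState;
  lastState = state;
  return changed;
}
```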

That's already part of most IDEs, and they know which tests to re-run because of coverage, so it's really fast.

It also updates the coverage on the fly; you don't even have to look at the test output to know that you've broken something, since the tests are no longer reaching your lines.

https://gavindraper.com/2020/05/27/VS-Code-Continious-Testin...

WallabyJS does something along these lines, although I don’t think it contextually understands which tests to highlight.

https://wallabyjs.com/

Yes, the reverse makes much more sense to me. AI helps spec out the software, and then the code has an accepted definition of correctness. People focus on this way less than they should, I think.

Besides generating the tests, automatically running tests on edit and showing the results inline is already a thing. I think it'd be better to do it the other way around: start with the tests and let the LLM implement the code until all tests are green. Test-driven development.