I found helpful this explanation of what Antithesis isn't:

> Property-based testing vs. Antithesis

> Property-based testing (PBT) uses random inputs to check individual data structures, procedures, or occasionally whole programs for high-level invariants or properties. Property-based testing has much in common with fuzzing—the main differences are heritage (PBT comes from the functional programming world, while fuzzing comes from the security/systems programming world) and focus (program functionality vs. security issues). Like fuzzing, PBT is generally only applicable to self-contained libraries and processes.

> Antithesis is analogous to applying PBT to an entire interacting software system—including systems that are concurrent, stateful, and interactive. Antithesis can randomly vary the inputs to a software program, and also the environment within which it runs. Like a PBT system, Antithesis is designed to check high-level properties and invariants of the system under test, but it can do so with many more types of software.
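To make the PBT side of the comparison concrete, here is a minimal hand-rolled property check in Python: a sketch of what libraries like Hypothesis or QuickCheck automate (with smarter generation and shrinking). The generator and the sorting property are invented for illustration.

```python
import random

def check_property(prop, gen, trials=1000, seed=0):
    """Toy stand-in for a PBT library: generate random inputs and
    check that `prop` holds for each. Returns the first
    counterexample found, or None if every trial passed."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = gen(rng)
        if not prop(x):
            return x
    return None

# Property: sorting is idempotent and preserves length.
def sort_invariants(xs):
    s = sorted(xs)
    return sorted(s) == s and len(s) == len(xs)

# Generator: random-length lists of small integers.
random_int_list = lambda rng: [rng.randint(-100, 100)
                               for _ in range(rng.randint(0, 20))]

assert check_property(sort_invariants, random_int_list) is None
```

A real PBT library adds the important part this sketch omits: shrinking a failing input down to a minimal counterexample.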

I've scrubbed through the video, and it seems to be 100% talking-head filler except for an outro still image—no actual video information content at all unless you want to analyze Wilson's facial expressions or think he's hot.

Regular reminder that yt-dlp (--write-sub --write-auto-sub --sub-lang en) can download subtitles that you can read, grep, and excerpt, so you don't have to watch videos like this unless you like to.

Great tip about downloading subtitles, useful!

Thanks for the auto-sub tip; I didn't know that was a feature.

How did you get yt-dlp to work? It used to work for me, and I just did a fresh install a week ago, and now YouTube is giving me auto/cookie/sign-in errors (captcha, I presume?) when it didn't before.

As a general rule, you should update yt-dlp before using it. They release new versions very frequently to work around new walls on YouTube and other platforms. An update usually solves this kind of issue for me, even if I've updated just a few days ago.

(I haven't tried it today so can't speak to whether this is a complete solution in this particular case.)

At the moment I'm getting "HTTP Error 429: Too Many Requests" (with yt-dlp-2025.9.5 installed in a virtualenv via pip), which has been happening more often recently. I got it when downloading the Spanish subtitles file after successfully downloading the English one, so yt-dlp didn't continue on to try to download the video. But YouTube has also been working unreliably for me in the browser.

Edit: a few minutes later it worked, although I didn't let it download the whole video, because it was huge. The subtitle file is 12631 words processed with http://canonical.org/~kragen/sw/dev3/devtt.py. That's about 38 minutes of reading.
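For reference, the 38-minute estimate corresponds to an assumed reading speed of about 330 words per minute:

```python
words = 12631          # words in the subtitle file
wpm = 330              # assumed average reading speed, words per minute
minutes = words / wpm
print(round(minutes))  # → 38
```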

One drawback of the transcript in this case is that it doesn't identify the speaker. It doesn't seem to contain many errors.

The key point seems to be this one (18'06"):

> But what you want to do is use guidance and feedback from the system under test to optimize the search, and notice when interesting things have happened: things that aren't necessarily bugs, but that are rare behavior or special behavior or unusual behavior. And so the test system can see that something interesting has happened and follow up opportunistically on that discovery. And that gives you a massive lift in the speed of finding issues.

> And the way that we're able to do that is with this sort of magical hypervisor that we've developed which allows us to deterministically and perfectly recreate any past system state.

> So people generally think of the value of that hypervisor as: any issue we find is reproducible. Nothing is "works on my machine." If we find it once, we can repro it for you ad infinitum.

Including reproducibility of phenomena that aren't, strictly speaking, computational:

> All of the very low-level decisions about when threads get scheduled, or how long particular operations take, or exactly how long a packet takes to get from node A to node B, will reproduce 100% perfectly from run to run.

But, interestingly, they're not targeting things like multicore race conditions, even though their approach is the only way you could make them reproducible; instead they just always do some kind of thread interleaving (though they do change the thread interleaving order):

> If you did it that way, with like a cycle-accurate CPU simulator, you could find all kinds of weird bugs that required true multicore parallelism or weird atomic memory operations, stuff like that. Um, we are not trying to find those bugs, because 99.999% of developers can never even think about those bugs, right? We're trying to find more everyday type stuff.

Also:

> 99% of your CPU instructions are just executing on the host CPU and it's very fast. And so that means there's not much performance overhead at all to doing this, which is, I think, really important to making it actually practical.

I'm guessing this means they're using the hardware virtualization extensions on amd64 CPUs (Intel VT-x / AMD-V), just like Xen or whatever.

I found amusing the analogy of deterministic-replay-based time-travel fuzzing (like American Fuzzy Lop does) to save-scumming:

> But the crazy thing is, once I have a time machine, once I have a hypervisor, I can run until I make event A happen. And then if I notice that event A has happened, I can say: this is interesting, I want to now just focus on worlds where event A has happened. I don't need to re-find event A every single time. I can just lock it in, right? If you play computer games, it's like save-scumming: I can just save my state when I got the boss down to half health and now always reload from that point.

> And so it takes me a thousand trials to get event A to happen and now just another thousand to get B to happen instead of it taking a million trials if I always have to start from the start.
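The arithmetic in the quote, with each rare event assumed independent and taking an expected thousand trials:

```python
# Expected trials for each rare event, as in the quoted example.
trials_a = 1000  # runs to make event A happen once
trials_b = 1000  # runs to make event B happen, given A already has

# Without a time machine: A and B must both happen in the same run,
# starting from scratch every time.
from_scratch = trials_a * trials_b   # 1,000,000

# With save-scumming: find A once, snapshot, then search for B
# only in worlds where A has already happened.
with_snapshot = trials_a + trials_b  # 2,000

assert from_scratch == 1_000_000
assert with_snapshot == 2_000
```

The multiplicative cost turns additive, which is the whole point of the checkpoint.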

A lot of the content of the interview won't be novel if you're familiar with things like afl-fuzz, data prevalence, or time-travel debugging, but it's pretty interesting to read about their experiences.

As far as I know, though, this is novel:

> When we actually do find a bug, we can then go back and ask: when did the bug become inevitable? This is kind of crazy.

> How?

> We can go back to the previous time that we reached in and changed the future, and we can try changing it to like a hundred different things and see if they all still hit the bug. And if they do, it means the bug was already baked in. And then we can go back to the next one before that and do the same thing.

> And we can sort of bisect backwards, and then we can find the exact moment when the bug went from really unlikely to really likely. And then we can do things like look at which lines of code were running then, and look at what log messages were being printed then. And often that is actually enough to root-cause the bug too.
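A toy sketch of that backwards bisection. The `bug_probability` oracle here is faked (the bug is hard-wired to become inevitable at checkpoint 60, a number invented for illustration); in the real system it would mean "resume from this checkpoint, randomize the future many times, and count how often the bug still reproduces."

```python
import random

INEVITABLE_AT = 60  # invented for illustration

def bug_probability(checkpoint, trials=200):
    """Fake oracle: fraction of randomized futures from `checkpoint`
    that still hit the bug."""
    if checkpoint >= INEVITABLE_AT:
        return 1.0  # bug already baked in
    rng = random.Random(checkpoint)
    hits = sum(rng.random() < 0.01 for _ in range(trials))
    return hits / trials  # bug still rare before the critical moment

def find_inevitability_point(lo, hi, threshold=0.9):
    """Bisect for the earliest checkpoint from which the bug
    reproduces regardless of how the future is changed."""
    while lo < hi:
        mid = (lo + hi) // 2
        if bug_probability(mid) >= threshold:
            hi = mid  # bug already inevitable here; look earlier
        else:
            lo = mid + 1  # bug still avoidable; look later
    return lo

assert find_inevitability_point(0, 100) == INEVITABLE_AT
```

Once the transition point is found, you inspect what the system was doing right then: lines of code executing, log messages, and so on.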

Is there a demo of what Antithesis does? I have seen it on HN a few times and I like the idea of monkey typing a system. But how does it work in practice? Does it call my APIs, does it introduce memory corruptions, does it bring down my containers...what does it do?

It arbitrarily reorders events across the entire "universe" and injects reasonable kinds of faults (e.g. dropping or reordering packets). It does so by running all events for all threads across all machines in a deterministic "random" order, serializing them onto a single thread; the randomness is initialized by the seed for that run. It also runs the universe faster than real time, since there's no actual network delay or wall-clock time elapsing (that too is simulated).

You generate the workload by defining your test case the same as property tests or traditional example tests. You cannot call arbitrary network services.
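A toy model of that seeded, single-threaded scheduler. The generator-based "threads" and the pick-next policy are invented for illustration, not Antithesis's actual implementation; the point is that the whole interleaving is a pure function of the seed.

```python
import random

def deterministic_run(threads, seed):
    """Interleave events from several 'threads' (iterators) in a
    pseudo-random order fixed entirely by `seed`, on one real
    thread. Same seed -> identical event order, every time."""
    rng = random.Random(seed)
    runnable = {name: iter(events) for name, events in threads.items()}
    trace = []
    while runnable:
        name = rng.choice(sorted(runnable))  # seeded "random" pick
        try:
            trace.append((name, next(runnable[name])))
        except StopIteration:
            del runnable[name]  # this thread has finished
    return trace

def make_threads():
    return {"a": iter(["a0", "a1"]), "b": iter(["b0", "b1"])}

# Reproducibility: the same seed always yields the same interleaving.
assert deterministic_run(make_threads(), 1) == deterministic_run(make_threads(), 1)
```

Changing the seed explores different interleavings; replaying a seed replays the exact same "universe."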

Is it like Stryker, basically? Mutation testing?

Or is it a superset of mutation testing?

Mutation testing is poorly named, in some sense. It doesn't actually test anything; it's a probabilistic measurement of code coverage. It tells you how well your existing test suite would catch not-yet-existing bugs: it changes your code to generate a "mutant" (e.g. changes an addition to a subtraction) and checks whether your test suite still passes; if it does, that counts as a failure. Traditional alternatives are things like codecov, which measure line or branch coverage, and which famously don't give you an accurate estimate of quality, whereas mutation testing does a somewhat better job. But none of this actually tests anything; it's more a meta-metric of how good your test coverage is.
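A toy illustration (the `add` functions and suites are invented): a mutant that survives a weak suite reveals a coverage gap, while a stronger suite kills it.

```python
def original(a, b):
    return a + b

def mutant(a, b):
    return a - b  # the "mutation": + changed to -

def weak_suite(add):
    # Only exercises a case where + and - agree,
    # so it cannot distinguish mutant from original.
    return add(0, 0) == 0

def strong_suite(add):
    return add(0, 0) == 0 and add(2, 3) == 5

assert weak_suite(original) and weak_suite(mutant)          # mutant survives
assert strong_suite(original) and not strong_suite(mutant)  # mutant killed
```

A surviving mutant means there is a plausible bug the suite would never notice.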

Antithesis is more like property testing: it tests your code under random scenarios and sees if the tests still pass. But unlike property testing, rather than just randomizing inputs, it also randomly reorders events at a very deep level to make sure your distributed system behaves correctly. It can even be used for simple things like helping you deterministically reproduce a flaky test in a non-distributed system.

Are there any tools we could use for AI-driven automation testing of mobile apps (like MCPs for mobile app testing)?

Reflect tests mobile apps by converting plain-text instructions into Appium commands at runtime using AI. Your tests are just the text steps.

https://reflect.run/mobile-testing/

disclaimer: I co-founded Reflect.

Have they sustained at least one prominent Rust-first, Rust-core customer? I doubt it; Rust has a lot of tooling and catches at compile time what their product catches at runtime.

I'm also not sure about Antithesis's business practices. You pay them for integration, you spend time educating them and improving their product, and in the end you get vendor lock-in on their compute with arbitrary, non-transparent pricing.

If you're not in Rust, sure, it can be cost-efficient.

I think you are misunderstanding. Rust does not solve or prevent distributed systems bugs, just memory safety and certain kinds of thread safety problems. For that you’d need to use a formal proof system like Coq.

There's a reason you should still be writing unit tests and hypothesis/property tests in Rust: to catch, at runtime, the issues the compiler can't, which is a huge surface area.

> There's a reason you should still be writing unit tests and hypothesis/property tests in Rust: to catch, at runtime, the issues the compiler can't, which is a huge surface area.

It would be irresponsible to suggest that Rust eliminates a large enough proportion of common errors that you can YOLO anything that compiles and achieve an acceptable defect rate... but it does happen to be true in my experience.

Yes.

Tests test what? Logic. And logic = types (a proven correspondence). So the stronger the type system, the fewer tests need to be written.

Moreover, since a proc macro or build.rs can execute logic based on parsed "types" (partial information only), we can extend the type system with custom logic (and panic at compile time and/or startup time if a usage violation is detected).

On top of that, add a fail-fast culture (fail at compile time, build time, or startup time), newtypes, errors-as-part-of-the-API, and the lack of late binding (dyn is of very limited use, and there's no runtime reflection), and we get even fewer reasons to write tests.

Some examples of industrial "typing" (eDSL, construction-time) solutions in Rust:

- https://github.com/elastio/bon

- https://github.com/contextgeneric/cgp

- https://github.com/paritytech/orchestra

- https://git.pipapo.org/cehteh/linear_type.git

Sure, we need to write tests, including the kind Antithesis helps with.

But the list of tools that help with testing exactly as Antithesis does (and more besides) is huge, and it's built on top of a strong supply-chain audit, quality, and security story. There are even "levels" of determinism tooling to decide how much to pay for cloud compute.

OK. Please write me an implementation of Raft using no tests, and have the Rust type checker prove correctness. I admit complete ignorance of how to get Rust to even go about partially proving that.

I guess most of the issues Antithesis finds are preventable by simple or more evolved Rust patterns.

In Rust I just have more time for the other things you mentioned.

Also, it is clear that you misunderstand Rust. The Rust type and macro system lets me write ad-hoc partial proofs of things in my code with no extra tooling. Those are the easy bits Rust adds on top of thread and memory safety.

And I definitely don't need to run to Rocq for help right away; the Rust ecosystem has a lot of options.

Also, not only does the language itself matter, but so does Cargo, which comes along with it.

You still seem to be completely misunderstanding, as is evident from the fact that your argument "proves" that in Rust you don't even need to write any tests. Again, Antithesis is designed to test distributed systems, deterministically.

Sorry, where exactly did I state there's no need to write tests?

I argue, overall, that Antithesis is less likely to be adopted in Rust because of the language itself (in the extended sense of type-level patterns and macro simulations) and its ecosystem (by ecosystem I mean the available libraries, tools, and integrations, which cover a lot of Antithesis's agenda). I didn't expand on the ecosystem argument, because there hasn't been an objection to it yet.