https://www.sigops.org/2026/can-llms-model-real-world-system...

"Running leading LLMs across the eleven systems shows that LLMs are great at producing correct TLA+ syntax but struggle to ensure conformance and appropriate invariants."

I think that's what author implies by this sentence in the intro:

> It’s still your responsibility to understand your system and define what “correctness” means, and you need a high-level understanding of temporal logic.

As TLA+ uses state machines that can define infinite state spaces, checking arbitrary temporal logic formulas is undecidable in the general case.

Using a TLC model checker to verify invariants or properties, finding a valid counterexample can scale non-elementarily.

In fact, in some situations it can end up being Ackermann-complete[0], but even recursively enumerable problems can be Tower-complete.

I would say that unless you focus on more detailed temporal logic pitfalls you may have issues.

So IMHO this will stay a use case specific solution, that you choose based on context.

Even a common solution like adding Circumscription causes counterintuitive changes [1][2].

IMHO, if you want to use TLA+ as a primary method, you will need some depth or be ready to abandon it by time boxing etc…

Remember that we know the open domain frame problem in [2] is equal to HALT, it will not universally apply.

It is just another tool that works well when it works well.

[0] https://arxiv.org/abs/2104.13866

[1] https://arxiv.org/abs/2407.20822

[2] https://en.wikipedia.org/wiki/Circumscription_(logic)

OTOH if you’re developing a service with a frontend and a backend you are firmly in the distributed system territory even if you only have one user, so it doesn’t have to work all the time; it’s enough if it works approximately more often than not to be an improvement.

This blog post made it more difficult than it should to find the actual tool (needed to click twice! abysmal, really): https://github.com/specula-org/Specula

Ran it on my tricky caching changes and it found bugs the standard agentic review flows missed on all frontier models. Good stuff.