Interesting thread. One angle I'd add: when you run multiple AI agents, the coordination problem becomes the dominant failure mode.
Specifically, shared state management — agents reading and writing concurrently without collision detection leads to silent failures that look like model quality issues but are actually concurrency bugs.
We open-sourced a coordination layer for this: https://github.com/Jovancoding/Network-AI