The "non-deterministic microservices" framing is exactly right and I think most infra teams underestimate how much that changes things. With a normal service, you can map inputs to expected outputs and write tests. With agents, the blast radius is probabilistic and context-dependent.

The monitoring problem alone is closer to fraud detection than traditional APM. You're not looking for "is this thing up," you're looking for "is this thing subtly wrong in a way that compounds over the next 10 steps."

I'd argue you need both, though. You still want the plain "is this thing healthy" signal, because you also need to know when your agent has collapsed and is just burning through tokens and your budget.
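For the "burning tokens" case, here's a rough sketch of the kind of guard I mean. Everything in it is made up for illustration: the `RunawayGuard` name, the thresholds, and the idea of treating "the same action repeated several times in a short window" as a sign of collapse are all assumptions, not something from a particular framework.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RunawayGuard:
    """Hypothetical per-run guard: halts on budget overrun or repeated actions."""
    max_tokens: int = 50_000   # assumed per-run token budget
    window: int = 6            # number of recent steps to inspect
    max_repeats: int = 3       # identical actions in the window that count as a loop
    tokens_used: int = 0
    recent: deque = field(default_factory=deque)

    def record(self, action: str, tokens: int) -> Optional[str]:
        """Record one agent step; return a halt reason, or None to keep going."""
        self.tokens_used += tokens
        self.recent.append(action)
        # Keep only the last `window` actions.
        if len(self.recent) > self.window:
            self.recent.popleft()
        if self.tokens_used > self.max_tokens:
            return "token budget exceeded"
        if self.recent.count(action) >= self.max_repeats:
            return "repeated action loop"
        return None
```

You'd call `record()` after every step with some normalized description of the action (tool name plus arguments, say) and stop or escalate when it returns a reason. It only catches the blunt failure mode; the "subtly wrong" kind still needs the fraud-detection-style monitoring.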