I'm curious how this works if the green team writes an implementation that makes a network call like an RPC.
Red team might not anticipate this if the spec does detail every expected RPC (which seems unreasonable: this could vary based on implementation). But a unit test would need mocks.
Is green team allowed to suggest mocks to add to the test? (Even if they can't read the tests themselves?) This also seems gamaeable though (e.g. mock the entire implementation). Unless another agent makes a judgement call on the reasonability of the mock (though that starts to feel like code review more generally).
Maybe record/replay tests could work? But there are drawbacks in the added complexity.
I think the solution here is: Don't mock and inject dependencies explicitly, as function parameters / monads / algebraic effects. Make side effects part of the spec/interface.