outsourcing testing the AI also gets its code to be connected to deterministic results, and show let the agent interact with the code to speculate expectations and check them against the actual code.

it could still speculate wrong things, but it wont speculate that the code is supposed to crash on the first line of code