Using progressive disclosure with Codex is a fascinating way to handle complexity, but game physics are notoriously difficult to validate via text-based checklists. Since collision detection and movement can be subtle or visual, how did your Playwright setup distinguish between "working" mechanics and edge cases like glitching through walls? I'm curious if the Implement -> Evaluate loop ever got stuck cycling on a specific bug where the agent couldn't satisfy the test criteria without human intervention. Did you have to define specific tolerance thresholds for the physics engine to prevent false positives in the evaluation phase?
Incredibly, I didn't do anything. I just told Codex to use the Playwright CLI, told it what to check (in plain English), and it did its thing. Looking at its log I can see that it was "playing" the game and defining its own test conditions, such as whether the player/NPC is *not* on one of the "collidable" tiles, whether the NPC is "going over the edge" of a collidable area, whether it's facing the wrong way, etc. Sometimes it found bugs: for example, while running gravity checks it discovered that one of the movements wasn't working correctly, and it went ahead and fixed it.
So essentially it used the CLI to read all the x,y coordinates, speed, and timing, took screenshots, and combined those together.
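To give a sense of the kind of check the agent was inventing, here's a minimal sketch of a "player is not inside a collidable tile / not going over the edge" test. Everything here is hypothetical: the tile map, `TILE_SIZE`, and the `player` object are made-up stand-ins for whatever state the agent actually read out of the real game via the CLI.

```javascript
// Hypothetical sketch of agent-style collision checks.
// Tile map, tile size, and player shape are assumptions, not the real game's.

const TILE_SIZE = 16;
// 1 = collidable (wall), 0 = walkable
const tileMap = [
  [1, 1, 1, 1],
  [1, 0, 0, 1],
  [1, 0, 0, 1],
  [1, 1, 1, 1],
];

function tileAt(x, y) {
  const col = Math.floor(x / TILE_SIZE);
  const row = Math.floor(y / TILE_SIZE);
  if (row < 0 || row >= tileMap.length || col < 0 || col >= tileMap[0].length) {
    return 1; // treat out-of-bounds as collidable
  }
  return tileMap[row][col];
}

// Check 1: the player must not be standing inside a collidable tile.
function insideWall(player) {
  return tileAt(player.x, player.y) === 1;
}

// Check 2: the player must not be about to go over the edge of the
// walkable area; probe one tile ahead in the direction of movement.
function aboutToLeaveWalkable(player) {
  const nextX = player.x + Math.sign(player.vx) * TILE_SIZE;
  const nextY = player.y + Math.sign(player.vy) * TILE_SIZE;
  return tileAt(nextX, nextY) === 1;
}

const player = { x: 36, y: 20, vx: 1, vy: 0 }; // walkable tile, moving right
console.log(insideWall(player));           // false: on a walkable tile
console.log(aboutToLeaveWalkable(player)); // true: wall one tile to the right
```

In the actual run these reads would come from Playwright (e.g. evaluating game state in the page and taking screenshots) rather than a hardcoded map; the point is that the pass/fail conditions themselves were authored by the agent, not by me.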
My takeaway from this is: just let the agent do it. Trying to dictate specific conditions and checks actually lowers the agent's performance. Simply give it a guide.