I wonder how many tokens and how much time were used for the verification part. Maybe GLM 5.2 instantly found the "solution" of reading the screen pixel by pixel, but it could also have been a major token and time sink.
Hi, author here. I can't give an exact number for how many tokens the verification step took, but the verification GLM 5.2 ran was very stupid and definitely a waste of time: it read the pixel color data to try to verify that the scene rendered properly, which is really bad. Opus opened the game in a Playwright browser and took screenshots to verify the actual image, which helped a lot.
Pro tip: to get around this issue, you could have GLM 5.2 spawn a multimodal model as a subagent to verify the images.
That's a dumb way to do it; it should just write the frame buffer to a PNG instead of taking screenshots. I guess you can't take the dumb web developer ways out of these models at the end of the day.
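For anyone curious what "write the frame buffer to a PNG" could look like, here is a minimal sketch using only the Python standard library. It is not what either model actually ran. It assumes the frame buffer is already available as a tightly packed, top-down RGBA8 byte string; `rgba_to_png` and `_chunk` are hypothetical helper names made up for this example. If the buffer came from an OpenGL/WebGL-style `readPixels` call, the rows would be bottom-up and would need flipping first.

```python
import struct
import zlib


def _chunk(kind: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)


def rgba_to_png(pixels: bytes, width: int, height: int) -> bytes:
    """Encode a tightly packed, top-down RGBA8 buffer as PNG bytes."""
    stride = width * 4
    if len(pixels) != stride * height:
        raise ValueError("buffer size does not match width * height * 4")
    # Each scanline is prefixed with filter type 0 (no filtering).
    raw = b"".join(
        b"\x00" + pixels[y * stride:(y + 1) * stride] for y in range(height)
    )
    # IHDR: width, height, bit depth 8, color type 6 (RGBA),
    # compression 0, filter 0, interlace 0.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 6, 0, 0, 0)
    return (
        b"\x89PNG\r\n\x1a\n"
        + _chunk(b"IHDR", ihdr)
        + _chunk(b"IDAT", zlib.compress(raw))
        + _chunk(b"IEND", b"")
    )


if __name__ == "__main__":
    # Hypothetical usage: a 2x2 solid red frame written to disk.
    red = bytes([255, 0, 0, 255]) * 4
    with open("frame.png", "wb") as f:
        f.write(rgba_to_png(red, 2, 2))
```

Note that a text-only model still can't look at the resulting PNG, so this only helps if something that can see images (a person or a multimodal subagent) does the checking.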
I could be wrong, but I believe this is a non-vision model. Please weigh in to correct me, because I would love to be wrong.
GLM 5.2 is text-only, not multimodal. Opus is multimodal.