Well, obviously it's controlling your computer too - it drives mouse and keyboard input, and it has been trained to interact with apps (to recognize and use UI components). It's not clear exactly what all the moving parts are or how they interact.

I wouldn't be so dismissive - you could describe GPT-o1 in the same way: "it just loops until it gets to the solution". It's the details and implementation that matter.