I also worked on computer agents at a VC-backed startup, ran into the same issues, and at one point we built something fairly similar.

It’s useful, but it has limitations: it seems to work well only in environments that are perfectly predictable. Otherwise it gets in the way of the agent.

I think I prefer RL over these approaches, but it requires a bit more data.