Hacker News

I'm trying to use computer use and browser use (via Playwright MCP) in my work. Computer use is hit and miss (mostly miss), but Playwright MCP often works very well. The downside is that it takes a long time to complete even easy tasks.

For example, to automate email processing, it needs to:

1. go to Gmail
2. log in to Google if necessary (this often requires two-step verification, so it's hard to automate completely, but possible)
3. read the latest email
4. check the content and choose an action:
   - if needed, reply to the email
   - if it mentions tasks, add them to the todo list
   - if it mentions schedules, add them to the calendar
5. repeat for all emails that match the specified conditions

Each step requires dozens of DOM (a11y tree) analyses and actions (fill in the username/password inputs, check the "keep me logged in" box, click the submit button, etc.). Depending on the model, each step can take ~100s, so easy tasks can add up to tens of minutes or even hours.
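To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The ~100s per step comes from the estimate above; the split into two one-time steps and two per-email steps, and the count of 10 emails, are assumptions made only for illustration.

```python
def estimate_minutes(one_time_steps: int, per_email_steps: int, emails: int,
                     seconds_per_step: float = 100.0) -> float:
    """Rough wall-clock estimate when the model drives every step itself."""
    total_steps = one_time_steps + per_email_steps * emails
    return total_steps * seconds_per_step / 60.0


# Assumed split: steps 1-2 (open Gmail, log in) happen once,
# steps 3-4 (read the email, choose and perform the action) repeat per email.
print(round(estimate_minutes(2, 2, 10), 1))  # 36.7
```

So even a small inbox of 10 emails lands at well over half an hour under these assumptions.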

For frequently used tasks, I write skills like /logging-in and /read-latest-emails using Playwright scripts and let the agent choose them. Based on the email content, the agent then chooses other tools like /write-reply, /add-todo, /add-event, etc., so that the model can focus only on the core tasks that require thinking. This reduces the execution time drastically.
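A minimal sketch of that split, assuming a simple in-process registry: the skill bodies are stubs standing in for real Playwright scripts, and `choose` stands in for the model, whose only job is to pick a skill name. The `skill` decorator, `handle`, and `toy_choose` are hypothetical names, not part of any existing tool.

```python
from typing import Callable, Dict

# Hypothetical registry: skill name -> deterministic script.
SKILLS: Dict[str, Callable[[dict], str]] = {}


def skill(name: str):
    """Register a function under a slash-command style name."""
    def register(fn: Callable[[dict], str]) -> Callable[[dict], str]:
        SKILLS[name] = fn
        return fn
    return register


@skill("/write-reply")
def write_reply(email: dict) -> str:
    # A real version would drive the browser with a Playwright script.
    return f"replied to {email['sender']}"


@skill("/add-todo")
def add_todo(email: dict) -> str:
    return f"todo added: {email['subject']}"


@skill("/add-event")
def add_event(email: dict) -> str:
    return f"event added: {email['subject']}"


def handle(email: dict, choose: Callable[[dict], str]) -> str:
    """Let the model pick a skill name, then run the script without the model."""
    return SKILLS[choose(email)](email)


def toy_choose(email: dict) -> str:
    # Stand-in for the model's decision; a real agent would read the content.
    return "/add-todo" if "todo" in email["subject"].lower() else "/write-reply"


print(handle({"sender": "a@example.com", "subject": "TODO: ship report"}, toy_choose))
```

The slow part (the model) runs once per email to make a decision; everything after that is a plain function call.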

But this can bury important business logic in the Playwright scripts instead of the agent's instructions. For example, simplified steps to add TODO items look like this:

1. read the email
2. check if it's about todos, then decide to add them to Asana
3. extract and summarize the title, content, priority, due date, tags, etc.
4. access Asana (log in if necessary)
5. check if there are similar tasks
6. if not, add the tasks

This can take tens of minutes, and each step can contain important business logic, such as:

- how to decide the priority and due date
- how to choose tags based on the content
- how to decide if two tasks are similar

This information should be readable and editable not only by developers but also by managers and other teams. If I write those steps as skills with Playwright scripts, it improves the speed, but all that business logic is buried in the code, so it's not accessible to non-technical people. It's also error-prone, because websites often tweak their UI and the scripts can stop working.

So it would be very convenient if the agent processed these steps once, then decided whether it's worth writing a Playwright script, so that next time those mundane processes can be executed instantly.

With automatic skill generation, the agent decides by itself whether there are workflows worth turning into skills with Playwright scripts, like /log-in, /extract-information, /check-similar-tasks, /add-tasks. Like a Just-In-Time compiler, the skills are a byproduct of the agent's instructions: all business logic is written in the agent's instructions, and the skills don't need to be updated manually or tracked in a version control system.
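One way this could look, as a rough sketch rather than a real implementation: generated scripts are cached under a hash of the instruction text, so editing the instruction automatically invalidates the old script, and a script that throws (say, after a UI change) is dropped and the agent takes over again. `JitSkillCache`, `run_agent`, `generate_script`, and the fixed run-count `threshold` are all hypothetical; in particular, the threshold is a crude stand-in for the agent deciding by itself that a workflow is worth scripting.

```python
import hashlib
from typing import Callable, Dict

Script = Callable[[dict], str]


class JitSkillCache:
    """Run the slow agent first, then reuse a generated script on later runs."""

    def __init__(self,
                 run_agent: Callable[[str, dict], str],
                 generate_script: Callable[[str], Script],
                 threshold: int = 1) -> None:
        self.run_agent = run_agent              # slow path: model-driven execution
        self.generate_script = generate_script  # assumed: agent writes a script
        self.threshold = threshold              # agent runs before "compiling"
        self._scripts: Dict[str, Script] = {}
        self._runs: Dict[str, int] = {}

    @staticmethod
    def _key(instruction: str) -> str:
        # A changed instruction gets a new key, so stale scripts are never reused.
        return hashlib.sha256(instruction.encode("utf-8")).hexdigest()

    def run(self, instruction: str, payload: dict) -> str:
        key = self._key(instruction)
        script = self._scripts.get(key)
        if script is not None:
            try:
                return script(payload)          # fast path: no model involved
            except Exception:
                # The script broke (e.g. the site changed its UI): discard it.
                del self._scripts[key]
        result = self.run_agent(instruction, payload)
        self._runs[key] = self._runs.get(key, 0) + 1
        if self._runs[key] >= self.threshold:
            self._scripts[key] = self.generate_script(instruction)
        return result
```

The instruction text stays the single source of truth that non-developers can read and edit; the scripts are disposable cache entries, which is why nothing here needs to be kept under version control.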

This can save a lot of execution time and API cost, and it can be applied beyond browser automation, to computer use or any other agentic task, as long as it's possible to write automation scripts for the parts that don't require thinking.