Thanks for building this!

It struggled with tasks I asked for (e.g. download the March and April invoices for my GitHub org "myorg") -- it got errors parsing the DOM and eventually gave up. I recommend taking a look at the browser-use approach and specifically their buildDOMTree.js script. Their strategy for turning the DOM into an LLM parsable list of interactive elements, and visually tagging them for vision models, is unreasonably effective. I don't know if they were the first to come up with it, but it's genius and extracting it for my browser-using agents has hugely increased their effectiveness.