We maintain a persistent browser state that gets fed into the system prompt: the most recent results, the current page, where the agent is in the content, what actions are available, and so on. It renders as markdown by default but can switch to HTML when needed (for pagination or CSS selectors). The agent always has full context of its browsing session.
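To make that concrete, here's a minimal sketch of the kind of state object we render into the prompt each turn; the field and method names are illustrative, not our actual code:

```python
from dataclasses import dataclass, field

@dataclass
class BrowserState:
    """Illustrative browser state, re-rendered into the system prompt each turn."""
    url: str = ""
    page_markdown: str = ""          # current page, converted to markdown
    chunk_index: int = 0             # which chunk of the page is in view
    chunk_count: int = 1
    last_results: list[str] = field(default_factory=list)
    available_actions: list[str] = field(default_factory=list)

    def render(self) -> str:
        # Markdown by default; an HTML mode would expose raw markup instead
        # when the agent needs pagination links or CSS selectors.
        return "\n".join([
            f"## Current page: {self.url} (chunk {self.chunk_index + 1}/{self.chunk_count})",
            self.page_markdown,
            "## Recent results",
            *[f"- {r}" for r in self.last_results],
            "## Available actions",
            *[f"- {a}" for a in self.available_actions],
        ])
```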
A few of the design decisions we made turned out to be pretty interesting:
1. We gave it an analyze-results function. When the agent is on a search results page, instead of visiting each result one by one, it can just ask "What are the pricing models?" and get answers from all of the search results in parallel (first sketch below).
2. Long web pages get broken into chunks with navigation hints ("continue reading", "jump to middle", etc.), so the agent always knows where it is and can jump around without overloading its context (second sketch below).
3. For commonly visited sites with messy layouts or spread-out information, we built custom tool calls that let the agent request specific info scattered across different pages and get it back consolidated into one clean text response (third sketch below).
4. We're adding DOM interaction via text in the next couple of days, so the agent can click buttons, fill forms, and send keystrokes, but everything still comes back as structured text instead of screenshots (last sketch below).
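Rough sketch of the fan-out behind the analyze-results function in item 1, assuming a hypothetical `answer_from_page` helper (stubbed here) that runs the question against a single fetched result:

```python
import asyncio

async def answer_from_page(url: str, question: str) -> str:
    # Hypothetical helper, stubbed for the sketch; a real version would
    # fetch the page and run the question against its extracted text.
    return f"(answer to {question!r} from {url})"

async def analyze_results(result_urls: list[str], question: str) -> dict[str, str]:
    # Ask the same question of every search result concurrently instead of
    # having the agent walk the results one page at a time.
    answers = await asyncio.gather(
        *(answer_from_page(url, question) for url in result_urls)
    )
    return dict(zip(result_urls, answers))
```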
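The chunking in item 2 works roughly like this; the chunk size and hint strings are invented for the example:

```python
CHUNK_CHARS = 4000  # illustrative per-chunk budget

def chunk_page(text: str) -> list[str]:
    return [text[i:i + CHUNK_CHARS] for i in range(0, len(text), CHUNK_CHARS)]

def view_chunk(chunks: list[str], i: int) -> str:
    # Wrap the chunk with position info and navigation hints so the agent
    # knows where it is and can jump without loading the whole page.
    hints = []
    if i + 1 < len(chunks):
        hints.append("continue reading")
    if i > 0:
        hints.append("go back")
    if len(chunks) > 2:
        hints += ["jump to middle", "jump to end"]
    header = f"[chunk {i + 1} of {len(chunks)}]"
    footer = "Actions: " + (", ".join(hints) if hints else "(whole page shown)")
    return "\n".join([header, chunks[i], footer])
```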
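Item 3 amounts to a per-site fan-out and merge. A sketch with invented URLs and section names; `fetch_as_text` is a stand-in for whatever fetches and cleans a page:

```python
import asyncio

async def fetch_as_text(url: str) -> str:
    # Stand-in; a real version would fetch the page and strip it to clean text.
    return f"(text of {url})"

async def get_product_overview(base_url: str) -> str:
    # Hypothetical site-specific tool: pull info the site scatters across
    # several pages and hand the agent one consolidated text response.
    sections = {
        "Pricing": f"{base_url}/pricing",
        "Features": f"{base_url}/features",
        "Docs": f"{base_url}/docs/getting-started",
    }
    parts = await asyncio.gather(*(fetch_as_text(u) for u in sections.values()))
    return "\n\n".join(f"## {name}\n{text}" for name, text in zip(sections, parts))
```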
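And the DOM-via-text idea in item 4: commands go in as text, and the result comes back as updated page text rather than a screenshot. A sketch assuming Playwright underneath (the command grammar is invented; we haven't committed to an exact API):

```python
from playwright.sync_api import Page, sync_playwright

def act(page: Page, command: str) -> str:
    # Invented command grammar: "click <selector>", "fill <selector> <text>",
    # "press <key>". The agent never sees pixels, only the resulting text.
    verb, _, rest = command.partition(" ")
    if verb == "click":
        page.click(rest)
    elif verb == "fill":
        selector, _, text = rest.partition(" ")
        page.fill(selector, text)
    elif verb == "press":
        page.keyboard.press(rest)
    return page.inner_text("body")

with sync_playwright() as p:
    page = p.chromium.launch().new_page()
    page.goto("https://example.com")
    print(act(page, "click text=More information"))
```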
Thanks. If I am interpreting this correctly, what you have is not a browser but a translation layer. You are still using something that scrapes the data, and then you translate it into the format that works best for your agent.
My original interpretation was that you had built a full-blown browser, something akin to a Chromium/Firefox fork.
Yes, translation layer would probably be better terminology.