Hacker News

vetler 2 days ago

My instinct was also to use LLMs for this, but it was way too slow, and still expensive if you want to scrape millions of pages.

andrew_zhong 2 days ago

To put things in perspective: Gemini 2.5 Flash is $0.30 per 1M tokens. Assuming each page is 700 tokens and the output is small, you are looking at about $210 for 1M pages.
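
A quick way to check that arithmetic, as a minimal sketch: it counts input tokens only and ignores output tokens, as the estimate above does. The function name `scrape_cost_usd` and the 5,000-token figure are hypothetical, included only to show how the total scales with page size.

```python
def scrape_cost_usd(pages: int, tokens_per_page: int, usd_per_million_tokens: float) -> float:
    """Estimate input-token cost in USD; output tokens are deliberately ignored."""
    total_tokens = pages * tokens_per_page
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Figures from the comment: 1M pages, 700 tokens each, $0.30 per 1M tokens.
baseline = scrape_cost_usd(1_000_000, 700, 0.30)
print(f"700 tokens/page:   ${baseline:,.2f}")

# Hypothetical larger page size, to show the cost grows linearly with tokens per page.
larger = scrape_cost_usd(1_000_000, 5_000, 0.30)
print(f"5,000 tokens/page: ${larger:,.2f}")
```

The estimate is linear in both page count and tokens per page, so the 700-token assumption drives the result.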

vetler 2 days ago

You will absolutely struggle to get all the info you need into 700 tokens per page.

Edit: There's also the added complexity of running a browser against 1M pages, or more.

andrew_zhong 2 days ago

I agree that when pages share a similar structure and you need a one-time, as-is extraction (no reasoning from context), scraping with selectors is the way to go.

This library also supports HTML as input, so running a browser is not required.