The page requires JS to load its content - user agents without JS support just get a blank page.
I'm not sure if the bots that scrape data to train LLMs are capable of loading that type of page, or if they only work on pages that have the content inside the HTML itself?
Any serious scraping service these days will fail over to a headless browser when it fetches a page referencing a JS bundle that isn't verifiably a vendor script.
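For anyone curious what that check might look like, here is a minimal sketch, not how any particular scraper actually does it. The `VENDOR_HOSTS` allowlist and the `needs_headless` name are made up for illustration, and the headless rendering step itself is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical allowlist of hosts treated as "verifiably vendor" scripts.
# A real service would presumably maintain a much larger list.
VENDOR_HOSTS = {
    "cdn.jsdelivr.net",
    "cdnjs.cloudflare.com",
    "www.googletagmanager.com",
}


class ScriptCollector(HTMLParser):
    """Collects the src attribute of every <script src=...> tag."""

    def __init__(self):
        super().__init__()
        self.script_srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.script_srcs.append(src)


def is_vendor(src):
    # Relative URLs have no host, so they count as first-party bundles.
    return urlparse(src).netloc.lower() in VENDOR_HOSTS


def needs_headless(html):
    """Return True if the page references any script that is not on the
    vendor allowlist, i.e. the crawler should retry with a headless browser."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    return any(not is_vendor(src) for src in collector.script_srcs)
```

A crawler would call `needs_headless` on the raw HTML it fetched and only pay for a browser render when it returns `True`.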
Not using JavaScript would also make the crawler fail on sites built with website builders like Squarespace and Wix.
The age where the web was usable at all without JavaScript is long gone. No scraper would get much scraping done without JavaScript these days.
I'm aware and will implement SSR soon ;)
It's entirely possible they simply ingest the JS as-is.