When I used to crawl the web, battle-tested Perl regexes were more reliable than anything else; even commented-out URLs would get added to my queue.
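A minimal sketch of that crawler-style extraction, assuming a loose `href` pattern (the pattern and sample page are illustrative, not from the thread). Because it scans raw bytes rather than a parse tree, it also picks up links inside HTML comments:

```python
import re

# Deliberately loose: match href attributes anywhere in the raw HTML,
# including inside <!-- comments -->, which a DOM parser would skip.
HREF_RE = re.compile(r'href\s*=\s*["\']?(https?://[^"\'\s>]+)', re.IGNORECASE)

def extract_urls(html: str) -> list[str]:
    return HREF_RE.findall(html)

page = '''
<a href="https://example.com/a">a</a>
<!-- <a href="https://example.com/hidden">commented out</a> -->
<A HREF='https://example.com/b'>b</A>
'''
print(extract_urls(page))
# → ['https://example.com/a', 'https://example.com/hidden', 'https://example.com/b']
```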
DOM navigation for fetching some data is for tryhards. Using a regex to grab the correct paragraph or div is fine, and it's more robust against things moving around on the page.
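The regex-grab approach might look like this sketch; the class name "price" and the sample markup are made up for illustration:

```python
import re

# Anchor directly on the attribute of the element we want,
# ignoring everything else on the page.
PRICE_RE = re.compile(r'<div class="price">([^<]+)</div>')

html = '<html><body><div class="price">$19.99</div></body></html>'
m = PRICE_RE.search(html)
print(m.group(1) if m else None)
# → $19.99
```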
Doing both is fine! Just, once you've figured out your regex and such, hardening/generalizing demands DOM iteration. It sucks but it is what it is.
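The DOM-iteration version of the same grab could be sketched with the stdlib parser; the class name "price" and the markup are hypothetical examples:

```python
from html.parser import HTMLParser

# Walk the parse tree and collect text from the element whose class
# matches, instead of anchoring a regex on the surrounding markup.
class ClassTextExtractor(HTMLParser):
    def __init__(self, cls):
        super().__init__()
        self.cls = cls
        self.depth = 0      # > 0 while inside a matching element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        if self.depth or self.cls in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data.strip())

page = '<div><div class="price">$19.99</div><div class="name">Widget</div></div>'
p = ClassTextExtractor("price")
p.feed(page)
print(" ".join(c for c in p.chunks if c))
# → $19.99
```

This survives attribute reordering, whitespace changes, and extra wrapper elements that would quietly break the regex version.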
But not when crawling. You don't know the page format in advance; you don't even know what the page contains!