When I used to crawl the web, battle-tested Perl regexes were more reliable than anything else; even commented-out URLs would get added to my queue.
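A minimal sketch of that crawler-style extraction, assuming a loose `href` pattern (the pattern and sample page are illustrative, not from the thread). Because it scans raw bytes rather than a parse tree, it also picks up links inside HTML comments:

```python
import re

# Deliberately loose: match href attributes anywhere in the raw HTML,
# including inside <!-- comments -->, which a DOM parser would skip.
HREF_RE = re.compile(r'href\s*=\s*["\']?(https?://[^"\'\s>]+)', re.IGNORECASE)

def extract_urls(html: str) -> list[str]:
    return HREF_RE.findall(html)

page = '''
<a href="https://example.com/a">a</a>
<!-- <a href="https://example.com/hidden">commented out</a> -->
<A HREF='https://example.com/b'>b</A>
'''
print(extract_urls(page))
# → ['https://example.com/a', 'https://example.com/hidden', 'https://example.com/b']
```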
DOM navigation for fetching some data is for tryhards. Using a regex to grab the correct paragraph or div is fine, and it's more robust against things moving around on the page.
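The regex-grab approach might look like this sketch; the class name "price" and the sample markup are made up for illustration:

```python
import re

# Anchor directly on the attribute of the element we want,
# ignoring everything else on the page.
PRICE_RE = re.compile(r'<div class="price">([^<]+)</div>')

html = '<html><body><div class="price">$19.99</div></body></html>'
m = PRICE_RE.search(html)
print(m.group(1) if m else None)
# → $19.99
```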
Doing both is fine! Just, once you've figured out your regex and such, hardening/generalizing demands DOM iteration. It sucks but it is what it is.
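The DOM-iteration version of the same grab could be sketched with the stdlib parser; the class name "price" and the markup are hypothetical examples:

```python
from html.parser import HTMLParser

# Walk the parse tree and collect text from the element whose class
# matches, instead of anchoring a regex on the surrounding markup.
class ClassTextExtractor(HTMLParser):
    def __init__(self, cls):
        super().__init__()
        self.cls = cls
        self.depth = 0      # > 0 while inside a matching element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        if self.depth or self.cls in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data.strip())

page = '<div><div class="price">$19.99</div><div class="name">Widget</div></div>'
p = ClassTextExtractor("price")
p.feed(page)
print(" ".join(c for c in p.chunks if c))
# → $19.99
```

This survives attribute reordering, whitespace changes, and extra wrapper elements that would quietly break the regex version.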
But not when crawling. You don't know the page format in advance; you don't even know what the page contains!