We need to update robots.txt for the LLM world: help them find things more efficiently (or not at all, I guess), provide specs for actions that can be taken, etc.
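Something like the sketch below is what I have in mind. The User-agent tokens are ones the big AI crawlers actually publish (GPTBot, CCBot); the `LLM-Content` and `Action-Spec` lines are purely hypothetical extensions, not part of any current standard:

```
# Per-crawler allow/deny already works today
User-agent: GPTBot
Allow: /docs/
Disallow: /drafts/

User-agent: CCBot
Disallow: /

# Hypothetical extensions: point crawlers at canonical content
# and at a machine-readable spec of actions they may invoke
LLM-Content: https://example.com/llms.txt
Action-Spec: https://example.com/.well-known/actions.json
```

A crawler that actually read the file could skip most of the guesswork entirely.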
If current behaviour is anything to go by, they will ignore all such assistance and instead insist on crawling infinite variations of the same content reached via slightly different URL patterns, plus hallucinating endless variations of non-existent but plausible-looking URLs to hit as well, until the server burns down, all on the off-chance that they might see a new unique string of text which they can turn into a paperclip.
There's no LLM in the loop at all during the crawl, so any attempt to solve this by reasoning with an LLM misses the point. They're not even "ignoring" assistance, as the sibling comment supposes; there simply is no reasoning happening here.
This is what you should imagine when your site is being scraped:
Sure, but at some point the idea is to train an LLM on these downloaded files, no? I mean, what is the point of getting them if you don't use them? So sure, this won't be interpreted during the crawling, but it will become part of the LLM's knowledge.
You mean adding bad Monte-Carlo-generated slop pages that are advertised only as no-go areas in the robots.txt file, right?
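For concreteness, here is roughly what I mean, as a minimal sketch only (the /tarpit/ path, the seed corpus, and the server details are all hypothetical): a tiny first-order Markov chain serves plausible-looking filler pages under a path that robots.txt lists as Disallow, so only crawlers that ignore the file ever wander into it:

```python
import random
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical seed text; in practice you would feed it your own prose.
SEED = ("robots txt was meant to keep polite crawlers out of places "
        "they have no business visiting and polite crawlers still read it").split()

# Build a first-order Markov chain over adjacent words in the seed.
CHAIN = {}
for a, b in zip(SEED, SEED[1:]):
    CHAIN.setdefault(a, []).append(b)

def slop(n_words=200):
    """Generate plausible-looking filler via a Monte Carlo walk on the chain."""
    word = random.choice(SEED)
    out = [word]
    for _ in range(n_words - 1):
        word = random.choice(CHAIN.get(word, SEED))
        out.append(word)
    return " ".join(out)

class TarpitHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Everything served here sits under the path robots.txt declares as
        # Disallow; only crawlers that ignore robots.txt should ever request it.
        body = f"<html><body><p>{slop()}</p></body></html>".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), TarpitHandler).serve_forever()
```

With `Disallow: /tarpit/` in robots.txt, compliant crawlers stay out; the pages only waste the cycles (and pollute the training data) of the ones that never read the file.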