I also wonder: it's a normal scraper mechanism doing the scraping, right? Not necessarily an LLM in the first place, so the wholesale data-sucking isn't going to "read" the file even if it IS accessed?

Or is this file meant to be "read" by an LLM long after the entire site has been scraped?

Yes. It's a basic scraper that fetches the document, parses it for URLs with a regex, then fetches all of those, and repeats forever.

I've done honeypot tests with links in html comments, links in javascript comments, routes that only appear in robots.txt, etc. All of them get hit.
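
For a sense of how simple these traps are, here's a minimal sketch (Flask, with a made-up route name): the trap URL is published only inside an HTML comment, a JS comment, or a robots.txt line, so anything that requests it has given itself away.

    from flask import Flask, request

    app = Flask(__name__)

    # hypothetical trap route: its URL only ever appears inside an HTML
    # comment, a JS comment, or robots.txt, so no normal visitor loads it
    @app.route("/honeypot/<source>")
    def honeypot(source):
        app.logger.warning("trap hit: %s from %s", source, request.remote_addr)
        return "nothing here", 404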

What about scripted transformations? Or just adding a simple timestamp to the query and only allowing it to be used for up to a week afterwards? (Whether it works without the parameter could be tested too.)
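
Roughly like this, if the timestamp is signed so a scraper can't simply bump it (the parameter names and HMAC scheme here are my own sketch, not an established convention):

    import hashlib
    import hmac
    import time

    SECRET = b"rotate-me"            # hypothetical server-side signing key
    MAX_AGE = 7 * 24 * 3600          # the one-week window suggested above

    def sign_path(path):
        ts = str(int(time.time()))
        sig = hmac.new(SECRET, f"{path}|{ts}".encode(), hashlib.sha256).hexdigest()
        return f"{path}?ts={ts}&sig={sig}"

    def is_valid(path, ts, sig):
        expected = hmac.new(SECRET, f"{path}|{ts}".encode(), hashlib.sha256).hexdigest()
        # constant-time comparison, then the freshness check
        return hmac.compare_digest(expected, sig) and time.time() - int(ts) < MAX_AGE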

We need to update robots.txt for the LLM world, help them find things more efficiently (or not at all I guess). Provide specs for actions that can be taken. Etc.
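
The llms.txt proposal (llmstxt.org) is one attempt at exactly this: a markdown file at the site root pointing agents at the content worth reading. A minimal example, with made-up contents:

    # Example Site
    > One-paragraph summary of what this site offers.

    ## Docs
    - [Quickstart](https://example.com/docs/quickstart.md): setup in five minutes
    - [API reference](https://example.com/docs/api.md): every endpoint, with examples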

If current behaviour is anything to go by, they will ignore all such assistance and instead insist on crawling infinite variations of the same content accessed through slightly different URL patterns, plus hallucinating endless variations of non-existent but plausible-looking URLs to hit as well, until the server burns down - all on the off-chance that they might see a new unique string of text which they can turn into a paperclip.

There's no LLM in the loop at all, so any attempt to solve it by reasoning with an LLM is missing the point. They're not even "ignoring" assistance as sibling supposes. There simply is no reasoning here.

This is what you should imagine when your site is being scraped:

    import re
    import requests

    def crawl(url):
        # fetch the page, store the raw text, then recurse into every URL in it;
        # the regex matches URLs anywhere, HTML/JS comments and robots.txt included
        text = requests.get(url).text
        store(text)  # whatever persistence the scraper uses
        for link in re.findall(r'https?://[^\s<>"\']+', text):
            crawl(link)

Sure, but at some point the idea is to train an LLM on these downloaded files, no? I mean, what is the point of getting them if you don't use them? So sure, this won't be interpreted during the crawling, but it will become part of the knowledge of the LLM.

You mean adding bad Monte-Carlo-generated slop pages which are only advertised as no-go in the robots.txt file, right?
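
Something like this toy Markov-chain generator, say; the seed text and output length are arbitrary, and it's about as cheap as generated slop gets:

    import random

    def slop(words, length=200):
        # toy Markov chain: each next word is drawn from the words that
        # followed the current one somewhere in the seed text
        follows = {}
        for a, b in zip(words, words[1:]):
            follows.setdefault(a, []).append(b)
        word = random.choice(words)
        out = [word]
        for _ in range(length - 1):
            word = random.choice(follows.get(word, words))
            out.append(word)
        return " ".join(out)

    page = slop("the quick brown fox jumps over the lazy dog".split())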

I assume this might be changing. Anecdotally, from what I've read here, I think we're starting to see headless browsers driven by LLMs used for scraping (to get around some of the content blocks we're seeing). Perhaps this is a solution that won't work now but might in the future.
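
If so, the fetch side starts to look less like the requests loop above and more like a real rendered browser session; a sketch with Playwright as one possible driver, the LLM-steering part omitted:

    from playwright.sync_api import sync_playwright

    # render the page as a real browser would, so client-side blocks
    # and JS-injected content don't stop the scrape
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("https://example.com")
        text = page.inner_text("body")   # rendered text, not raw HTML
        browser.close()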

Absolutely.

I assume that there are data brokers, or AI companies themselves, constantly scraping the entire internet with non-AI crawlers and then processing the data in some way for use in training. But even with all that scraping, requests for llms.txt are so rare that it's hard to believe anyone actually uses it.

I think it depends. LLMs can now look things up on the fly to get around the whole "this model was last updated in December 2025" problem of dated information. I've literally told Claude to look something up after it accused me of making up fake news.