We need to update robots.txt for the LLM world: help them find things more efficiently (or not at all, I guess), provide specs for actions that can be taken, etc.
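Something like the sketch below is what I have in mind. The User-agent tokens are ones the big AI crawlers actually publish (GPTBot, CCBot); the `LLM-Content` and `Action-Spec` lines are purely hypothetical extensions, not part of any current standard:

```
# Per-crawler allow/deny already works today
User-agent: GPTBot
Allow: /docs/
Disallow: /drafts/

User-agent: CCBot
Disallow: /

# Hypothetical extensions: point crawlers at canonical content
# and at a machine-readable spec of actions they may invoke
LLM-Content: https://example.com/llms.txt
Action-Spec: https://example.com/.well-known/actions.json
```

A crawler that actually read the file could skip most of the guesswork entirely.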
If current behaviour is anything to go by, they will ignore all such assistance and instead insist on crawling infinite variations of the same content reached via slightly different URL patterns, plus hallucinating endless variations of non-existent but plausible-looking URLs to hit as well, until the server burns down, all on the off-chance that they might see a new unique string of text which they can turn into a paperclip.
There's no LLM in the loop at all during the crawl, so any attempt to solve this by reasoning with an LLM misses the point. They're not even "ignoring" assistance, as the sibling comment supposes; there simply is no reasoning happening here.
This is what you should imagine when your site is being scraped:
Sure, but at some point the idea is to train an LLM on these downloaded files, no? I mean, what is the point of getting them if you don't use them? So sure, this won't be interpreted during the crawling, but it will become part of the LLM's knowledge.
You mean adding bad Monte-Carlo-generated slop pages that are advertised only as no-go areas in the robots.txt file, right?
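For concreteness, here is roughly what I mean, as a minimal sketch only (the /tarpit/ path, the seed corpus, and the server details are all hypothetical): a tiny first-order Markov chain serves plausible-looking filler pages under a path that robots.txt lists as Disallow, so only crawlers that ignore the file ever wander into it:

```python
import random
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical seed text; in practice you would feed it your own prose.
SEED = ("robots txt was meant to keep polite crawlers out of places "
        "they have no business visiting and polite crawlers still read it").split()

# Build a first-order Markov chain over adjacent words in the seed.
CHAIN = {}
for a, b in zip(SEED, SEED[1:]):
    CHAIN.setdefault(a, []).append(b)

def slop(n_words=200):
    """Generate plausible-looking filler via a Monte Carlo walk on the chain."""
    word = random.choice(SEED)
    out = [word]
    for _ in range(n_words - 1):
        word = random.choice(CHAIN.get(word, SEED))
        out.append(word)
    return " ".join(out)

class TarpitHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Everything served here sits under the path robots.txt declares as
        # Disallow; only crawlers that ignore robots.txt should ever request it.
        body = f"<html><body><p>{slop()}</p></body></html>".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), TarpitHandler).serve_forever()
```

With `Disallow: /tarpit/` in robots.txt, compliant crawlers stay out; the pages only waste the cycles (and pollute the training data) of the ones that never read the file.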