I'm not sure why they don't just cache the websites and avoid going back for at least 24 hours, especially in the case of most sites. I swear it's like we're re-learning software engineering basics with LLMs / AI and it kills me.
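
A minimal sketch of what that kind of caching could look like, assuming a single-process fetcher; the fetch() helper, the in-memory dict, and the 24-hour TTL are all illustrative choices, not anyone's actual crawler:

    import time
    import urllib.request

    CACHE_TTL = 24 * 60 * 60      # seconds: re-fetch a given URL at most once a day
    _cache = {}                   # url -> (fetched_at, body)

    def fetch(url):
        now = time.time()
        hit = _cache.get(url)
        if hit and now - hit[0] < CACHE_TTL:
            return hit[1]         # cache hit: no request goes out to the site
        with urllib.request.urlopen(url) as resp:
            body = resp.read()
        _cache[url] = (now, body)
        return body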

Yeah, the landscape when there were many more search engines must have been exactly the same...

I think the eng teams behind those were just more competent / more frugal with their processing.

And since there wasn't any AWS equivalent, they had to be better citizens: their IP ranges were well known, so banning them was trivial for the crawled websites.

It's worth noting that search engines back then (and now? except the AI ones) generally followed robots.txt, which meant that if there were heavy areas of your site that you didn't want indexed, you could exclude them and let the crawlers follow only static pages. You could block off all of /cgi-bin/, for example, and then they would never hit your CGI scripts - useful if your guestbook software wrote out static files to be served.
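
As a sketch of how cheap honoring that is, Python's standard-library robots.txt parser can enforce the /cgi-bin/ rule from the example above (ExampleBot and the URLs are made up for illustration):

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    # Equivalent to the site serving a robots.txt containing:
    #   User-agent: *
    #   Disallow: /cgi-bin/
    rp.parse([
        "User-agent: *",
        "Disallow: /cgi-bin/",
    ])

    print(rp.can_fetch("ExampleBot", "https://example.com/cgi-bin/guestbook.cgi"))  # False
    print(rp.can_fetch("ExampleBot", "https://example.com/guestbook.html"))         # True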

The search engines were also limited in resources, so they were judicious about what they fetched, when, and how often; optimizing their own crawlers saved them money, and in return it saved the websites money too. Even with a hundred crawlers actively indexing your site, they weren't going to index it more than, say, once a day, and 100 requests in a day wasn't really that much, even back then.
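
For illustration, a per-host politeness check in that spirit could be as simple as the following; the once-a-day interval is just the figure from the comment, and may_fetch() with its in-memory dict is a hypothetical stand-in for a real crawler's bookkeeping:

    import time
    from urllib.parse import urlparse

    MIN_INTERVAL = 24 * 60 * 60   # hypothetical: revisit each host at most once a day
    _last_visit = {}              # hostname -> timestamp of last fetch

    def may_fetch(url):
        host = urlparse(url).hostname
        last = _last_visit.get(host)
        if last is not None and time.time() - last < MIN_INTERVAL:
            return False          # too soon: leave this host alone for now
        _last_visit[host] = time.time()
        return True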

Now, companies are pumping billions of dollars into AI; budgets are infinite, limits are bypassed, and norms are ignored. If a company thinks it can benefit from indexing your site 30 times a minute, it will; and even if it doesn't benefit, there's no reason to stop, because it doesn't cost them anything. They cannot risk being anything other than up-to-date: if users come asking about current events and why Space Force is moving to Alabama, and your AI doesn't know but someone else's does, then you're behind the times.

So in the interests of maximizing short-term profit above all else - which is the only thing AI companies are doing in any way, shape, or form - they may as well scrape every URL on your site once per second, because it doesn't cost them anything and they don't care if you go bankrupt and shut down.

Bandwidth cost more then, so the early search engines had an incentive not to massively increase their own costs, if nothing else.

The blekko search engine index was only 1 billion pages, compared to Common Crawl Foundation's crawl of 3 billion webpages per month.

This! Today I asked Claude Sonnet to read the Wikipedia article on “inference” and answer a few of my questions.

Sonnet responded: “Sorry, I have no access.” Then I asked it why and it was flummoxed and confused. I asked why Anthropic did not simply maintain mirrors of Wikipedia in XX different languages and run a cron job every week.

Still no cogent answer. Pathetic. Very much an Anthropic blindspot—to the point of being at least amoral and even immoral.

Do the big AI corporations that have profited greatly from the Wikimedia Foundation give anything back? Or are they just large internet bloodsuckers without ethics?

Dario and Sam et al.: Contribute to the welfare of your own blood donors.

> Sonnet responded: “Sorry, I have no access.” Then I asked it why and it was flummoxed and confused. I asked why Anthropic did not simply maintain mirrors of Wikipedia in XX different languages and run a cron job every week.

Even worse when you consider that you can download all of Wikipedia for offline use...

> Then I asked it why

I'm still learning the landscape of LLMs, but do we expect an LLM to be able to answer that? I didn't think they had meta information about their own operation.

your understanding is correct.

you can even torrent all of wikipedia, and a whole bunch of other wikis.

Would be great if they did that and maybe seeded it too.

Once the crawler goes up, who cares what it brings down?

That's not my department! says Crawler von Braun

That's gold; I just stumbled on the original a week ago.

It's because they don't give a shit whether the product works properly or not. By blocking AI scraping, sites are forcing AI companies to scrape faster before they're blocked. And faster means sloppier.

There’s also the point that if the website is down after you’ve scraped it, then that’s one more site’s data you’ve scraped that your competition now can’t.

I guess they prefer paying for bandwidth rather than storage

The people at the forefront of creating the shortcut machine are taking shortcuts. We're on a slow march towards the death of attention to detail.

Slow march? It feels like we've been on that train a while, honestly. It's embarrassing. We don't even have fully native GUIs; they're all browser wrappers.

Who says they don't?

imo when it kills somebody, that justifies extreme means, such as feeding them fabricated truths: LLM-generated and artificially corrupted text /s