I own a forum which currently has 23k online users, all of them bots. The last new post in that forum is from _2019_. Its topic is also very niche. Why are so many bots there? This site should have basically been scraped a million times by now, yet those bots seem to fetch the stuff live, on the fly? I don’t get it.
I have a site with a complete and accurate sitemap.xml describing when its ~6k pages are last updated (on average, maybe weekly or monthly). What do the bots do? They scrape every page continuously 24/7, because of course they do. The amount of waste going into this AI craze is just obscene. It's not even good content.
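For what it's worth, honoring a sitemap is not much work for a crawler. Here is a minimal sketch of how one could use the `<lastmod>` values to skip pages that have not changed since its previous visit. The function name, the sample URLs, and the sample dates are made up for illustration, and it assumes `<lastmod>` is in an ISO 8601 form that Python's `datetime.fromisoformat` accepts.

```python
from datetime import datetime, timezone
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap used only for this example.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/old</loc><lastmod>2019-06-01</lastmod></url>
  <url><loc>https://example.com/new</loc><lastmod>2024-03-10T12:00:00+00:00</lastmod></url>
</urlset>"""


def urls_changed_since(sitemap_xml: str, last_crawl: datetime) -> list[str]:
    """Return the URLs whose <lastmod> is newer than last_crawl.

    last_crawl must be timezone-aware. URLs without a <lastmod> are
    returned too, because nothing is known about their freshness.
    """
    root = ET.fromstring(sitemap_xml)
    changed = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = url.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        if loc is None:
            continue
        if lastmod is None:
            changed.append(loc.strip())
            continue
        modified = datetime.fromisoformat(lastmod.strip())
        if modified.tzinfo is None:
            # Assumption: treat dates without an offset as UTC.
            modified = modified.replace(tzinfo=timezone.utc)
        if modified > last_crawl:
            changed.append(loc.strip())
    return changed


last_visit = datetime(2024, 1, 1, tzinfo=timezone.utc)
to_fetch = urls_changed_since(SAMPLE_SITEMAP, last_visit)
```

With a sitemap like the one described above, a crawler doing this would re-fetch a handful of pages per week instead of all ~6k continuously.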
It would be interesting if someone made a map that depicts the locations of the ip addresses that are sending so many requests, over the course of a day maybe.
https://news.ycombinator.com/item?id=46241849
Maps That Are Just Datacenters
If you are in the US, have you considered suing them for robots.txt / copyright violation? AI companies are currently flush with cash from VCs and there may be a few big law firms willing to fight a lawsuit against them on your behalf. AI companies have already lost some copyright cases.
Based upon traffic you could tell whether an IP or request structure is coming from a bot, but how would you reliably tell which company is DDoSing you?
It should be at least theoretically possible: each IP address is assigned to an organisation running the IP routing prefix, and you can look that up easily, and they should have some sort of abuse channel, or at the very least a legal system should be able to compel them to cooperate and give up the information they’re required to have.
Large scale scraping tech is not as sophisticated as you'd think. A significant chunk of it is "get as much as possible, categorize and clean up later". Man, I really want the real web of the 2000s back, when things felt "real" more or less... how can we even get there?
Have you ever listened to the 'high water mark' monologue from Fear and Loathing? It's pretty much just that. It was a unique time and it was neat that we got to see it, but it can't possibly happen again.
https://www.youtube.com/watch?v=vUgs2O7Okqc
Thanks for reminding me about that, what a great monologue. I didn't really understand it when I was younger, but now I feel the same thing with regards to software engineering. There was a golden age which finally broke at the end of the 2010s.
A curated web directory. Kind of like Yahoo had. The internet according to the Dewey system with pages somehow rated for quality by actual humans (maybe something to learn from Wikipedia's approach here?)
If people start making search engines again and there is more competition for Google, I think things would be pretty sweet.
Because of the financial incentives, it would still end up with people doing things to drive traffic to their website though, no? Maybe because the web was smaller, and people looked at it as a means "to explore curiosity", in the olden days it kinda worked differently... maybe I just got old, but I don't want to believe that.
By “doing things to drive traffic to their website” do you mean trying to do SEO type things to manipulate search engine rankings? If so, I think that there are probably ways to rank that are immune to tampering.
Don’t worry, you’re not just old. The internet kind of sucks now.
Google was neat in that you didn't see the content keyword spam either on the websites or the portal home pages. The Web was already full of shit (first ad banner was 1994? By 1999 you already had punch the monkey as classy content), but it was more ... organic and you could easily skip it.
There are other search engines, they've just been marginalised. Even something as mainstream as Bing has been pushed to the side.
It's a few orders of magnitude harder given the amount of SEO spam prevalent, and that's just gonna get worse with AI.
I would understand that, but it seems they don’t store the stuff and instead re-fetch the same content every hour.
I'm assuming a quick hash check to see if there's any change? Between scrapers "most up to date data" is fairly valuable nowadays as well.
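A minimal sketch of what such a hash check could look like on the scraper side, assuming the scraper keeps one SHA-256 digest per URL. The class and method names are hypothetical, and nothing here is known about how any real scraper works.

```python
import hashlib


def content_fingerprint(body: bytes) -> str:
    """Return a hex SHA-256 digest of the response body."""
    return hashlib.sha256(body).hexdigest()


class ChangeTracker:
    """Remembers the last digest seen per URL (hypothetical helper)."""

    def __init__(self) -> None:
        self._seen: dict[str, str] = {}

    def has_changed(self, url: str, body: bytes) -> bool:
        """True if the body differs from the last one recorded for url."""
        digest = content_fingerprint(body)
        changed = self._seen.get(url) != digest
        self._seen[url] = digest
        return changed


tracker = ChangeTracker()
first = tracker.has_changed("https://example.com/thread/1", b"<html>post</html>")
second = tracker.has_changed("https://example.com/thread/1", b"<html>post</html>")
```

Note that a check like this only saves the scraper storage and processing. The full page still has to be downloaded to be hashed, so the server pays the same cost either way. HTTP conditional requests (`If-None-Match` / `If-Modified-Since`) are the standard way to avoid the transfer when nothing changed.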
The bots are exposing themselves as Google, Bing and Yandex. I can’t verify whether it’s being attributed by IP address or whether the forum trusts their user agent. It could basically be anyone.
Interesting. When it was just normal search engines I didn’t hear of people having this problem, so this either means that there are a bunch of people pretending to be Bing, Google, and Yandex, or those companies have gotten a lot more aggressive.
There are lots of people pretending to be Google and friends. They far outnumber the real Googlebot, etc. and most people don't check the reverse DNS/IP list - it's tedious to do this for even well-behaved crawlers that publish how to ID themselves. So much for User Agent.
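For reference, the reverse DNS check mentioned above is usually described as: resolve the IP to a hostname, check that the hostname belongs to the crawler operator's domain, then resolve that hostname forward and confirm it maps back to the same IP. A rough sketch follows. The domain suffixes are an assumption based on what the operators have documented publicly and should be checked against their current docs. The function name and the injectable lookup parameters are made up for this example, and `socket.gethostbyname_ex` only covers IPv4.

```python
import socket

# Assumption: suffixes per operator docs; verify before relying on them.
CRAWLER_DOMAINS = {
    "googlebot": (".googlebot.com", ".google.com"),
    "bingbot": (".search.msn.com",),
    "yandex": (".yandex.ru", ".yandex.net", ".yandex.com"),
}


def is_verified_crawler(
    ip: str,
    claimed: str,
    reverse_lookup=socket.gethostbyaddr,
    forward_lookup=socket.gethostbyname_ex,
) -> bool:
    """Forward-confirmed reverse DNS check for a claimed crawler identity."""
    suffixes = CRAWLER_DOMAINS.get(claimed)
    if not suffixes:
        return False
    try:
        # Step 1: IP -> hostname (PTR record).
        hostname = reverse_lookup(ip)[0]
        # Step 2: hostname must sit under the operator's domain.
        if not hostname.lower().endswith(suffixes):
            return False
        # Step 3: hostname -> IPs, which must include the original IP.
        forward_ips = forward_lookup(hostname)[2]
    except OSError:
        # Lookup failures (socket.herror, socket.gaierror) count as unverified.
        return False
    return ip in forward_ips
```

Called as `is_verified_crawler(remote_ip, "googlebot")` it performs two DNS lookups per check, so in practice the result would need to be cached per IP, which is part of why it is tedious.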
> So much for User Agent.
User agent has been abused for so long, I forget a time when it wasn't.
Anyone else remember having to fake being a Windows machine so that YouTube/Netflix would serve you content better than standard def, or banking portals that blocked you if your agent didn't say you were Internet Explorer?
I mean forget that, all modern desktop browsers (at least) start with the string 'Mozilla/5.0', still, in a world where Chrome is so dominant.
What are the proportions for the attributions? Is it equally distributed or lopsided towards one of the three?
Normal search engine spiders did/do cause problems but not on the scale of AI scrapers. Search engine spiders tend to follow a robots.txt, look at the sitemap.xml, and generally try to throttle requests. You'll find some that are poorly behaved but they tend to get blocked and either die out or get fixed and behave better.
The AI scrapers are atrocious. They just blindly blast every URL on a site with no throttling. They are terribly written and managed as the same scraper will hit the same site multiple times a day or even hour. They also don't pay any attention to context so they'll happily blast git repo hosts and hit expensive endpoints.
They're like a constant DoS attack. They're hard to block at the network level because they span across different hyperscalers' IP blocks.
Puts on tinfoil hat: Maybe it isn’t AI scrapers, but actually is a massive DoS attack, and it’s a conspiracy to get people to not self-host.
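To illustrate what "follow a robots.txt and throttle requests" means in practice, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleBot` user agent, the URLs, and the helper names are all made up for the example. `Crawl-delay` is a non-standard directive that not every crawler honors.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a small forum.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Crawl-delay: 10
"""


def build_policy(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a policy object."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def fetch_plan(parser, user_agent, urls, default_delay=1.0):
    """Return (allowed URLs, seconds to wait between requests)."""
    delay = parser.crawl_delay(user_agent)
    allowed = [u for u in urls if parser.can_fetch(user_agent, u)]
    wait = float(delay) if delay is not None else default_delay
    return allowed, wait


policy = build_policy(ROBOTS_TXT)
allowed, wait_seconds = fetch_plan(
    policy,
    "ExampleBot",
    ["https://example.com/thread/1", "https://example.com/search?q=bots"],
)
```

A crawler that sleeps for `wait_seconds` between requests and skips the expensive search endpoint is the well-behaved case described above. The scrapers being complained about do neither.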
How do you define a user, and how do you define online?
If the forum considers unique cookies to be a user and creates a new cookie for any new cookie-less request, and if it considers a user to be online for 1 hour after their last request, then actually this may be one scraper making ~6 requests per second. That may be a pain in its own way, but it's far from 23k online bots.
That's still 518,400 requests per day. For static content. And it's a niche forum, so it's not exactly going to have millions of pages.
Either there are indeed hundreds or thousands of AI bots DDoSing the entire internet, or a couple of bots are needlessly hammering it over and over and over again. I'm not sure which option is worse.
Imagine if all this scraping was going into a search engine with a massive index, or a bunch of smaller search engines that a meta-search engine could be made for. This’d be a lot more cool in that case.
AFAIK it keeps a user counted as online for 5 or 15 minutes (I think 5). It’s a Woltlab Burning Board.
Edit: it’s 15 minutes.
And what is a "user"?
Whatever the forum software Woltlab Burning Board considers a user. If I recall correctly, it tries to build an identifier based on PHP session ids, so most likely simply cookies.
This is exactly my point. Scrapers typically don't store cookies, so every single request is likely to be a "new" user as far as the forum software is concerned.
Couple that with 15 minute session times, and that could just be one entity scraping the forum at roughly 25 requests per second. One scraper going moderately fast sounds far less bad than 23,000 bots.
It still sounds excessive for a niche site, but I'd guess this is sporadic, or that the forum software has a page structure that traps scrapers accidentally, quite easy to do.
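The back-of-the-envelope estimates in this subthread can be reproduced in a few lines. This assumes, as argued above, that every cookie-less request is counted as a new "online user" for the length of the session window, and it uses the 23k figure from the top of the thread. The function names are made up for illustration.

```python
SECONDS_PER_DAY = 86_400


def implied_request_rate(online_count: int, session_window_seconds: int) -> float:
    """Requests per second needed to keep online_count sessions alive,
    if each request creates one session lasting session_window_seconds."""
    return online_count / session_window_seconds


def requests_per_day(rate_per_second: float) -> float:
    """Scale a per-second rate to a full day."""
    return rate_per_second * SECONDS_PER_DAY


one_hour_rate = implied_request_rate(23_000, 60 * 60)  # 1 hour window
quarter_hour_rate = implied_request_rate(23_000, 15 * 60)  # 15 minute window
daily_at_six_per_second = requests_per_day(6)
```

With a 1 hour window this gives about 6.4 requests per second, with a 15 minute window about 25.6, and 6 requests per second sustained for a day comes to 518,400 requests.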
Why pay for storage when you do it for them?
I'd love to know the answer to this question. AI scrapers wanting everything on the internet makes sense to me. But I don't understand how that leads to every site being hit hundreds of thousands of times per day.
Why do you keep it operating? Is it the aquarium value?
When you have trillions of dollars being poured into your company by the financial system, and when furthermore there are no repercussions for behaving however you please, you tend not to care about that sort of "waste".
Sure you do by now. You are the hard drive.
Are you sure the counter is not broken?
Yes, it’s running on a Woltlab Burning Board since forever.