Hacker News

thethingundone 4 days ago [ - ]

I own a forum which currently has 23k online users, all of them bots. The last new post in that forum is from _2019_. Its topic is also very niche. Why are so many bots there? This site should have basically been scraped a million times by now, yet those bots seem to fetch the stuff live, on the fly? I don’t get it.

sethops1 4 days ago [ - ]

I have a site with a complete and accurate sitemap.xml describing when its ~6k pages are last updated (on average, maybe weekly or monthly). What do the bots do? They scrape every page continuously 24/7, because of course they do. The amount of waste going into this AI craze is just obscene. It's not even good content.
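For illustration, a rough sketch of what a crawler that respected the sitemap could do: read each `<lastmod>` and refetch only pages that changed since its last visit. The function names, the sample sitemap, and the `last_crawl` bookkeeping are all made up for this example; it is not how any particular scraper works.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_lastmod(text):
    """Parse a full date or datetime; naive values are treated as UTC.

    Assumption: shorter W3C forms such as "2024-01" are not handled here.
    """
    dt = datetime.fromisoformat(text.strip().replace("Z", "+00:00"))
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt

def urls_to_refetch(sitemap_xml, last_crawl):
    """Return URLs whose <lastmod> is newer than last_crawl, or missing."""
    root = ET.fromstring(sitemap_xml)
    stale = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = url.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        if loc is None:
            continue
        # No lastmod means we cannot tell, so refetch to be safe.
        if lastmod is None or parse_lastmod(lastmod) > last_crawl:
            stale.append(loc.strip())
    return stale

# Hypothetical sitemap with three pages.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc><lastmod>2024-01-10</lastmod></url>
  <url><loc>https://example.com/b</loc><lastmod>2024-03-01T12:00:00Z</lastmod></url>
  <url><loc>https://example.com/c</loc></url>
</urlset>"""

last_crawl = datetime(2024, 2, 1, tzinfo=timezone.utc)
print(urls_to_refetch(SAMPLE, last_crawl))
```

With a last crawl on 2024-02-01, only `/b` (modified later) and `/c` (no `lastmod`) are refetched; `/a` is skipped.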

n1xis10t 4 days ago [ - ]

It would be interesting if someone made a map that depicts the locations of the ip addresses that are sending so many requests, over the course of a day maybe.

GoblinSlayer 2 days ago [ - ]

https://news.ycombinator.com/item?id=46241849

giantrobot 4 days ago [ - ]

Maps That Are Just Datacenters

thisislife2 4 days ago [ - ]

If you are in the US, have you considered suing them for robots.txt / copyright violation? AI companies are currently flush with cash from VCs and there may be a few big law firms willing to fight a lawsuit against them on your behalf. AI companies have already lost some copyright cases.

happymellon 4 days ago [ - ]

Based upon traffic you could tell whether an IP or request structure is coming from a bot, but how would you reliably tell which company is DDoSing you?

chrismorgan 4 days ago [ - ]

It should be at least theoretically possible: each IP address is assigned to an organisation running the IP routing prefix, and you can look that up easily, and they should have some sort of abuse channel, or at the very least a legal system should be able to compel them to cooperate and give up the information they’re required to have.
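As a sketch of the lookup step: in practice you would query WHOIS/RDAP for the owning organisation, but the idea reduces to a longest-prefix match of the address against known allocations. The table below is entirely hypothetical (documentation-only address ranges, invented organisation names), and `lookup_owner` is a name made up for this example.

```python
import ipaddress

# Hypothetical prefix-to-organisation table. A real one would come from
# WHOIS/RDAP data; these ranges are reserved for documentation.
PREFIX_OWNERS = {
    "203.0.113.0/24": "ExampleCloud (abuse@examplecloud.invalid)",
    "203.0.113.128/25": "ExampleCloud customer: ScraperCo",
    "198.51.100.0/24": "Example ISP",
}

def lookup_owner(ip, table=PREFIX_OWNERS):
    """Return the owner of the most specific prefix containing ip, or None."""
    addr = ipaddress.ip_address(ip)
    best = None
    for prefix, owner in table.items():
        net = ipaddress.ip_network(prefix)
        if addr.version == net.version and addr in net:
            # Prefer the longer (more specific) prefix.
            if best is None or net.prefixlen > best[0].prefixlen:
                best = (net, owner)
    return best[1] if best else None

print(lookup_owner("203.0.113.200"))  # falls in the more specific /25
print(lookup_owner("192.0.2.1"))      # not in the table
```

This only tells you who the address space is assigned to, which for a cloud provider is usually the hoster, not the customer running the scraper.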

tokioyoyo 4 days ago [ - ]

Large scale scraping tech is not as sophisticated as you'd think. A significant chunk of it is "get as much as possible, categorize and clean up later". Man, I really want the real web of the 2000s back, when things felt "real" more or less... how can we even get there?

idiotsecant 4 days ago [ - ]

Have you ever listened to the 'high water mark' monologue from Fear and Loathing? It's pretty much just that. It was a unique time and it was neat that we got to see it, but it can't possibly happen again.

https://www.youtube.com/watch?v=vUgs2O7Okqc

symbogra 4 days ago [ - ]

Thanks for reminding me about that, what a great monologue. I didn't really understand it when I was younger, but now I feel the same thing with regard to software engineering. There was a golden age which finally broke at the end of the 2010s.

tmnvix 3 days ago [ - ]

A curated web directory. Kind of like Yahoo had. The internet according to the Dewey system, with pages somehow rated for quality by actual humans (maybe something to learn from Wikipedia's approach here?).

n1xis10t 4 days ago [ - ]

If people start making search engines again and there is more competition for Google, I think things would be pretty sweet.

tokioyoyo 4 days ago [ - ]

Because of the financial incentives, it would still end up with people doing things to drive traffic to their website though, no? Maybe because the web was smaller, and people looked at it as a means "to explore curiosity" in the olden days, it kinda worked differently... maybe I just got old, but I don't want to believe that.

n1xis10t 4 days ago [ - ]

By “doing things to drive traffic to their website” do you mean trying to do SEO type things to manipulate search engine rankings? If so, I think that there are probably ways to rank that are immune to tampering.

Don’t worry, you’re not just old. The internet kind of sucks now.

makapuf 4 days ago [ - ]

Google was neat in that you didn't see the content keyword spam either on the websites or the portal home pages. The Web was already full of shit (first ad banner was 1994? By 1999 you already had punch the monkey as classy content), but it was more ... organic and you could easily skip it.

nephihaha 3 days ago [ - ]

There are other search engines, they've just been marginalised. Even something as mainstream as Bing has been pushed to the side.

PunchyHamster 4 days ago [ - ]

It's a few orders of magnitude harder given the amount of SEO spam prevalent, and that's just going to get worse with AI.

thethingundone 4 days ago [ - ]

I would understand that, but it seems they don’t store the stuff but recollect the same content every hour.

tokioyoyo 4 days ago [ - ]

I'm assuming a quick hash check to see if there's any change? Between scrapers "most up to date data" is fairly valuable nowadays as well.
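A minimal sketch of that kind of hash check, assuming the scraper keeps a digest per URL; `content_digest`, `has_changed`, and the in-memory `seen` dict are names invented for this example, not anything a specific scraper is known to use.

```python
import hashlib

def content_digest(body: bytes) -> str:
    """SHA-256 of the raw response body."""
    return hashlib.sha256(body).hexdigest()

def has_changed(url: str, body: bytes, seen: dict) -> bool:
    """Compare the body's digest to the stored one, then update the store."""
    digest = content_digest(body)
    changed = seen.get(url) != digest
    seen[url] = digest
    return changed

seen = {}
print(has_changed("https://example.com/thread/1", b"<html>v1</html>", seen))
print(has_changed("https://example.com/thread/1", b"<html>v1</html>", seen))
print(has_changed("https://example.com/thread/1", b"<html>v2</html>", seen))
```

Note that this only saves the scraper storage and processing: the page still has to be downloaded in full to hash it, so the server sees the same load. HTTP conditional requests (`If-None-Match` / `If-Modified-Since`) are the mechanism that would actually spare the server.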

thethingundone 4 days ago [ - ]

The bots are exposing themselves as Google, Bing and Yandex. I can’t verify whether it’s being attributed by IP address or whether the forum trusts their user agent. It could basically be anyone.

n1xis10t 4 days ago [ - ]

Interesting. When it was just normal search engines I didn’t hear of people having this problem, so this either means that there are a bunch of people pretending to be Bing, Google and Yandex, or those companies have gotten a lot more aggressive.

bobbiechen 4 days ago [ - ]

There are lots of people pretending to be Google and friends. They far outnumber the real Googlebot, etc. and most people don't check the reverse DNS/IP list - it's tedious to do this for even well-behaved crawlers that publish how to ID themselves. So much for User Agent.
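For reference, the reverse DNS check is roughly: resolve the IP to a hostname, confirm the hostname is under the crawler's domain, then resolve that hostname forward and confirm it maps back to the same IP. A hedged sketch follows; the domain suffixes are the commonly documented ones but should be checked against each operator's current docs, and the function names and injectable resolver parameters are made up for this example.

```python
import socket

# Assumed suffixes; verify against each crawler operator's documentation.
CRAWLER_DOMAINS = {
    "googlebot": (".googlebot.com", ".google.com"),
    "bingbot": (".search.msn.com",),
    "yandex": (".yandex.ru", ".yandex.net", ".yandex.com"),
}

def reverse_lookup(ip):
    """PTR lookup: IP address to primary hostname."""
    return socket.gethostbyaddr(ip)[0]

def forward_lookup(host):
    """Forward lookup: hostname to the set of its addresses."""
    return {info[4][0] for info in socket.getaddrinfo(host, None)}

def is_verified_crawler(ip, claimed, reverse=reverse_lookup, forward=forward_lookup):
    """Forward-confirmed reverse DNS check for a claimed crawler identity."""
    suffixes = CRAWLER_DOMAINS.get(claimed)
    if suffixes is None:
        return False
    try:
        host = reverse(ip).rstrip(".").lower()
        if not host.endswith(suffixes):
            return False
        # The PTR record alone can be forged; the forward lookup must agree.
        return ip in forward(host)
    except OSError:  # covers socket.herror and socket.gaierror
        return False
```

Usage would be something like `is_verified_crawler(remote_ip, "googlebot")` for any request whose User-Agent claims to be Googlebot. Results should be cached, since two DNS lookups per request is exactly the tedium mentioned above.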

happymellon 4 days ago [ - ]

> So much for User Agent.

User agent has been abused for so long, I forget a time when it wasn't.

Anyone else remember having to fake being a Windows machine so that YouTube/Netflix would serve you content better than standard def, or banking portals that blocked you if your agent didn't say you were Internet Explorer?

wooger 4 days ago [ - ]

I mean forget that, all modern desktop browsers (at least) start with the string 'Mozilla/5.0', still, in a world where Chrome is so dominant.

reallyhuh 4 days ago [ - ]

What are the proportions for the attributions? Is it equally distributed or lopsided towards one of the three?

giantrobot 4 days ago [ - ]

Normal search engine spiders did/do cause problems but not on the scale of AI scrapers. Search engine spiders tend to follow a robots.txt, look at the sitemap.xml, and generally try to throttle requests. You'll find some that are poorly behaved but they tend to get blocked and either die out or get fixed and behave better.

The AI scrapers are atrocious. They just blindly blast every URL on a site with no throttling. They are terribly written and managed as the same scraper will hit the same site multiple times a day or even hour. They also don't pay any attention to context so they'll happily blast git repo hosts and hit expensive endpoints.

They're like a constant DoS attack. They're hard to block at the network level because they span across different hyperscalers' IP blocks.
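To make "follow a robots.txt and throttle" concrete, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleBot` agent name, and the `polite_plan` helper are invented for illustration; a real crawler would fetch the file from the site and sleep for the returned delay between requests.

```python
import urllib.robotparser

# Hypothetical robots.txt for a forum.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /search
"""

def make_parser(robots_txt):
    """Build a parser from robots.txt text without touching the network."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp

def polite_plan(rp, agent, urls, default_delay=1.0):
    """Return (allowed_urls, delay_seconds) for this agent."""
    delay = rp.crawl_delay(agent)
    if delay is None:
        delay = default_delay  # assumed fallback when no Crawl-delay is given
    allowed = [u for u in urls if rp.can_fetch(agent, u)]
    return allowed, delay

rp = make_parser(ROBOTS_TXT)
allowed, delay = polite_plan(
    rp,
    "ExampleBot",
    ["https://example.com/thread/1", "https://example.com/search?q=x"],
)
print(allowed, delay)
```

Here the expensive `/search` endpoint is skipped and the crawler is told to wait 5 seconds between requests, which is the behaviour the AI scrapers reportedly ignore.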

n1xis10t 4 days ago [ - ]

Puts on tinfoil hat: Maybe it isn’t AI scrapers, but actually is a massive DoS attack, and it’s a conspiracy to get people to not self-host.

danpalmer 4 days ago [ - ]

How do you define a user, and how do you define online?

If the forum considers unique cookies to be a user and creates a new cookie for any new cookie-less request, and if it considers a user to be online for 1 hour after their last request, then actually this may be one scraper making ~6 requests per second. That may be a pain in its own way, but it's far from 23k online bots.
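A toy simulation of that counting scheme, to show how one cookie-less client inflates the number. This is a hypothetical model (`count_online` and its session logic are invented here), not the actual logic of any forum software.

```python
import itertools

def count_online(requests, window_seconds):
    """Count 'online users' at the time of the last request.

    requests: iterable of (timestamp_seconds, cookie_or_None), in time order.
    A request without a cookie is given a brand-new session, as assumed above.
    """
    last_seen = {}
    new_ids = itertools.count()
    now = 0
    for ts, cookie in requests:
        if cookie is None:
            cookie = f"anon-{next(new_ids)}"  # fresh session per request
        last_seen[cookie] = ts
        now = ts
    return sum(1 for ts in last_seen.values() if now - ts < window_seconds)

# One scraper, 6 requests per second for an hour, never sending a cookie.
scraper = ((i / 6, None) for i in range(6 * 3600))
print(count_online(scraper, 3600))

# One browser making the same requests but keeping its cookie.
browser = ((i / 6, "session-abc") for i in range(6 * 3600))
print(count_online(browser, 3600))
```

Under these assumptions the single scraper shows up as 21,600 "online users", while the cookie-keeping browser counts as one.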

crote 4 days ago [ - ]

That's still 518,400 requests per day. For static content. And it's a niche forum, so it's not exactly going to have millions of pages.

Either there are indeed hundreds or thousands of AI bots DDoSing the entire internet, or a couple of bots are needlessly hammering it over and over and over again. I'm not sure which option is worse.

n1xis10t 4 days ago [ - ]

Imagine if all this scraping was going into a search engine with a massive index, or a bunch of smaller search engines that a meta-search engine could be made for. This’d be a lot more cool in that case.

thethingundone 4 days ago [ - ]

AFAIK it keeps a user counted as online for 5 or 15 minutes (I think 5). It’s a Woltlab Burning Board.

Edit: it’s 15 minutes.

danpalmer 4 days ago [ - ]

And what is a "user"?

thethingundone 4 days ago [ - ]

Whatever the forum software Woltlab Burning Board considers a user. If I recall correctly, it tries to build an identifier based on PHP session ids, so most likely simply cookies.

danpalmer 4 days ago [ - ]

This is exactly my point. Scrapers typically don't store cookies, so every single request is likely to be a "new" user as far as the forum software is concerned.

Couple that with 15 minute session times, and that could just be one entity scraping the forum at about 25 requests per second. One scraper going moderately fast sounds far less bad than 23,000 bots.

It still sounds excessive for a niche site, but I'd guess this is sporadic, or that the forum software has a page structure that traps scrapers accidentally, which is quite easy to do.
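The back-of-the-envelope arithmetic, assuming every request creates a new "user" and using the 23k online figure from the top of the thread (`implied_rate` is just a helper name for this sketch):

```python
def implied_rate(online_users, window_seconds):
    """Requests per second needed to keep that many one-request sessions alive."""
    return online_users / window_seconds

for label, window in (("1 hour", 3600), ("15 minutes", 900)):
    rate = implied_rate(23_000, window)
    print(f"{label}: {rate:.1f} req/s, {rate * 86_400:,.0f} req/day")
```

That works out to roughly 6.4 requests per second for a one-hour window and roughly 25.6 for a 15-minute window, i.e. about 0.55 million and 2.2 million requests per day respectively if sustained.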

mrweasel 4 days ago [ - ]

Why pay for storage when you do it for them?

stevage 3 days ago [ - ]

I'd love to know the answer to this question. AI scrapers wanting everything on the internet makes sense to me. But I don't understand how that leads to every site being hit hundreds of thousands of times per day.

GaryBluto 3 days ago [ - ]

Why do you keep it operating? Is it the aquarium value?

andrepd 4 days ago [ - ]

When you have trillions of dollars being poured into your company by the financial system, and when furthermore there are no repercussions for behaving however you please, you tend not to care about that sort of "waste".

csomar 3 days ago [ - ]

Sure you do by now. You are the hard drive.

sandblast 4 days ago [ - ]

Are you sure the counter is not broken?

thethingundone 4 days ago [ - ]

Yes, it’s running on a Woltlab Burning Board since forever.