Yes, I think that you are right (although rate limiting can sometimes be difficult to get working properly).
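
For what it's worth, the basic mechanism (a per-client token bucket) is only a few lines; the part that is difficult to get working properly is picking the key and the numbers. A rough sketch, with made-up values rather than a recommendation:

```python
# Rough sketch of per-client token-bucket rate limiting.
# CAPACITY and REFILL_RATE are made-up values, not recommendations.
import time
from collections import defaultdict

CAPACITY = 60        # maximum burst per client (assumed value)
REFILL_RATE = 1.0    # tokens added per second (assumed value)

buckets = defaultdict(lambda: {"tokens": CAPACITY, "last": time.monotonic()})

def allow_request(client_key: str) -> bool:
    """Return True if this client may make a request right now."""
    b = buckets[client_key]
    now = time.monotonic()
    b["tokens"] = min(CAPACITY, b["tokens"] + (now - b["last"]) * REFILL_RATE)
    b["last"] = now
    if b["tokens"] >= 1:
        b["tokens"] -= 1
        return True
    return False
```

The key is the awkward part: if it is the IP address, well-behaved users behind NAT get lumped together, and the scrapers rotate addresses anyway.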

Delegation of authorization can be useful for things that require it (as in some of the examples given in the article), but public files should not require authorization or authentication to access. Even if delegation of authorization is helpful for some uses, Cloudflare (or anyone else, other than whoever is delegating the authorization) does not need to be involved in them.

> public files should not require authorization or authentication to access

Define "public files" in this case?

If I have a server with files, those are my private files. If I choose to make them accessible to the world then that's fine, but they're still private files and no one else has a right to access them except under the conditions that I set.

What Cloudflare is suggesting is that content owners (such as myself, HN, the New York Times, etc.) should be provided with the tools to restrict access to their content if unfettered access by everyone is burdensome to them. For example, if AI scraper bots are running up your bandwidth bill or server load, shouldn't you be able to stop them? I would argue yes.

And yet you can't. These AI bots will ignore your robots.txt, they'll change user agents if you start to block their user agents, they'll use different IP subnets if you start to block IP subnets. They behave like extremely bad actors and ignore every single way you can tell them that they're not welcome. They take and take and provide nothing in return, and they'll do so until your website collapses under the weight and your readers or users leave to go somewhere else.
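To make the cat-and-mouse concrete, the blocking side usually amounts to little more than something like this; the bot names and address ranges below are placeholders, not a real blocklist:

```python
# Sketch of the kind of blocking the scrapers route around:
# deny by User-Agent substring and by IP subnet.
# The names and ranges are placeholders, not a real blocklist.
import ipaddress

BLOCKED_AGENT_SUBSTRINGS = ["ExampleAIBot", "SomeScraper"]    # placeholders
BLOCKED_SUBNETS = [ipaddress.ip_network("203.0.113.0/24")]    # placeholder range

def is_blocked(user_agent: str, remote_addr: str) -> bool:
    """Return True if the request matches the User-Agent or subnet blocklist."""
    if any(s.lower() in user_agent.lower() for s in BLOCKED_AGENT_SUBSTRINGS):
        return True
    addr = ipaddress.ip_address(remote_addr)
    return any(addr in net for net in BLOCKED_SUBNETS)
```

And as soon as a list like this starts to bite, the crawler shows up again with a browser User-Agent from a different subnet, and you are back where you started.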

> For example, if AI scraper bots are running up your bandwidth bill or server load, shouldn't you be able to stop them? I would argue yes

I also say yes, but this is not because of a lack of authorization; it is because of excessive server load (which is what you describe).

Allowing other public mirrors of files would be one thing that can be helpful (providing archive files might also sometimes be useful), although that does not actually prevent excessive scraping, because of the bots' bad behaviour (which is also what you describe).

Some people may use Cloudflare, but Cloudflare has problems of its own: a lot of legitimate access gets blocked, not all illegitimate access is necessarily prevented, and it sometimes causes additional problems (sometimes this might be due to misconfiguration, but not necessarily always).

> These AI bots will ignore your robots.txt, they'll change user agents if you start to block their user agents, they'll use different IP subnets if you start to block IP subnets

In my experience they change user agents and IP subnets whether or not you block them, and regardless of what else you might do.