The web doesn't need attestation. It doesn't need signed agents. It doesn't need Cloudflare deciding who's a "real" user agent. It needs people to remember that "public" means PUBLIC and implement basic damn rate limiting if they can't handle the traffic.
The web doesn't need to know if you're a human, a bot, or a dog. It just needs to serve bytes to whoever asks, within reasonable resource constraints. That's it. That's the open web. You'll miss it when it's gone.
Basic damn rate limiting is pretty damn exploitable. Even ignoring botnets (which is impossible), usefully rate limiting IPv6 is anything but basic. If you just pick some prefix length between /48 and /64 to key your rate limits on, you'll either be exploitable by IPs from providers that hand out /48s like candy or you'll bucket a ton of mobile users together under a single rate limit.
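Concretely, this is roughly what you end up writing (a quick Python sketch with made-up limits; PREFIX_LEN is a hypothetical choice and exactly the knob with no good setting):

```python
import ipaddress
from collections import defaultdict
from time import monotonic

PREFIX_LEN = 56   # somewhere between /48 and /64; neither end is safe
LIMIT = 100       # requests allowed per window per bucket (made up)
WINDOW = 60.0     # seconds

counters = defaultdict(lambda: [0, monotonic()])  # bucket key -> [count, window start]

def bucket_for(ip: str) -> str:
    """Collapse a client address into the rate-limit key."""
    addr = ipaddress.ip_address(ip)
    if addr.version == 6:
        # Key on /64 and one customer holding a /48 controls 65,536 independent
        # buckets to burn through; key on /48 and you fold a mobile carrier's
        # worth of unrelated users into a single bucket.
        net = ipaddress.ip_network(f"{ip}/{PREFIX_LEN}", strict=False)
        return str(net)
    return ip  # IPv4: one address, one bucket

def allow(ip: str) -> bool:
    key = bucket_for(ip)
    count, started = counters[key]
    if monotonic() - started > WINDOW:
        counters[key] = [1, monotonic()]
        return True
    counters[key][0] = count + 1
    return counters[key][0] <= LIMIT
```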
You make unauthenticated requests cheap enough that you don't care about volume. Reserve rate limiting for authenticated users where you have real identity. The open web survives by being genuinely free to serve, not by trying to guess who's "real."
A basic Varnish setup should get you most of the way there, no agent signing required!
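Roughly the shape I mean, as a quick Python sketch (hypothetical names like handle/render_for_user, made-up numbers, not a real framework): anonymous traffic gets pre-rendered bytes with no per-IP bookkeeping, and the only rate limiting is a token bucket keyed on a real account.

```python
import time
from collections import defaultdict

STATIC_CACHE = {"/": b"<h1>Hello world</h1>"}   # pre-rendered public pages
PER_USER_LIMIT = 600                            # requests per minute, made up
buckets = defaultdict(lambda: [PER_USER_LIMIT, time.monotonic()])

def render_for_user(path: str, token: str) -> bytes:
    # Stand-in for whatever personalized/dynamic work the app does.
    return b"personalized response"

def handle(path: str, auth_token=None) -> bytes:
    if auth_token is None:
        # Unauthenticated: no identity guessing, no per-IP state, just cheap cached bytes.
        return STATIC_CACHE.get(path, b"404 not found")
    # Authenticated: token bucket keyed on the account, where identity is real.
    tokens, last = buckets[auth_token]
    now = time.monotonic()
    tokens = min(PER_USER_LIMIT, tokens + (now - last) * PER_USER_LIMIT / 60.0)
    if tokens < 1:
        return b"429 slow down"
    buckets[auth_token] = [tokens - 1, now]
    return render_for_user(path, auth_token)
```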
Your response to unauthenticated requests could be <h1>Hello world</h1> served from memory and your server/link will still fail under a volumetric attack, and you still get the pleasure of paying for the bandwidth.
So no, this advice has been outdated for decades.
Also you're doing some sort of victim blaming where everyone on earth has to engineer their service to withstand DoS instead of outsourcing that to someone else. Abusers outsource their attacks to everyone else's machine (decentralization ftw!), but victims can't outsource their defense because centralization goes against your ideals.
At least lament the naive infrastructure of the internet or something, sheesh.
We started with "AI crawlers are too aggressive" and you've escalated to volumetric DDoS. These aren't the same problem. OpenAI hitting your API too hard is solved by caching, not by Cloudflare deciding who gets an "agent passport."
"Victim blaming"? Can we please leave these therapy-speak terms back in the 2010s where they belong and out of technical discussions? If expecting basic caching is victim blaming, then so is expecting HTTPS, password hashing, or any technical competence whatsoever.
Your decentralization point actually proves mine: yes, attackers distribute while defenders centralize. That's why we shouldn't make centralization mandatory! Right now you can choose Cloudflare. With attestation, they become the web's border control.
The fine article makes it clear what this is really about: Cloudflare wants to be the gatekeeper for agent traffic. Agent attestation doesn't solve volumetric attacks (those need the DDoS protection they already sell, no new proposal required!). They're creating an allowlist where they decide who's "legitimate."
But sure, let's restructure the entire web's trust model because some sites can't configure a cache. That seems proportional.
OpenAI hitting your static, cached pages too hard and costing you terabytes of extra traffic that you have to pay for (both in raw bandwidth and in data transfer fees) isn't solved by caching.
The post you're replying to points out that, at a certain scale, even serving everything from an in-memory cache won't keep your system from falling over when user agents (e.g. AI scraper bots) behave like bad actors, ignoring robots.txt and fetching every URL twenty times a day while completely ignoring cache headers, Last-Modified, etc.
Your points were all valid when we were dealing with "legitimate users", "legitimate good-faith bots", and "bad actors", but now the AI companies' need for massive amounts of up-to-the-minute content at all costs means we have to add "legitimate bad-faith bots" to the mix.
> Agent attestation doesn't solve volumetric attacks (those need the DDoS protection they already sell, no new proposal required!). They're creating an allowlist where they decide who's "legitimate."
Agent attestation solves overzealous AI scraping, which looks like a volumetric attack, because if you refuse to provide the content to the bots then the bots will leave you alone (or at least, they won't chew up your bandwidth by re-fetching the same content over and over all day).
Well, your post escalated to the broad claim that I responded to.
You didn't just disagree with AI crawler attestation: you said that nobody should distinguish earnest users from everything else, and that operators should simply bear the cost of serving both, which necessarily includes bad traffic and incidental DoS.
Once again, services like Cloudflare exist because a cache isn't sufficient to deal with arbitrary traffic, and the scale of modern abuse is so large that only a few megacorps can provide the service that people want.
> You make unauthenticated requests cheap enough that you don't care about volume.
In the days before mandatory TLS it was so easy to set up a Squid proxy on the edge of my network and cache every plain-HTTP resource for as long as I wanted.
Like yeah, yeah, sure, it sucked that ISPs could inject trackers and stuff into page contents, but I'm starting to think the downsides of mandatory TLS outweigh the upsides. We made the web more Secure at the cost of making it less Private. We got Google Analytics and all the other spyware running over TLS and simultaneously made it that much harder for any normal person to host anything online.
You can still do that: have the caching reverse proxy at the edge of the network be the thing that terminates TLS.
Not really. At minimum you will break all of these sites on the HSTS preload list: https://source.chromium.org/chromium/chromium/src/+/main:net...
It isn't the client side that does this, it's the server side. Doing it on the client side has a nominal benefit in the typical case, but it's of very little value to you when the problem is some misbehaving third-party AI scraper taking down the server just when you need to get something from it that isn't already in the local cache.
If you have three local machines, you might be able to turn three queries into one, and that's assuming they all visit the same site rather than different people visiting different sites.
If you do this on the server, a request that requires executing PHP and three SQL queries goes from happening on every request for the same resource to happening once; subsequent requests just shovel the cached response back out the pipe instead of processing it again. Instead of reducing the number of requests that reach the back end by 3:1, you reduce it by a million to one.
And that doesn't cause any HSTS problems because a reverse proxy operated by the site owner has the real certificate in it.
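Toy version of what I mean, just to show the shape of it (Python stand-ins for the PHP/SQL work, TTL purely illustrative):

```python
import time

CACHE_TTL = 60.0   # seconds; even a short TTL collapses repeat traffic
_cache = {}        # path -> (timestamp, rendered bytes)

def expensive_render(path: str) -> bytes:
    # Stand-in for "execute PHP and run three SQL queries".
    time.sleep(0.05)
    return f"<html>rendered {path}</html>".encode()

def serve(path: str) -> bytes:
    hit = _cache.get(path)
    if hit and time.monotonic() - hit[0] < CACHE_TTL:
        return hit[1]                       # hot path: shovel cached bytes back out
    body = expensive_render(path)           # cold path: backend does the work once
    _cache[path] = (time.monotonic(), body)
    return body
```

Whether that cache lives in Varnish, a reverse proxy, or the app itself doesn't change the ratio: the expensive render happens once per TTL no matter how many times the scrapers ask.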
Public key pinning was rejected so you just need your proxy to also supply a certificate that's trusted by your clients.
I guess you should start a Cloudflare competitor that just puts a cheap Varnish VM in front of websites to solve bots forever.
What you're proposing is that a lot of small websites should simply shut down, in the name of the open internet. The goals seem self contradictory.
"It needs people to remember that "public" means PUBLIC and implement basic damn rate limiting if they can't handle the traffic."
And publish the acceptable rate.
But anyone who has ever been blocked for sending a _single_ HTTP request with the "wrong" user-agent string knows that what website operators are worried about is not necessarily rate (behaviour). Website operators routinely believe there is no such thing as a well-behaved bot, so they disregard behaviour and focus only on identity. If their crude heuristics, which carry a high probability of false positives, suggest "bot" as the identity, then their decision is to block, irrespective of behaviour, ignoring any possibility that the heuristics may have failed. Operators routinely make (incorrect) assumptions about intent based on identity, not behaviour.
Modern AI crawlers are indistinguishable from malicious botnets. There's no longer any rate-limiting strategy that's effective; that's entirely the point of what Cloudflare is attempting to solve.
Yes, I think that you are right (although rate limiting can sometimes be difficult to get working properly).
Delegation of authorization can be useful for things that require it (as in some of the examples given in the article), but public files should not require authorization or authentication to access them. Even if delegation of authorization is helpful for some uses, Cloudflare (or anyone else, other than whoever is delegating the authorization) does not need to be involved in them.
> public files should not require authorization or authentication to access them
Define "public files" in this case?
If I have a server with files, those are my private files. If I choose to make them accessible to the world then that's fine, but they're still private files and no one else has a right to access them except under the conditions that I set.
What Cloudflare is suggesting is that content owners (such as myself, HN, the New York Times, etc.) should be provided with the tools to restrict access to their content if unfettered access to all people is burdensome to them. For example, if AI scraper bots are running up your bandwidth bill or server load, shouldn't you be able to stop them? I would argue yes.
And yet you can't. These AI bots will ignore your robots.txt, they'll change user agents if you start to block their user agents, they'll use different IP subnets if you start to block IP subnets. They behave like extremely bad actors and ignore every single way you can tell them that they're not welcome. They take and take and provide nothing in return, and they'll do so until your website collapses under the weight and your readers or users leave to go somewhere else.
> For example, if AI scraper bots are running up your bandwidth bill or server load, shouldn't you be able to stop them? I would argue yes
I also say yes, but this is not because of a lack of authorization; it is because of excessive server load (which is what you describe).
Allowing other public mirrors of the files would be one thing that can help (providing archive files might also sometimes be useful), although that does not actually prevent excessive scraping, because of the bots' bad behaviour (which is also what you describe).
Some people may use Cloudflare, but Cloudflare has its own problems: a lot of legitimate access gets blocked, not all illegitimate access is necessarily prevented, and it sometimes causes additional problems (sometimes this might be due to misconfiguration, but not necessarily always).
> These AI bots will ignore your robots.txt, they'll change user agents if you start to block their user agents, they'll use different IP subnets if you start to block IP subnets
In my experience they change user agents and IP subnets whether or not you block them, and regardless of what else you might do.
> within reasonable resource constraints
And let’s all hold hands and sing kumbaya