The broader problem of original sources not being given credit in a way that rewards them remains. Website owners are paying to host their content so that spiders can come and crawl it and index it into the AI, and then, if they’re lucky, they might get a citation, but otherwise there’s very little reward for being a provider of content. And of course, this is something that’s getting worse and worse. Why look at a website when it’s all in AI? And then the counter to that is maybe we need to start closing websites to crawlers and put everything behind a login.

Worse, the constant AI scraping is actually costing content providers additional money for no return. At least Google/Bing/Yahoo scraping would then be used to provide links back to your content.

How do you distinguish Google/MS scraping for Gemini/Copilot vs Google Search/Bing? In the case of Google, the UA is the same and you are entirely at their mercy to honor the Google-Extended instructions in robots.txt
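For what it's worth, here is a rough sketch of how the Google-Extended token would be read by a crawler that actually honors robots.txt. It uses Python's standard `urllib.robotparser`; the robots.txt contents and the example.com URL are made up for illustration, and nothing in it forces any crawler to comply.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block the Google-Extended token, allow everyone else.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/post"  # placeholder URL
    print(allowed(ROBOTS_TXT, "Google-Extended", url))  # False
    print(allowed(ROBOTS_TXT, "Googlebot", url))        # True
```

This only models what a well-behaved parser concludes from the file. Since the fetching user agent is reportedly the same either way, the site owner cannot check from their own logs which purpose a given request served.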

Google has further complicated it with its new search announcement, blurring the lines between regular search and AI search. And AI likes to not honor any licenses or instructions when it is hungry for training material.

It is once again an example of Google using its dominant position to abuse and promote cross functional products.

If companies like Meta are downloading pirated books etc. to train their AI, they will surely honor robots.txt.

Not only costing money. Constant AI scraping constitutes a denial-of-service attack that has brought down websites.

> At least Google/Bing/Yahoo scraping would then be used to provide links back

That doesn't work anymore. Google provides an AI-generated summary, and nobody looks at the original site.

About a year ago OpenAI crawled the company I work for at DDoS levels, even though the robots.txt didn't allow it and despite some reCAPTCHA we could assemble in time.

We found our data in the outputs of their models but who can do anything about it...

> We found our data in the outputs of their models but who can do anything about it...

If the crawlers refuse to voluntarily respect your robots.txt, then you are well within your rights to poison their data.

robots.txt seems like it should be a legally-binding terms of service which would make them outright copyright infringing.

Sue for $180,000 per infringement which should be calculated for each illegal API call.

Was your robots.txt written by a lawyer? Does it hold up in court?

OpenAI might in fact be a good target for stuff like this at the moment. Even if your argument is weak, they may be eager to settle generously if your suit threatens the speediness of their IPO in some way. But I happen to think this is in fact a reasonable argument: I put up a sign that says not to do something with my property, and you went ahead and did it anyway, costing me money. IANAL but seems like a straightforward tort, no?

Contracts are legally binding even if they weren't written by a lawyer. Copyright is legally binding even if no copyright claim is explicitly stated.

I looked into this a bit (not a lawyer) and it seems that robots.txt isn't legally binding to either party, but this seems to have two major implications for AI agents (and crawlers/scrapers in general).

First, even if the robots.txt says you can crawl the site, that isn't a copyright grant of any kind or permission to copy/use that data outside of the permissions granted by the TOS.

Second, ignoring the robots.txt while also pirating the site contents could point to bad faith and make a much stronger case for double-damage penalties due to willful infringement.

If the site TOS doesn't explicitly grant an AI agent rights to copy out the site content AND the AI agent is ignoring the robots.txt at the same time, it seems a lot more likely that there's a strong copyright infringement case against the agent owner.

It doesn't have to be written by a lawyer. The robots.txt file is an administrative directive, by the webmaster of the website, that you, being a scraper, MUST NOT go to page x and/or y, or MUST NOT go to directory z. All the law would have to say is that it is a crime to not obey these directives. It's similar to trespassing: if I put a sign that says "DO NOT ENTER" in bright red letters on a door in my apartment, or "authorized people only!", that is still legally binding and a court isn't going to care that it wasn't lawyer-authored. The court will only care that you were told to not enter that area, but did so anyway.

It doesn't matter. Robots.txt is not a license, it's a set of computer parsable directives of how programs should access your site. The actual license doesn't have to be written for computers to parse to be legally binding.

A person should be able to write in a terms of use or license page on their website that says "do not include any content from this website in your AI training data. if you do you will be billed $100 billion dollars." And it should be enforceable. It just turns out that nerds like to say "oh that would be too hard or too expensive, so we're going to ignore it."

Why hasn't your company sued OpenAI and tried to argue they're violating the Computer Fraud and Abuse Act? Would it really be impossible to argue this?

Unauthorized access, system damage, and maybe even extortion all apply here.

Lawyers can. As long as that data is actually yours I mean, in a strictly legal sense.

I mean, did you check the IPs and make sure they’re from OpenAI? Obviously a fly-by-night AI company is going to set their User Agent to be from a big player.
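A minimal sketch of that kind of check, assuming you have a list of CIDR ranges the crawler operator publishes for its bots. The ranges below are RFC 5737 documentation placeholders, not anyone's real ranges, and the function name is just for illustration.

```python
import ipaddress

# Placeholder ranges (RFC 5737 documentation blocks), NOT a real vendor list.
# In practice you would load whatever the operator actually publishes.
PUBLISHED_RANGES = ["192.0.2.0/24", "198.51.100.0/24"]


def ip_in_published_ranges(ip: str, cidr_ranges: list[str]) -> bool:
    """Return True if ip falls inside any of the given CIDR ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidr_ranges)


if __name__ == "__main__":
    # A request claiming a well-known crawler's User-Agent from an address
    # outside the published ranges is probably someone spoofing the header.
    print(ip_in_published_ranges("192.0.2.17", PUBLISHED_RANGES))   # True
    print(ip_in_published_ranges("203.0.113.5", PUBLISHED_RANGES))  # False
```

The User-Agent header is trivially forged, so the source address is the stronger signal, though it only helps if the operator publishes ranges in the first place.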

It's actually costing them money/time! A friend of mine is a sysadmin at a university and he constantly has to deal with AI crawlers DDoS-ing his servers. He said Anthropic is actually one of the worst offenders.

These AI companies are really just a gross example of the motto "Socialize the costs, privatise the profits". It's disgusting!

>Why look at a website when it's all in AI?

well, at least in the case of google, I'm pretty sure that's the point. Or at least, they are doing things that would seem to be moving towards being an oracle with all the answers and not the signpost that points you in the right direction. The destination rather than the gateway.

remember AMP?

Is it possible to host your website in a way so that it couldn't be found via search engines (and thus wouldn't be crawlable, I hope)?

I know this has repercussions on findability, but if that wasn't a concern, I'm curious how one might circumvent getting crawled.

Sure, depends on how accessible to people you want it to be.

Most legit search engines are going to honor robots.txt and you can disallow access.

Next level would be using something like rate limiting controls and/or Cloudflare's bot fight mode to start blocking the bad bots. You start to annoy some people here.

Next would be putting the content behind some form of auth.
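To make the rate-limiting step a bit more concrete, here is a rough per-client token bucket in Python. The class name, the parameters, and the idea of keying on a client string (an IP, say) are all assumptions for illustration; a real deployment would more likely use whatever the web server or CDN already offers.

```python
import time


class TokenBucket:
    """Per-client token bucket: each request spends one token,
    and tokens refill at `rate` per second up to `capacity`."""

    def __init__(self, rate: float, capacity: float, now=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.now = now            # injectable clock, handy for testing
        self.buckets = {}         # client -> (tokens, last_timestamp)

    def allow(self, client: str) -> bool:
        t = self.now()
        # New clients start with a full bucket.
        tokens, last = self.buckets.get(client, (self.capacity, t))
        # Refill for the time elapsed since the last request, capped at capacity.
        tokens = min(self.capacity, tokens + (t - last) * self.rate)
        if tokens >= 1:
            self.buckets[client] = (tokens - 1, t)
            return True
        self.buckets[client] = (tokens, t)
        return False


if __name__ == "__main__":
    limiter = TokenBucket(rate=1.0, capacity=5)
    # A burst from one client: requests beyond the bucket size get refused.
    print([limiter.allow("203.0.113.5") for _ in range(7)])
```

The bucket tolerates short bursts from a human clicking around while refusing sustained hammering. The catch is that a crawler spread over many addresses looks like many polite clients, which is where the "annoy some people" trade-offs come in.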

I don't know why we are trusting cloudflare when they are the one creating crawlers.

https://developers.cloudflare.com/browser-run/quick-actions/...

Possible yes, probable not likely. The moment you're issued a certificate, your domain will show up in the Certificate Transparency logs, which are constantly monitored by anyone who wants to find new sites.

....Yet another vector through which "security experts" have caused a waterbed problem. Let's secure the Internet, oh no! We made a centralized list of operating domains for hostile actors to guide attacks with!

robots.txt is a way of leaving the door unlocked but kindly asking bots to stay outside.

You might be interested to know that entering an unlocked door into a space you do not have permission to be in is still illegal.

You might be interested to know that the “illegality” depends on the intent. If I rest on your unlocked door handle, it opens, I enter, it’s an accident.

Sorry, what? In this scenario are you claiming that you accidentally fell inside the restricted area because you were leaning on the door? Or are you claiming that you accidentally opened the door and then walked through intentionally? In the former case, you are guilty of breaking and entering in most US jurisdictions if you don’t promptly get out. Any sane court would likely agree an accidental trespass is probably not a criminal act, but it’s not an accident if you stay. In the latter case, you’re clearly trespassing illegally.

Also this has gotten pretty far away from the web scraping scenario. There’s no door accidentally opening here.

Oops, I just accidentally fell into every website. Don't know how that happened ...

Which in a law-abiding society should be enough. It's also how we do things in the real world in many cases - i.e. here you can just write on your mailbox "no ads" and companies have to respect that.

Even when we do actually put physical locks on things they are mostly there to show that someone breaking in did so intentionally and not at all designed to prevent motivated attackers.

> here you can just write on your mailbox "no ads" and companies have to respect that

Where do you live? In the US it’s actually illegal for anyone except the USPS to deliver to a mailbox.

You could just put your website content behind its own chat interface. The crawler would just see a form input for a prompt.

If you're really interested in doing this, and are happy with just text and normal styling limitations, I recommend trying out other protocols, like creating a Gemini or Gopher site. I don't think scraping happens on even remotely the same scale there as on conventional websites.

That being said, your users would need to download a compatible browser for Gemini/Gopher.

I’ve been thinking of a proof-of-work scheme for accessing content where you effectively need to mine some crypto for the author, but this idea might not fly today.

But that will be a hassle for human visitors as well. A web that requires proof-of-work to browse will be a disaster for phones with their limited batteries, etc.

To be specific, it would be more of a hassle for human visitors than for the AI companies with infinite money and specialized browsers.

The idea would be that AI companies would still be forced to do this proof of work. Anubis proved the idea
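For anyone unfamiliar with the mechanism, here is a generic hashcash-style sketch in Python. It is not Anubis's actual code or protocol, and the challenge string, difficulty, and function names are made up: the server hands out a challenge, the client burns CPU finding a nonce, and the server checks the answer with a single hash.

```python
import hashlib
from itertools import count


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Cheap server-side check: the SHA-256 hex digest of challenge+nonce
    must start with `difficulty` zero hex digits."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)


def solve(challenge: str, difficulty: int) -> int:
    """Expensive client-side search: try nonces until one verifies.
    Expected work grows by a factor of 16 per extra difficulty digit."""
    for nonce in count():
        if verify(challenge, nonce, difficulty):
            return nonce


if __name__ == "__main__":
    challenge = "example-challenge"  # hypothetical per-visitor challenge
    nonce = solve(challenge, difficulty=3)
    print(nonce, verify(challenge, nonce, difficulty=3))
```

The asymmetry is the whole point: verification costs one hash, while solving costs thousands. One page view is negligible for a visitor, but the cost adds up for something fetching millions of pages, which is also why the phone-battery objection above is about how high the difficulty is set.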

This is already a thing.

https://en.wikipedia.org/wiki/Anubis_(software)

Yes, but:

> Although Anubis could be altered to mine cryptocurrency to serve as proof of work, Iaso has rejected this idea: "I don't want to touch cryptocurrency with a 20 foot pole."

Which in my mind is a shame. Crypto is an absolute mess, yes, but this seems like an elegant way to get something back for putting things out there.

Mining crypto doesn't materialize money. You have to exchange it for real money, which means taking a private individual's money in exchange for scam tokens.

This is the problem crypto fans refuse to acknowledge. The money doesn't magically appear, you're taking it from someone else and letting them hold the bag when whatever cryptocurrency you choose inevitably blows up, fails, or rug-pulls. It's unethical to engage with at all because you're still participating in scamming real money out of private individuals

Not necessarily. You can spend your cryptocoins with any number of businesses and it is very much the choice of those businesses to accept them or not. No private individuals need be involved.

Note also that any non-crypto currency can also devalue at any moment, although perhaps not to the same extent. Holding anything of any perceived value carries a risk and also a potential reward.

The problem is that much of the cost is borne by humans accessing the sites. People generally get real mad when they find out you’re using their computers to mine crypto.

or you know, just charge for your content if you believe it to be valuable enough for the fee being charged.

Yes, but that tends to limit the reach of your content. Hence why a lot of people reach for ads.

Between seeing ads and doing a little bit of proof-of-work for the author, I'd choose the latter.

I agree with this wholeheartedly. What's the point of even having copyright law at this point?

What's even crazier to think about is that to use the latest versions of these models for which you supplied training data, you have to pay hundreds of dollars a month. I would love to get a settlement check proportional to my model weights. Even if it's $0.10, at least everyone out there will get what they're owed.

From my perspective, everybody trains on the knowledge and experience of those who came before. AI just does the same thing at scale.

I do not value copyright. All it does is give you standing to sue if somebody reproduces your work. It does not differentiate or account for parallel creation. I cannot count how many times I have "created" something, only to find it in a research paper later.

Part of the reason I think copyright has no value is that, in general, individual copyright owners don't have the deep pockets necessary to sue someone who violates their copyright. If anyone is violating the spirit of copyright, it's corporations that insist you assign your work over to them as a work for hire, or outright ignore your copyright. (looking at you, Disney's Atlantis).

A significant benefit of AI that doesn't get talked about enough is that AI has a much greater reach over all the information it was trained on and can draw connections that would be invisible to someone operating at the human scale.

The fact that these companies are making money off of it negates your argument.

I don't think anyone's "making money" yet. We have a race to build up hardware for AI, and one to train models. There are some profits in there, but who's making money from the work AI performs? Nobody, because any advantage some company claims with AI is quickly replicated by competitors and profit dries up.

Today you can set a coding agent to migrate an existing application to another language (like chardet). Even if you don't have the code, if you can run the app you can still clone it, using it as an oracle for replication. That is why there will be very little profit in AI usage.

I get what you’re saying but that’s irrelevant to the argument.

They are indeed taking in money by selling the product. Just because they don’t turn a profit doesn’t mean they’re not infringing copyright as a business practice to make money.

No, you don’t have to. There are open weight models you can download and use for free. Many people choose the subscription model but it’s not necessary. And latest doesn’t mean greatest, it’s just most up-to-date.

It's never been a problem with people ad-blocking for the last 20 years, why is it suddenly a problem now?

We've been celebrating denying creators revenue for decades...

Maybe this is just the internet hypocrisy of "When I do it, it's good, when they do it, it's bad".

Total sleight of hand.

Ad blocking has always been a problem for creators but it's aimed at big corps - non-creators. The creators asked people to support them other ways or turn off the blocking. And it's not like the little independent creators wanted this version of commercialized internet in the first place.

The AI marketing teams are spinning everything they can, but no, AI companies are the conscript, the vultures. No question about it.

The conversion from viewer to donor is around 1%. This is true from Wikipedia, to Twitch, to podcasts.

The number of people who will not ever load your ads is around 30%.

I can tell you that creators talk about this a lot in private, but will not publicly because the internet has a mass delusion on how creation and compensation works. It's like trying to convince Christians that Jesus obviously didn't come back from the dead days later, despite there being no logical system available that would explain it.

If we were to try and map out a functional internet where everyone wins, users and creators, there is no example where ad blocking is anything other than net harmful. You either get volunteer net where 0.01% share hobby posts on their own dime for the other 99.99% or you get IRC where 99% of the population doesn't really benefit (à la 1993).

The problem is that the ad vendors couldn't keep it in their pants. The ads you're talking about are a common vector for delivering malware onto people's PCs, and absolutely destroy the usability of sites. Between tracking cookies, popups, full screen banners, autoplaying video, flashing ads, and their unbelievably high weight in bandwidth - the internet is fairly unusable if you don't block any ads

Bear in mind that many basic privacy features destroy ads by breaking tracking and fingerprinting. It's impossible to get a browser that doesn't filter out behaviours that have been used to deliver ads.

Creatives can and have adapted their strategies away from what is a very specific form of ads: the disruptive full screen ads, or banner ads. That's only one form of advertising that everyone utterly detests. Sponsored content is much more popular with the end users, and much more effective as well because it's way less disruptive. Some people hate that, but overall the tradeoff is significantly better.

We shouldn't confuse a single type of widely blocked advert with all advertising being blocked. Banner ads have very poor efficacy at delivering sales anyway

I use ad blockers on my personal computer and phone to avoid tracking. My work computer doesn't have a blocker, but I only visit "professional" sites and major blog aggregators on it, so those ads aren't egregious. Ad blockers wouldn't have become a thing if it weren't for ads causing terrible layout, poor performance, and annoying interruptions when playing sound. Not every website does it, but the ones that do have poisoned the well.

People usually point at the scale when this discussion comes up, in my experience. These companies are doing something at a huge scale spending tons of money to do it so the potential harm is greater.

People can easily justify their own piracy because it’s small scale. Even when they organize, create a whole software and tooling ecosystem around pirating media to stick into jellyfin or plex. AI still did it bigger and worse and is bad, what I’m doing is not so bad because I wasn’t going to buy the movie anyway, etc.

On the whole, about 35% of internet users are ad-blocking. In the tech space it's upwards of 70%.

It's in no way, shape, or form "small scale", and has fundamentally changed the very nature of the internet for the worse (opinions/views of ad blocking people don't matter).

Don't forget that the money being spent to do said scraping has, in great sums, come from subsidies paid by taxes from public coffers.

Choosing not to look at something is not denying anyone anything.

Choosing not to look at an ad, and blocking it are different things. One is totally ok, the other incurs a monetary loss on the creator. Those services aren't free to run, and the content doesn't take zero time to create. It also incentivizes creating content focused on those who cannot figure out ad blocking.

I am in favor of severely limiting both copyright and advertising, but for the benefit of everyone, not just for the benefit of a few "AI" companies.

And you will not get it. As the AI companies pump money into lawyers and politicians, they will be the ones profiting from copyright. Total regulatory capture as US AI companies make it illegal to train AI on their output.

The answer is to simply pay for stuff.

There is no viable model where "have stuff but not pay for it" works out.

There is more to life than money.

Many of the websites I read do not collect any appreciable amount of money from ads, or have no ads at all (one example: news.ycombinator.com :) ). They want recognition, or to share knowledge, or community, or they are building their brand... And AI is destroying all of this: the first result for "zx80" is an AI overview with a link to Wikipedia and some YouTube videos. If a person stops there, they will never get to the computinghistory.org.uk link, and won't see any related information about the variants and models.

This website is an ad for Y Combinator. It's in no way, shape, or form a charity place for devs to hang out. It's a feeding ground to lure tech people into a mega VC's pastures.

When you click "news.ycombinator.com" you are clicking on the ad.

:)

Interesting. I suppose the main difference is that we’re ants compared to an 800 pound gorilla.

Perhaps we should go back to when the internet was about sharing information you liked, not about credit or making money on "content".

Ok, AI companies first then since they are some of the biggest offenders

You are there today, but some are unhappy that others don’t share the same sentiment.