Hacker News

dvduval 4 hours ago [ - ]

The broader problem of original sources not being given credit in a way that rewards them remains. Website owners are paying to host their content so that spiders can come and crawl it and index it into the AI, and then, if they’re lucky, they might get a citation, but otherwise there’s very little reward for being a provider of content. And of course, this is something that’s getting worse and worse. Why look at a website when it’s all in AI? And then the counter to that is maybe we need to start closing websites to crawlers and put everything behind a login.

Ensorceled 4 hours ago [ - ]

Worse, the constant AI scraping is actually costing content providers additional money for no return. At least Google/Bing/Yahoo scraping would then be used to provide links back to your content.

devsda 2 hours ago [ - ]

How do you distinguish Google/MS scraping for Gemini/Copilot vs Google Search/Bing? In the case of Google, the UA is the same and you are entirely at their mercy to honor the Google-Extended instructions in robots.txt
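As a rough illustration of the robots.txt side of this, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content and the example.com URL are assumptions made up for the example. `Google-Extended` is a robots.txt control token rather than a separate crawler User-Agent, so this only shows what a compliant parser would decide, not what any crawler actually does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: opt out of AI-related tokens, allow everything else.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(user_agent: str, url: str = "https://example.com/article") -> bool:
    """Return True if ROBOTS_TXT permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("Googlebot", "Google-Extended", "GPTBot"):
        print(agent, allowed(agent))
```

Nothing in this file is enforced on the server side; a crawler that never evaluates these rules is unaffected by them, which is the point being made in the comment.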

Google has further complicated it with its new search announcement, blurring the lines between regular search and AI search. And AI likes to not honor any licenses or instructions when it is hungry for training material.

It is once again an example of Google using its dominant position to abuse and promote cross functional products.

cute_boi an hour ago [ - ]

If companies like Meta are downloading pirated books etc. to train their AI, they will surely honor robots.txt.

bolangi 3 hours ago [ - ]

Not only costing money. Constant AI scraping constitutes a denial-of-service attack that has brought down websites.

fiedzia 3 hours ago [ - ]

> At least Google/Bing/Yahoo scraping would then be used to provide links back

That doesn't work anymore. Google provides an AI-generated summary, and nobody looks at the original site.

motbus3 4 hours ago [ - ]

About a year ago OpenAI crawled the company I work for at DDoS levels, despite the robots.txt not allowing it and despite some reCAPTCHA we managed to assemble in time.

We found our data in the outputs of their models but who can do anything about it...

kibwen 3 hours ago [ - ]

> We found our data in the outputs of their models but who can do anything about it...

If the crawlers refuse to voluntarily respect your robots.txt, then you are well within your rights to poison their data.

hajile 3 hours ago [ - ]

robots.txt seems like it should be a legally-binding terms of service which would make them outright copyright infringing.

Sue for $180,000 per infringement which should be calculated for each illegal API call.

throw1234567891 2 hours ago [ - ]

Was your robots.txt written by a lawyer? Does it hold up in court?

ElevenLathe an hour ago [ - ]

OpenAI might in fact be a good target for stuff like this at the moment. Even if your argument is weak, they may be eager to settle generously if your suit threatens the speediness of their IPO in some way. But I happen to think this is in fact a reasonable argument: I put up a sign that says not to do something with my property, and you went ahead and did it anyway, costing me money. IANAL but seems like a straightforward tort, no?

hajile an hour ago [ - ]

Contracts are legally binding even if they weren't written by a lawyer. Copyright is legally binding even if no copyright claim is explicitly stated.

I looked into this a bit (not a lawyer) and it seems that robots.txt isn't legally binding to either party, but this seems to have two major implications for AI agents (and crawlers/scrapers in general).

First, even if the robots.txt says you can crawl the site, that isn't a copyright grant of any kind or permission to copy/use that data outside of the permissions granted by the TOS.

Second, ignoring the robots.txt while also pirating the site contents could point to bad-faith and makes a much stronger case for double-damage penalties due to willful infringement.

If the site TOS doesn't explicitly grant an AI agent rights to copy out the site content AND the AI agent is ignoring the robots.txt at the same time, it seems a lot more likely that there's a strong copyright infringement case against the agent owner.

ethin an hour ago [ - ]

It doesn't have to be written by a lawyer. The robots.txt file is an administrative directive, by the webmaster of the website, that you, being a scraper, MUST NOT go to page x and/or y, or MUST NOT go to directory z. All the law would have to say is that it is a crime to not obey these directives. It's similar to trespassing: if I put a sign that says "DO NOT ENTER" in bright red letters on a door in my apartment, or "authorized people only!", that is still legally binding and a court isn't going to care that it wasn't lawyer-authored. The court will only care that you were told to not enter that area, but did so anyway.

wang_li an hour ago [ - ]

It doesn't matter. Robots.txt is not a license, it's a set of computer parsable directives of how programs should access your site. The actual license doesn't have to be written for computers to parse to be legally binding.

A person should be able to write in a terms of use or license page on their website that says "do not include any content from this website in your AI training data. if you do you will be billed $100 billion dollars." And it should be enforceable. It just turns out that nerds like to say "oh that would be too hard or too expensive, so we're going to ignore it."

shimman 2 hours ago [ - ]

Why hasn't your company sued OpenAI and tried to argue they're violating the Computer Fraud and Abuse Act? Would it really be impossible to argue this?

Unauthorized access, system damage, and maybe even extortion all apply here.

rastrojero2000 2 hours ago [ - ]

Lawyers can. As long as that data is actually yours I mean, in a strictly legal sense.

telotortium 3 hours ago [ - ]

I mean, did you check the IPs and make sure they’re from OpenAI? Obviously a fly-by-night AI company is going to set their User Agent to be from a big player.
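A minimal sketch of that kind of check, using Python's standard `ipaddress` module: compare the request's source IP against the CIDR ranges a crawler operator publishes, instead of trusting the User-Agent header. The ranges below are placeholders from the RFC 5737 documentation blocks, not any company's real ranges, and `is_from_published_range` is a hypothetical helper name.

```python
import ipaddress

# Placeholder CIDR blocks (RFC 5737 documentation ranges).
# Assumption: a real deployment would load the ranges the operator publishes.
PUBLISHED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_from_published_range(remote_ip: str) -> bool:
    """Return True if remote_ip falls inside one of the published ranges."""
    addr = ipaddress.ip_address(remote_ip)
    return any(addr in network for network in PUBLISHED_RANGES)


if __name__ == "__main__":
    # A request claiming a big-name User-Agent but arriving from elsewhere.
    print(is_from_published_range("192.0.2.17"))   # inside a listed range
    print(is_from_published_range("203.0.113.5"))  # outside every listed range
```

A request whose User-Agent names a big crawler but whose IP fails this check is likely spoofed.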

spacechild1 4 hours ago [ - ]

It's actually costing them money/time! A friend of mine is a sysadmin at a university and he constantly has to deal with AI crawlers DDoS-ing his servers. He said Anthropic is actually one of the worst offenders.

These AI companies are really just a gross example of the motto "Socialize the costs, privatise the profits". It's disgusting!

b00ty4breakfast 2 hours ago [ - ]

>Why look at a website when it's all in AI?

well, at least in the case of google, I'm pretty sure that's the point. Or at least, they are doing things that would seem to be moving towards being an oracle with all the answers and not the signpost that points you in the right direction. The destination rather than the gateway.

philipov 2 hours ago [ - ]

remember AMP?

aaarrm 4 hours ago [ - ]

Is it possible to host your website in a way so that it couldn't be found via search engines (and thus wouldn't be crawlable, I hope)?

I know this has repercussions on findability, but if that wasn't a concern, I'm curious how one might circumvent getting crawled.

matt_heimer 4 hours ago [ - ]

Sure, depends on how accessible to people you want it to be.

Most legit search engines are going to honor robots.txt and you can disallow access.

Next level would be using something like rate limiting controls and/or Cloudflare's bot fight mode to start blocking the bad bots. You start to annoy some people here.

Next would be putting the content behind some form of auth.

cute_boi an hour ago [ - ]

I don't know why we are trusting cloudflare when they are the one creating crawlers.

https://developers.cloudflare.com/browser-run/quick-actions/...

elorant 3 hours ago [ - ]

Possible, yes; probable, not likely. The moment you're issued a certificate, your domain will show up in the Certificate Transparency logs, which are constantly monitored by anyone who wants to find new sites.

salawat an hour ago [ - ]

....Yet another vector through which "security experts" have caused a waterbed problem. Let's secure the Internet, oh no! We made a centralized list of operating domains for hostile actors to guide attacks with!

trinari 4 hours ago [ - ]

robots.txt is a way of leaving the door unlocked but kindly asking bots to stay outside.

dpark 2 hours ago [ - ]

You might be interested to know that entering an unlocked door into a space you do not have permission to be in is still illegal.

throw1234567891 2 hours ago [ - ]

You might be interested to know that the “illegality” depends on the intent. If I rest on your unlocked door handle, it opens, I enter, it’s an accident.

dpark 2 hours ago [ - ]

Sorry, what? In this scenario are you claiming that you accidentally fell inside the restricted area because you were leaning on the door? Or are you claiming that you accidentally opened the door and then walked through intentionally? In the former case, you are guilty of breaking and entering in most US jurisdictions if you don’t promptly get out. Any sane court would likely agree an accidental trespass is probably not a criminal act, but it’s not an accident if you stay. In the latter case, you’re clearly trespassing illegally.

Also this has gotten pretty far away from the web scraping scenario. There’s no door accidentally opening here.

dminik 2 hours ago [ - ]

Oops, I just accidentally fell into every website. Don't know how that happened ...

account42 3 hours ago [ - ]

Which in a law-abiding society should be enough. It's also how we do things in the real world in many cases - i.e. here you can just write on your mailbox "no ads" and companies have to respect that.

Even when we do actually put physical locks on things they are mostly there to show that someone breaking in did so intentionally and not at all designed to prevent motivated attackers.

dpark 3 hours ago [ - ]

> here you can just write on your mailbox "no ads" and companies have to respect that

Where do you live? In the US it’s actually illegal for anyone except the USPS to deliver to a mailbox.

MontgomeryPy 3 hours ago [ - ]

You could just put your website content behind its own chat interface. The crawler would just see a form input for a prompt.

Imustaskforhelp 2 hours ago [ - ]

If you really want to and are happy with just text and normal styling limitations, I recommend you test out other protocols, like creating a Gemini website or Gopher website. I don't think that scraping happens on even remotely the same scale there as compared to conventional websites.

That being said you would require your user to download a compatible browser for gemini/gopher.

wolttam 4 hours ago [ - ]

I’ve been thinking of a proof-of-work scheme for accessing content where you effectively need to mine some crypto for the author, but, this idea might not fly today

microtonal 4 hours ago [ - ]

But that will be a hassle for human visitors as well. A web that requires proof-of-work to browse will be a disaster for phones with their limited batteries, etc.

odo1242 3 hours ago [ - ]

To be specific, it would be more of a hassle for human visitors than for the AI companies with infinite money and specialized browsers.

wolttam 2 hours ago [ - ]

The idea would be that AI companies would still be forced to do this proof of work. Anubis proved the idea

dpark 2 hours ago [ - ]

This is already a thing.

https://en.wikipedia.org/wiki/Anubis_(software)

wolttam 2 hours ago [ - ]

Yes, but:

> Although Anubis could be altered to mine cryptocurrency to serve as proof of work, Iaso has rejected this idea: "I don't want to touch cryptocurrency with a 20 foot pole."

Which in my mind is a shame. Crypto is an absolute mess, yes, but this seems like an elegant way to get something back for putting things out there.

vitally3643 2 hours ago [ - ]

Mining crypto doesn't materialize money. You have to exchange it for real money which means taking a private individual's money in exchange for scam tokens.

This is the problem crypto fans refuse to acknowledge. The money doesn't magically appear, you're taking it from someone else and letting them hold the bag when whatever cryptocurrency you choose inevitably blows up, fails, or rug-pulls. It's unethical to engage with at all because you're still participating in scamming real money out of private individuals

6031769 8 minutes ago [ - ]

Not necessarily. You can spend your cryptocoins with any number of businesses and it is very much the choice of those businesses to accept them or not. No private individuals need be involved.

Note also that any non-crypto currency can also devalue at any moment, although perhaps not to the same extent. Holding anything of any perceived value carries a risk and also a potential reward.

dpark 2 hours ago [ - ]

The problem is that much of the cost is borne by humans accessing the sites. People generally get real mad when they find out you’re using their computers to mine crypto.

chii 4 hours ago [ - ]

or you know, just charge for your content if you believe it to be valuable enough for the fee being charged.

wolttam 2 hours ago [ - ]

Yes, but that tends to limit the reach of your content. Hence why a lot of people reach for ads.

Between seeing ads and doing a little bit of proof-of-work for the author, I'd choose the latter.

gabbagool 3 hours ago [ - ]

I agree with this wholeheartedly. What's the point of even having copyright law at this point?

What's even crazier to think about is that to use the latest versions of these models for which you supplied training data, you have to pay hundreds of dollars a month. I would love to get a settlement check proportional to my model weights. Even if it's $0.10, at least everyone out there will get what they're owed.

rickydroll 2 hours ago [ - ]

From my perspective, everybody trains on the knowledge and experience of those who came before. AI just does the same thing at scale.

I do not value copyright. All it does is give you standing to sue if somebody reproduces your work. It does not differentiate or account for parallel creation. I cannot count how many times I have "created" something, only to find it in a research paper later.

Part of the reason I think copyright has no value is that, in general, individual copyright owners don't have the deep pockets necessary to sue someone who violates their copyright. If anyone is violating the spirit of copyright, it's corporations that insist you assign your work over to them as a work for hire, or outright ignore your copyright. (looking at you, Disney's Atlantis).

A significant benefit of AI that doesn't get talked about enough is that AI has a much greater reach over all the information it was trained on and can draw connections that would be invisible to someone operating at the human scale.

ofjcihen 2 hours ago [ - ]

The fact that these companies are making money off of it negates your argument.

visarga an hour ago [ - ]

I don't think anyone's "making money" yet. We have a race to build up hardware for AI, and one to train models. There are some profits in there, but who's making money from the work AI performs? Nobody, because any advantage some company claims with AI is quickly replicated by competitors and profit dries up.

Today you can put a coding agent to migrate an existing application to another language (like chardet). Even if you don't have the code, if you can run the app you can still clone it, using it as an oracle for replication. That is why there will be very little profits in AI usage.

ofjcihen 11 minutes ago [ - ]

I get what you’re saying but that’s irrelevant to the argument.

They are indeed taking in money by selling the product. Just because they don’t turn a profit doesn’t mean they’re not infringing copyright as a business practice to make money.

throw1234567891 2 hours ago [ - ]

No, you don’t have to. There are open weight models you can download and use for free. Many people choose the subscription model but it’s not necessary. And latest doesn’t mean greatest, it’s just most up-to-date.

WarmWash 4 hours ago [ - ]

It's never been a problem with people ad-blocking for the last 20 years, why is it suddenly a problem now?

We've been celebrating denying creators revenue for decades...

Maybe this is just the internet hypocrisy of "When I do it, it's good, when they do it, it's bad".

omnimus 3 hours ago [ - ]

Total sleight of hand.

Ad blocking has always been a problem for creators but it's aimed at big corps - non-creators. The creators asked people to support them other ways or turn off the blocking. And it's not like the little independent creators wanted this version of commercialized internet in the first place.

The AI marketing teams are spinning everything they can, but no: AI companies are the culprits, the vultures. No question about it.

WarmWash 3 hours ago [ - ]

The conversion from viewer to donor is around 1%. This is true from Wikipedia, to Twitch, to podcasts.

The number of people who will not ever load your ads is around 30%.

I can tell you that creators talk about this a lot in private, but will not publicly because the internet has a mass delusion on how creation and compensation works. It's like trying to convince Christians that Jesus obviously didn't come back from the dead days later, despite there being no logical system available that would explain it.

If we were to try and map out a functional internet where everyone wins, users and creators, there is no example where ad blocking is anything other than net harmful. You either get a volunteer net where 0.01% share hobby posts on their own dime for the other 99.99%, or you get IRC, where 99% of the population doesn't really benefit (à la 1993).

20k 26 minutes ago [ - ]

The problem is that the ad vendors couldn't keep it in their pants. The ads you're talking about are a common vector for delivering malware onto people's PCs, and absolutely destroy the usability of sites. Between tracking cookies, popups, full screen banners, autoplaying video, flashing ads, and their unbelievably high weight in bandwidth - the internet is fairly unusable if you don't block any ads

Bear in mind that many basic privacy features destroy ads by breaking tracking and fingerprinting. It's impossible to get a browser that doesn't filter out behaviours that have been used to deliver ads.

Creatives can and have adapted their strategies away from what is a very specific form of ads: the disruptive full screen ads, or banner ads. That's only one form of advertising that everyone utterly detests. Sponsored content is much more popular with the end users, and much more effective as well because it's way less disruptive. Some people hate that, but overall the tradeoff is significantly better.

We shouldn't confuse a single type of widely blocked advert with all advertising being blocked. Banner ads have very poor efficacy at delivering sales anyway

vharuck an hour ago [ - ]

I use ad blockers on my personal computer and phone to avoid tracking. My work computer doesn't have a blocker, but I only visit "professional" sites and major blog aggregators on it, so those ads aren't egregious. Ad blockers wouldn't have become a thing if it weren't for ads causing terrible layout, poor performance, and annoying interruptions when playing sound. Not every website does it, but the ones that do have poisoned the well.

u_fucking_dork 3 hours ago [ - ]

People usually point at the scale when this discussion comes up, in my experience. These companies are doing something at a huge scale spending tons of money to do it so the potential harm is greater.

People can easily justify their own piracy because it’s small scale. Even when they organize, create a whole software and tooling ecosystem around pirating media to stick into jellyfin or plex. AI still did it bigger and worse and is bad, what I’m doing is not so bad because I wasn’t going to buy the movie anyway, etc.

WarmWash 3 hours ago [ - ]

On the whole, about 35% of internet users are ad-blocking. In the tech space it's upwards of 70%.

It's in no way, shape, or form "small scale", and has fundamentally changed the very nature of the internet for the worse (opinions/views of ad blocking people don't matter).

52-6F-62 3 hours ago [ - ]

Don't forget that the money being spent to do said scraping has, in great sums, come from subsidies paid by taxes from public coffers.

onedognight 3 hours ago [ - ]

Choosing not to look at something is not denying anyone anything.

WarmWash 3 hours ago [ - ]

Choosing not to look at an ad, and blocking it are different things. One is totally ok, the other incurs a monetary loss on the creator. Those services aren't free to run, and the content doesn't take zero time to create. It also incentivizes creating content focused on those who cannot figure out ad blocking.

zetanor 3 hours ago [ - ]

I am in favor of severely limiting both copyright and advertising, but for the benefit of everyone, not just for the benefit of a few "AI" companies.

omnimus 3 hours ago [ - ]

And you will not get it. As the AI companies pump money into lawyers and politicians, they will be the ones profiting from copyright. Total regulatory capture as US AI companies make it illegal to train AI on their output.

WarmWash 3 hours ago [ - ]

The answer is to simply pay for stuff.

There is no viable model where "have stuff but not pay for it" works out.

theamk 3 hours ago [ - ]

There is more to life than money.

Many of the websites I read do not collect any appreciable amount of money from ads, or have no ads at all (one example: news.ycombinator.com :) ). They want recognition, or to share knowledge, or community, or they are building their brand... And AI is destroying all of this: the first result for "zx80" is an AI overview with a link to Wikipedia and some YouTube videos. If a person stops there, they will never get to the computinghistory.org.uk link, and won't see any related information about the variants and models.

WarmWash 3 hours ago [ - ]

This website is an ad for Y Combinator. It's in no way, shape, or form a charity place for devs to hang out. It's a feeding ground to lure tech people into a mega VC's pastures.

When you click "news.ycombinator.com" you are clicking on the ad.

mixmastamyk 3 hours ago [ - ]

Interesting. I suppose the main difference is that we’re ants compared to an 800 pound gorilla.

internet2000 3 hours ago [ - ]

Perhaps we should go back to when the internet was about sharing information you liked, not about credit or making money on "content".

sumeno an hour ago [ - ]

Ok, AI companies first then since they are some of the biggest offenders

throw1234567891 2 hours ago [ - ]

You are there today, but some are unhappy that others don’t share the same sentiment.