This is all public data. People should not be putting personal data on public image hosts and sites like LinkedIn if they don't want it to be scraped. There is nothing private about the internet and I wish people understood that.

> There is nothing private about the internet and I wish people understood that.

I don’t know that that is useful advice for the average person. For instance, you can access your bank account via the internet, yet there are very strong privacy guarantees.

I concur that what you say is a safe default assumption, but then you need a way to keep people from now mistrusting every internet service because everything is considered public.

I'm talking more about consciously posting things on some platform, not accessing one.

Well, even when consciously posting on a platform, you could assume it stays on the platform. Sure, tech people know it's public data at that point. But for an average person, it's not weird to think it stays on the platform. Especially since you have to log in to even see posts.

Why would anyone assume it stays on the platform? It's public now, therefore anyone can see it.

Because that’s normally how life works. Things stay within a context.

Not online, which again people should remember.

> This is all public data

It's important to know that generally this distinction is not relevant when it comes to data subject rights like GDPR's right to erasure: If your company is processing any kind of personal data, including publicly available data, it must comply with data protection regulations.

That's all fine. But until someone requests their information to be deleted, it is still public.

The law has in no way been able to keep up with AI. Just look at copyright. Internet data is public and the government is incapable of changing this.

By design, yes. AI companies are taking "move fast and break stuff" to its logical extreme.

Eventually, it will catch up. Whether the punishment offsets the abuse is yet to be seen (I'm not holding my breath).

> Internet data is public and the government is incapable of changing this.

Incapable or unwilling (paid for by those who want to grab more data)?

They will not be punished; e.g., Uber and Airbnb were never really punished despite blatantly ignoring the law.

I would claim incapable, but it doesn't really matter; the outcome is the same.

GDPR won't protect you, nor will other data privacy laws. Most of the world simply doesn't care enough. I wish it were different.

While I agree with your sentiment, there's a pretty good chance that at least some of this is, for example, data that leaked when someone accidentally exposed an automatic directory index with Apache, or an asset manifest that revealed a bunch of uploaded images in a folder or bucket that wasn't marked private for whatever reason. I can think of a lot of reasons this data could be "public" in ways well beyond the control of the person exposed. I also don't think there's a universal enough understanding that uploading something to your WordPress or whatever personal/business site to share with a specific person, at an obscure unpublished URL, is actually public. I think these lines are pretty blurry.

Edit: to clarify, in the first two examples I'm referring to web applications that the exposed person uses but does not control.
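
For what it's worth, the Apache case is easy enough to spot that you can automate the check. Here's a minimal sketch in Python, assuming the classic mod_autoindex fingerprints ("Index of /" in the title, a "Parent Directory" link); the marker strings, the example URL, and the function name are illustrative assumptions, not anything from this thread:

    # Minimal sketch: flag pages that look like an auto-generated Apache
    # directory listing (what mod_autoindex serves when "Options -Indexes"
    # is not set). Marker strings and the example URL are assumptions.
    import urllib.request

    AUTOINDEX_MARKERS = ("<title>Index of /", "Parent Directory")

    def looks_like_open_index(url: str) -> bool:
        """Fetch the page and check for autoindex fingerprints."""
        with urllib.request.urlopen(url, timeout=10) as resp:
            body = resp.read(65536).decode("utf-8", errors="replace")
        return any(marker in body for marker in AUTOINDEX_MARKERS)

    # Hypothetical path; point it at a directory on a site you control.
    print(looks_like_open_index("https://example.com/uploads/"))

Server-side, the durable fix is disabling listings outright (in Apache, "Options -Indexes"); the script above only tells you a directory is already exposed.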

> People should not be putting personal data on public image hosts and sites like LinkedIn if they don't want it to be scraped.

So my choice in society is to either not have a job or interviews, or to accept that I have no privacy in the modern world and be mined for profit by companies that lay off their workers anyway.

By the way, I was also recommended to make and show off a website portfolio to get interviews... sigh.

But that is information you intend to be public: you want it in Google, and in AI models as they replace traditional search engines. The only reason you put it on LinkedIn is for other people to find you, so be happy the LLM helps.

You don't have to use LinkedIn or similar, many people don't.

What's important is that we blame the victims instead of the corporations that are abusing people's trust. The victims should have known better than to trust corporations.

Right, both things can be wrong here.

We need to better educate people on the risks of posting private information online.

But that does not absolve these corporations of criticism of how they are handling data and "protecting" people's privacy.

Especially not when those companies are using dark patterns to convince people to share more and more information with them.

If this were 2010, I would agree. This is the world we live in. If you post a picture of yourself on a lamp post on a street in a busy city, you can't be surprised if someone takes it. It's the same on the internet, and everyone knows it by now.


I have negative sympathy for people who still aren't aware that if they aren't paying for something, they are the something to be sold. This has been the case for almost 30 years now with the majority of services on the internet, including this very website right here.

People are literally born into that misunderstanding all the time (because it’s not obvious). It’s an evergreen problem.

So you are basically saying you have no sympathy for young people who happen not to have been taught about this, or who haven't been guided by someone articulate enough to explain it.

Is it taught in schools yet? If it’s not, then why assume everyone should have a good working understanding of this (actually nuanced) topic?

For example, I encounter people who believe that Google literally sells databases, lists of user data, when the actual situation (that they sell gated access to targeted eyeballs at a given moment, and that this slowly leaks identifying information) is more nuanced and complicated.

It is taught in schools that everything you post online is public.

That explains why ISPs sell DNS lookup history, or why your utility company sells your usage habits. Or why your TV tracks your viewing. I've paid for all of those, but somehow, I'm still the product.

Tbh, even if they are paying for it, they're probably still the product. Unless maybe they're an enterprise customer who can afford orders of magnitude more to obtain relative privacy.

I paid big $$ for my smart TV, yet I still feel like I'm the product :(

Modern companies: We aim to create or use human-like AI.

Those same modern companies: Look, if our users inadvertently upload sensitive or private information then we can't really help them. The heuristics for detecting those kinds of things are just too difficult to implement.

> The victims should have known better than to trust corporations

Literally yes? Is this sarcasm? Are we in 2025 supposed to implicitly trust multi-billion dollar multi-national corporations that have decades' worth of abuses to look back on? As if we couldn't have seen this coming?

It's been part of every social media platform's ToS for many years that they get a license to do whatever they want with what you upload. People have warned others about this for years and nothing happened. Those platforms have already used that data, prior to this, for image classification, identification, and the like. But nothing happened. What's different now?

> blame the victims

If you post something publicly, you can't complain that it is public.

But I can complain about what happens to said something. If my blog photo becomes deepfake porn, am I allowed to complain or not? What we have (with AI) is an entirely novel situation, worth at least a serious discussion.

FWIW... I really don't think so. If you, say, posted your photo on a bulletin board in your local City Hall, can you prevent it from being defaced? Can you choose who gets to look at it? Maybe they take a picture of it and trace it... do you have any legal ground there? (Genuine question.) And even if so... it's illegal to draw angry eyebrows on every face on a billboard, but people still do it...

IMO, it being posted online to a publicly accessible site is the same. Don't post anything you don't want right-click-saved.

GDPR's right to erasure says I can demand that my personal data be deleted, and I don't see any language limiting that to things _I_ submitted.

No. Don't give the entire world access to your photo. Creating fakes with Photoshop was a thing well before AI.

> If my blog photo becomes deepfake porn

Depends. In most cases, this is forbidden by law and you can claim actual damages.

That's helpful if they live in the same country, you can figure out who the 4chan poster was, the police are interested (or you want to risk paying a lawyer), you're willing to sink the time pursuing such action (and if criminal, risk adversarial LEO interaction), and you are satisfied knowing hundreds of others may be doing the same and won't be deterred. Of course, friends and co-workers are too close to you to post publicly when they generate it. Thankfully, the Taylor Swift laws in the US have stopped the generation of nonconsensual imagery and video of their namesake (they haven't).

My daughter's school posted pictures of her online without an opt-out, but she's also on Facebook via family members, and it's just kind of... well beyond the point of trying to suppress. Probably just best to accept people can imagine you naked, at any age, doing any thing. What's your neighbor doing with the images saved from his Ring camera pointed at the sidewalk? :shrug:

I am not talking about a 4chan poster. I am talking about a company doing it.

Don't have a blog photo in the first place.

> But I can complain about what happens to said something

no.

> but ...

no.

Sure, and if I put out a local lending library box in my front yard I shouldn't be annoyed by the neighbor who takes every book out of it and throws it in the trash.

Decorum and respect expectations don't disappear the moment it's technically feasible to be an asshole.

That's a bad analogy. Most people including me do expect that their "public" data is used for AI training. I mean, based on the ads everyone gets, most people know perfectly well that anything they post online will be used for AI.

Are you trying to argue that 10 years ago, when I uploaded my resume to LinkedIn, I should have known it'd be used for AI training?

Or that a teenager who signed up for Facebook should have known that the embarrassing things they post are going to train AI and are, as you called it, public?

What about the blog I started 25 years ago and then took down, which lives on in the GeoCities archive? Was I supposed to know it'd go to an AI overlord corporation when I was in middle school, writing about dragon photos I found on Google?

And we're not even getting into data breaches, or something that was uploaded as private and then sold when the corporation changed their privacy policy decades after it was uploaded.

It's not a bad analogy when you don't give all the graces to corporations and none to the exploited.

"Corporations".... you gave access to the whole world, including criminals.

> Most people including me do expect that their "public" data is used for AI training.

Based on what ordinary people have been saying, I don't think this is true. Or, maybe it's true now that the cat is out of the bag, but I don't think most people expected this before.

Most tech-oriented people did, of course, but we're a small minority. And even amongst our subculture, a lot of people didn't see this abuse coming. I didn't, or I would have removed all of my websites from the public web years earlier than I did.

> Most tech-oriented people did

In fact, it's the opposite. People who aren't into tech think Instagram is listening to them 24/7 to tailor their feed and ads. There was even a hoax in my area among elderly groups that WhatsApp was using profile photos in illegal activity, and many people removed their photos at one point.

> I didn't, or I would have removed all of my websites from the public web years earlier than I did.

Your comment is public information. In fact, posting anything on HN is a surefire way to hand your content over for AI training.

> People who aren't into tech think Instagram is listening to them 24/7 to tailor their feed and ads

True, but that's worlds away from thinking that your data will be used to train genAI.

> In fact, posting anything on HN is a surefire way to hand your content over for AI training

Indeed so, but HN seems to be a bad habit I just can't kick. However, my comments here are the entirety of what I put up on the open web and I intentionally keep them relatively shallow. I no longer do long-form blogging or make any of my code available on the open web.

However, you're right. Leaving HN is something that I need to do.

No, the average person has no idea what “AI training” even is. Should the average person have an above-average IQ? Yes. Could they? No. Don’t be average yourself.

Seriously, when YOU posted something on the Internet 20 years ago, did you expect it to be used by a corporation to train an AI 20 years later?

Data sourcing has been a discussion, at least in AI circles, for much longer than 20 years.

So if you are asking me, I would have to say yes. I cannot speak for the original poster.

What if someone else posts your personal data on the public internet and it gets collected into a dataset like this?

How is that not a different story?

A hidden camera can make your bedroom public. Don't do it if you don't want it to be on pay-per-view?

That is indeed what Justin.tv did, to much success. But that was because Justin had consented to doing so, just as anyone who posts something online consents to it being seen by anyone.

Your analogy doesn't hold. A 'hidden camera' would be either malware that does data exfiltration, or the company selling/training on your data outside of the bounds of its terms of service.

A more apt analogy would be someone recording you in public, or an outside camera pointed at your wide-open bedroom window.

We've got plenty of examples where Microsoft (owner of LinkedIn) is OK with spying on their users using methods akin to malware.

People who've put data on LinkedIn had some expectation of privacy at a certain point. But this is exactly why, after they were acquired, I deleted everything from LinkedIn other than a bare-minimum profile that links out to my personal site.

Microsoft, Google, Meta, OpenAI... None of them should be trusted by anyone at this point. They've all lied and stolen user data. People have taken their own lives over legal retaliation for doing far less than these people, who hide behind corporate logos and suck up any and all information because they've been entitled to never deal with the consequences.

They've all broken their own ToS under an air of: OK for me, not for thee. So, yes, the hidden camera is a great analogy. All of these companies, and the people running them, are cancers in and on society.

Does this analogy really apply? Maybe I'm misunderstanding, but it seems like all of this data was publicly available already, and scraped from the web.

In that case, it's not a 'hidden camera'... users uploaded this data and made it public, right? I'm sure some of it was due to misconfiguration or whatever (like we see with Tea), but it seems like most of this was uploaded by the user to the clear web. I'm all for "don't blame the victims", but if you upload your CC to Imgur, I think you deserve to have to get a new card.

Per the article "CommonPool ... draws on the same data source: web scraping done by the nonprofit Common Crawl between 2014 and 2022."
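
Since the source is Common Crawl, you can actually check whether a given site was captured: they run a public CDX index API. A rough sketch, assuming the standard pywb-style parameters (url, output=json); "CC-MAIN-2022-05" is just one example crawl label from the 2014-2022 window the article mentions, and the domain below is hypothetical:

    # Rough sketch: ask Common Crawl's CDX index which captures exist for
    # a URL pattern. "CC-MAIN-2022-05" is one example crawl; swap in others.
    # Note: the index replies 404 when there are no matches, which urlopen
    # surfaces as an HTTPError.
    import json
    import urllib.parse
    import urllib.request

    INDEX = "https://index.commoncrawl.org/CC-MAIN-2022-05-index"

    def captures(url_pattern: str):
        """Yield (timestamp, status, url) for each capture record."""
        query = urllib.parse.urlencode({"url": url_pattern, "output": "json"})
        with urllib.request.urlopen(f"{INDEX}?{query}", timeout=30) as resp:
            for line in resp:
                rec = json.loads(line)
                yield rec.get("timestamp"), rec.get("status"), rec.get("url")

    # Hypothetical domain; the trailing * matches everything under it.
    for ts, status, url in captures("example.com/*"):
        print(ts, status, url)

Finding a capture there doesn't tell you it ended up in CommonPool specifically, but it's a decent first look at what the 2014-2022 crawls picked up.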


AI and scraping companies are why we can't have nice things.

Of course privacy law doesn't necessarily agree with the idea that you can just scrape private data, but good luck getting that enforced anywhere.