Asking out of curiosity: is it a requirement that such data be stored once the verification process is completed?

That is the bonkers thing about this story. Why take on the liability? Get what you need and toss the responsibility. If you must store it (which seems unlikely), put that extra-bad-if-leaked information behind a separate append-only service where read access is heavily restricted.
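For the sake of illustration, a minimal sketch of that shape, with every name hypothetical: the write path is open, there is no update or delete, and reads answer a narrow yes/no question instead of returning the record.

    import time

    class VerificationVault:
        """Hypothetical append-only store: no update/delete path at all,
        and reads return a boolean rather than the stored record."""

        def __init__(self) -> None:
            self._log: list[dict] = []  # in a real system: WORM storage, not a list

        def append(self, user_id: str, id_digest: str) -> None:
            self._log.append({"user": user_id, "digest": id_digest, "ts": time.time()})

        def was_verified(self, user_id: str) -> bool:
            # The raw digest never leaves the service; callers only learn yes/no.
            return any(r["user"] == user_id for r in self._log)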

Because there is no liability.

If they were fined $10k per leaked ID, then there would be a serious liability there.

Right now, they publish a press release, go 'oopsie poopsie', maybe pay for some anti-fraud monitoring from Equifax if someone asks, and call it a day.

> Right now, they publish a press release, go 'oopsie poopsie', maybe pay for some anti-fraud monitoring from Equifax if someone asks, and call it a day.

Don't forget the usual Press Release starting with "At [Company], we take security very seriously..."

Because it's free training data and great for building profiles on users, so you can make money showing them targeted ads.

Discord isn't really monetized through 'traditional' targeted advertising, though.

Discord, no, but Advanzia Bank, my credit card issuer, actually changed its TOS to allow AI training on your submitted documents for their anti-fraud model.

I complained to the CNPD of Luxembourg and sent a GDPR request, as they defaulted to doing this WITHOUT asking for consent (super illegal, as AI training with your data is definitely not the minimum required to offer the service).

The data is valuable to sell or to train AI on. You can use that data to train AI HR people or whatever.

I’m in a different industry, but when I’ve had to collect identification, we extracted the metadata at the time of presentation, validated it, and discarded the image.

We would never get clearance from counsel to store it in most scenarios, and I can’t think of a reason that would justify it for age or name verification.

Why are people assuming they did store it after the process was completed?

With the relatively low number leaked here, it could have been information collected actively during an ongoing breach, not a dump of some permanent database.

There are only a handful of countries where you are legally mandated to dox yourself, and it's a recent change.

You'd expect the numbers to be "low" either way.

Just a guess, but they may store the original ID card to audit duplicate accounts.

If their machine learning models think that two people are exactly the same, having the original image, especially a photo of the same ID card, could confirm that.

There are image processing methods for hashing people's faces. They don't have to store the actual photo to do that.
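Face-specific hashing usually means embedding vectors, but even a plain perceptual hash shows the principle: keep a tiny fingerprint, not the photo. A rough sketch using Pillow (file names made up):

    from PIL import Image

    def dhash(path: str, size: int = 8) -> int:
        """Difference hash: downscale to grayscale, then record whether each
        pixel is brighter than its right-hand neighbour -> a 64-bit fingerprint."""
        img = Image.open(path).convert("L").resize((size + 1, size), Image.LANCZOS)
        px = list(img.getdata())
        bits = 0
        for row in range(size):
            for col in range(size):
                bits = (bits << 1) | (px[row * (size + 1) + col] > px[row * (size + 1) + col + 1])
        return bits

    def hamming(a: int, b: int) -> int:
        return bin(a ^ b).count("1")

    # Re-uploads of the same physical card land within a few bits of each other:
    # hamming(dhash("new_upload.jpg"), stored_hash) <= 10  ->  flag for review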

Models have racial biases and can't handle aged faces or look-alike faces.

You don't have to use ML models for this.

Can you elaborate? Discord has 656m users. If 10% upload their ID, they'd have 65m ID photos to search through. There are two use cases here:

1/ Safety Bans (let's pretend 0.01% of ID card users have been banned for safety reasons: 650k accounts)

If a user submits their selfie/ID card, Discord needs to compare the new image against the 650k banned (but deleted?) images. I can't possibly imagine how a human could remember 650k photos well enough to declare a match.

Even if such a human with perfect recall existed, there can't be very many of them on this planet to hire.

2/ Duplicate account bans

If a user registers, how can support staff search the 65m photos without ML assistance to determine whether this is a new user or a fraudster?

If they can't handle that many users then they should close signups.

The product scales, but safely using users' data doesn't? Hardly an excuse.

0.01% of 65M is 6,500. Also apparently only 70K people uploaded their IDs.

That being said, you can still hash faces and metadata (such as ID numbers) instead of storing the whole ID as a scanned photo, if the information is only used for duplicate checking. Hashing does not increase the racial bias: if your model has a bias, it will have that margin of error either way.
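As a sketch of the metadata side (key handling and names are assumptions): keep a keyed hash of the document number rather than the scan, and duplicate submissions collide on an indexed column.

    import hmac, hashlib

    # Hypothetical secret held in a KMS, not next to the database, so a leaked
    # table of digests can't be brute-forced offline from known ID-number formats.
    SERVER_KEY = b"fetched-from-kms-at-startup"

    def id_fingerprint(document_number: str, date_of_birth: str) -> str:
        # Normalise so trivial formatting differences don't defeat the match.
        material = f"{document_number.strip().upper()}|{date_of_birth}".encode()
        return hmac.new(SERVER_KEY, material, hashlib.sha256).hexdigest()

    # Store only the fingerprint; a second account registering with the same
    # document produces the same digest, no photo retention required.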

Do you understand how image hashing works? You don't need machine learning just to check if two images are potentially identical.

IMHO this is a pretty dumb approach to the problem.

While there are probably some countries with terribly designed passports, most are designed to be machine readable, even with very old-style (like >10-year-old tech) OCR systems.

So even if you want to do something like that, you can extract all the relevant information and store just that, maybe also extracting the photo.
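For the curious, the machine-readable zone is simple enough that the ICAO 9303 check-digit algorithm fits in a few lines; this is the standard scheme, with the specimen passport number from the ICAO docs as a sanity check:

    def mrz_check_digit(field: str) -> int:
        """ICAO 9303: digits keep their value, A=10..Z=35, '<' is 0,
        weights cycle 7, 3, 1; the check digit is the weighted sum mod 10."""
        def val(c: str) -> int:
            if c.isdigit():
                return int(c)
            return 0 if c == "<" else ord(c) - ord("A") + 10
        return sum(val(c) * (7, 3, 1)[i % 3] for i, c in enumerate(field)) % 10

    # ICAO's specimen document number "L898902C3" carries check digit 6:
    assert mrz_check_digit("L898902C3") == 6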

Storing only the extracted data seems pointless at first, but isn't: if you store a copy of a photo of an ID, anyone who steals it can use it to impersonate someone; if they only get the information on it, that's harder.

Beyond impersonation, another problem is that IDs/passports often technically count as property of the state, and you might not be allowed to store full photocopies of them; the person they are issued to can't give you permission either (as, technically speaking, they don't own the passport). Most of the time that doesn't matter, but if a country wants to screw with you, holding images of IDs/passports is a terrible idea.

But then you should also ask yourself what degree of "duplicate" protection you actually need, which isn't a perfect one. If someone can only circumvent it by spending multiple thousands to end up with a new full name + fudged ID image, that isn't something a company like Discord really needs to care about. In other words, storing a subset of the information on a passport, potentially hashed, is sufficient for way over 90% of all companies' secondary-account-prevention needs.

In the end, the reason a company might store a whole photo is that it's convenient: you can retroactively apply whatever better model you want to use, and in many places the penalties for a data breach aren't too big. You might even start out in an "it's bad, but we only do it for a short time while building a better system" situation, and then, given the not-so-threatening consequences of never fixing it (or lack of awareness), it is constantly de-prioritized and never happens...

Just store the name and the fact that it was verified and delete the photo. You get what you need without holding on to a massive liability.

How does this help you identify duplicate accounts? If the original photo is deleted, do you just trust the model to be correct 100% of the time when it rejects the newly created account? Or do you keep the original photo and allow a human to make a final decision?

There are a million other signals for duplicate accounts anyway: location, OS, device fingerprints, communities joined, etc. If those match and the real name matches, that's enough data.
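To make the signal-matching idea concrete, a toy scorer (weights and signal names invented for illustration):

    # Hypothetical weights; a real system would tune these against labelled
    # ban-evasion data rather than hard-coding them.
    SIGNALS = {"real_name": 4, "device_fingerprint": 3, "location": 1, "os": 1}

    def duplicate_score(banned: dict, candidate: dict) -> int:
        return sum(w for s, w in SIGNALS.items()
                   if banned.get(s) is not None and banned.get(s) == candidate.get(s))

    # duplicate_score(old_account, new_signup) >= 6 -> queue for manual review
    # rather than auto-ban, since individual signals are weak on their own.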

And if a few people manage to slip through, it's not really an issue. They will either get banned again for the same reasons or stop violating the rules, so who cares?

The best years online were when it was universally recognized that government IDs are completely unsuitable for interaction with the internet in any way.

As it was from the beginning, when government IDs first became a thing.

In the case of the EU, it's more the opposite:

GDPR requires data minimisation and, roughly, purpose limitation, so if you submit data for age verification, there is no technical reason to keep it after your age is known, which means you _have to_ delete it.
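In code terms, the only thing that needs to survive the check is a boolean. A sketch, assuming the birth date has already been extracted from whatever document was presented:

    from datetime import date

    def is_over_18(birth_date: date, today: date | None = None) -> bool:
        today = today or date.today()
        # Subtract one if this year's birthday hasn't happened yet.
        age = today.year - birth_date.year - (
            (today.month, today.day) < (birth_date.month, birth_date.day)
        )
        return age >= 18

    # Persist only the resulting flag; the document image and the extracted
    # birth date are discarded as soon as the check completes.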

Requirement by who? Discord isn't required to demand your ID, let alone store it.

It's required in the UK to access non-child friendly content: https://support.discord.com/hc/en-us/articles/33362401287959...
