Just a guess, but they may store the original ID card to audit duplicate accounts.

If their machine learning models think two people are the same person, having the original image, especially a photo of the same ID card, could confirm it.

There are image processing methods for hashing people's faces. They don't have to store the actual photo to do that.

Models have racial biases and struggle with aged faces and look-alikes.

You don't have to use ML models for this.

Can you elaborate more? Discord has 656m users. If 10% upload their ID, they'd have 65m ID photos to search through. There are 2 use-cases here:

1/ Safety Bans (let's pretend 0.01% of ID card users have been banned for safety reasons: 650k accounts)

If a user submits their selfie/ID card, Discord needs to compare the new image with one of the 650k banned (but deleted?) images. I can't possibly imagine how a human could remember 650k photos well enough to declare a match.

Even if such a human with perfect recall existed, there can't be many of them on this planet to hire.

2/ Duplicate account bans

If a user registers, how can a support staff search the 65m photos without ML assistance to determine if this is a new user or a fraudster?
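In practice a search like this would be done with face embeddings plus nearest-neighbor search, not human review. A minimal brute-force sketch of the matching step, assuming the embeddings have already been produced by some face model (the vectors and the 0.9 threshold here are made-up placeholders; a real system over 65m entries would use an approximate nearest-neighbor index rather than a linear scan):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def closest_matches(query: list[float],
                    gallery: list[list[float]],
                    threshold: float = 0.9) -> list[tuple[int, float]]:
    """Return (index, score) pairs from `gallery` scoring above `threshold`,
    best match first. A human reviewer would only ever see this shortlist."""
    hits = [(i, cosine_similarity(query, emb)) for i, emb in enumerate(gallery)]
    return sorted((h for h in hits if h[1] >= threshold), key=lambda h: -h[1])
```

The point being: the model narrows 65m candidates down to a handful, and a human only adjudicates the shortlist.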

If they can't handle that many users then they should close signups.

The product scales, but safely using users' data doesn't? Hardly an excuse.

0.01% of 65M is 6,500. Also apparently only 70K people uploaded their IDs.

That being said, you can still hash faces and metadata (such as ID numbers) instead of storing the whole scanned ID photo, if the information is only used for duplicate checking. Hashing doesn't add racial bias; if your model is biased, it carries that margin of error either way.
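For the metadata part, a sketch of what "hash instead of store" could look like: a keyed hash (HMAC) of the ID number, so the stored fingerprint can be compared for duplicates but a short ID number can't be brute-forced from a leaked digest. The key name and normalization rules are illustrative assumptions, not anyone's actual scheme:

```python
import hmac
import hashlib

# Hypothetical server-side secret; a real deployment would keep this in a KMS,
# never in source code.
SERVER_KEY = b"example-secret-key"

def fingerprint_id_number(id_number: str) -> str:
    """Keyed hash of an ID number for duplicate checks.

    A plain SHA-256 of a short ID number is brute-forceable, so we use an
    HMAC with a server-side key. Only this digest is stored; the raw ID
    number is discarded after verification.
    """
    normalized = id_number.strip().upper().replace(" ", "")
    return hmac.new(SERVER_KEY, normalized.encode(), hashlib.sha256).hexdigest()
```

Two submissions of the same ID number (modulo spacing/case) produce the same fingerprint, which is all a duplicate check needs.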

Do you understand how image hashing works? You don't need machine learning just to check if two images are potentially identical.

IMHO this is a pretty dumb approach to the problem.

While there are probably some countries with terribly designed passports, most are designed to be machine readable, even with very old (>10-year-old) OCR systems.

So even if you want to do something like that, you can extract all the relevant information and store just that, maybe also extracting the photo.
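The machine-readable zone (MRZ) on passports is designed for exactly this kind of extraction: fixed-width fields with check digits, per ICAO Doc 9303. A sketch of the check-digit computation used to validate an extracted field (digits keep their value, letters map A=10..Z=35, the filler `<` counts as 0, weighted 7-3-1):

```python
def mrz_check_digit(field: str) -> int:
    """ICAO 9303 check digit over an MRZ field.

    Digits keep their value, letters map A=10..Z=35, and the filler
    character '<' counts as 0; positions are weighted 7, 3, 1 repeating,
    and the check digit is the weighted sum modulo 10.
    """
    weights = (7, 3, 1)
    total = 0
    for i, ch in enumerate(field):
        if ch.isdigit():
            value = int(ch)
        elif ch == "<":
            value = 0
        else:
            value = ord(ch) - ord("A") + 10
        total += value * weights[i % 3]
    return total % 10
```

So after OCR you can verify each extracted field against its check digit before storing it, which catches most read errors.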

This seems pointless at first, but isn't: if you store a full photocopy, anyone who steals it can use it to impersonate someone; if they can only steal the extracted information, that's much harder.

Outside of impersonation issues, another problem is that IDs/passports often technically count as property of the state, so you might not be allowed to store full photocopies of them, and the person they were issued to can't give you permission either (since, technically speaking, they don't own the passport). Most of the time that doesn't matter, but if a country wants to screw with you, holding images of IDs/passports is a terrible idea.

But then you should also ask yourself what degree of "duplicate" protection you actually need, which isn't a perfect one. If someone can only circumvent it by spending multiple thousands to end up with a new full name plus a fudged ID image, that isn't something a company like Discord really needs to care about. In other words, storing a subset of the information on a passport, potentially hashed, is sufficient for well over 90% of companies' secondary-account-prevention needs.

In the end, the reason a company might store the whole photo is that it's convenient: you can retroactively apply whatever better model you want to use, and in many places the penalties for a data breach aren't too big. So you might even start out with an "it's bad, but only for a short time while we build a better system" situation, and then, given the mild consequences of not fixing it (or lack of awareness), fixing it is constantly de-prioritized and never happens...

Just store the name and the fact that it was verified and delete the photo. You get what you need without holding on to a massive liability.

How does this help you identify duplicate accounts? If the original photo is deleted, do you just trust the model to be correct 100% of the time when it rejects the newly created account? Or do you keep the original photo and allow a human to make a final decision?

There are a million other signals for duplicate accounts anyway. Location, OS, device fingerprints, communities joined, etc. If those match and real name matches that’s enough data.
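Combining those signals is usually just a weighted score. A toy sketch to make the idea concrete (the signal names and weights are entirely made up; a real system would tune them on labeled data):

```python
# Hypothetical weights for duplicate-account signals; these sum to 1.0
# so the score reads as a rough 0-1 confidence.
SIGNAL_WEIGHTS = {
    "same_device_fingerprint": 0.4,
    "same_location": 0.15,
    "same_os": 0.05,
    "shared_communities": 0.15,
    "same_real_name": 0.25,
}

def duplicate_score(signals: dict[str, bool]) -> float:
    """Sum the weights of the signals that matched between two accounts.
    Scores above some tuned threshold (say 0.7) would flag a likely duplicate."""
    return sum(w for name, w in SIGNAL_WEIGHTS.items() if signals.get(name))
```

No ID photos needed: a device fingerprint plus a matching real name already clears most reasonable thresholds.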

And if a few people manage to slip through, it's not really an issue. They will either get banned again for the same reasons or stop violating the rules, so who cares.

The best years online were when it was universally recognized that government IDs are completely unsuitable for interaction with the internet in any way.

As it had been since the beginning, when government IDs first became a thing.