Mistral's products are supposed to be, at least, since they are based in the EU.
I am not sure Mistral is: if you go to their GDPR page (https://help.mistral.ai/en/articles/347639-how-can-i-exercis...) and then to the erasure request section, they just link to a "How can I delete my account?" page.
Unfortunately they don't provide information regarding their training sets (https://help.mistral.ai/en/articles/347390-does-mistral-ai-c...), but I think it's safe to assume it includes DataComp CommonPool.
GDPR has plenty of language related to reasonability, cost, feasibility, and technical state of the art that probably means LLM providers do not have to comply in the same way, say, a social platform might.
This just demonstrates how bad the GDPR is, rather than how bad the LLMs are, though.
China must be laughing.
So your best bet is an open-weight LLM then?
But is that a breach of GDPR?
There is currently no effective method for unlearning information, especially not when you don't have access to the original training datasets (as is the case with open-weight models); see:
Rethinking Machine Unlearning for Large Language Models
https://arxiv.org/html/2402.08787v6
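To make that concrete: the usual baseline discussed in that literature is "approximate unlearning", i.e. fine-tune with gradient ascent on the examples you want forgotten while anchoring behaviour on a retain set. Here is a rough, purely illustrative sketch (generic PyTorch, toy model, not code from the paper); note that both batches it consumes are exactly the data you don't have when all you were given is a set of open weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyLM(nn.Module):
    # Stand-in for a real causal LM: embeds tokens and maps them to vocab logits.
    def __init__(self, vocab=100, dim=32):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.head = nn.Linear(dim, vocab)

    def forward(self, input_ids):
        return self.head(self.emb(input_ids))  # (batch, seq, vocab)

def unlearning_step(model, forget_batch, retain_batch, optimizer, alpha=1.0):
    """One step of gradient-ascent 'forgetting', regularized by a retain set."""
    optimizer.zero_grad()

    def lm_loss(batch):
        logits = model(batch["input_ids"])
        return F.cross_entropy(logits.view(-1, logits.size(-1)),
                               batch["labels"].view(-1))

    forget_loss = lm_loss(forget_batch)   # needs the data to be forgotten...
    retain_loss = lm_loss(retain_batch)   # ...and a sample of everything else

    # Ascend on the forget set, descend on the retain set.
    (-forget_loss + alpha * retain_loss).backward()
    optimizer.step()
    return forget_loss.item(), retain_loss.item()

if __name__ == "__main__":
    torch.manual_seed(0)
    model = ToyLM()
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
    make = lambda: {"input_ids": torch.randint(0, 100, (4, 16)),
                    "labels": torch.randint(0, 100, (4, 16))}
    print(unlearning_step(model, make(), make(), opt))
```

Even when you do have the data, this kind of procedure degrades the model unpredictably, which is part of why the paper argues unlearning for LLMs needs rethinking.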
Only if it contains personal data you collected without explicit consent ("explicit" here means literally asking: "I want to use this data for that purpose, do you allow this? Y/N").
Also, people who have given their consent before need to be able to revoke it at any point.
> need to be able to revoke it at any point.
They also have to be able to ask how much of their data (if any) is being used, and how.
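For what it's worth, the consent mechanics themselves are not the hard part: purpose-bound consent, revocation at any time, and a "what of mine are you using?" view is a small data model. A minimal, purely hypothetical sketch (names and fields are made up, not any legal or library standard):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class ConsentRecord:
    subject_id: str
    purpose: str                      # e.g. "train generative image model"
    granted_at: datetime
    revoked_at: Optional[datetime] = None

    @property
    def active(self) -> bool:
        return self.revoked_at is None

@dataclass
class ConsentLedger:
    records: list = field(default_factory=list)

    def grant(self, subject_id: str, purpose: str) -> ConsentRecord:
        rec = ConsentRecord(subject_id, purpose, datetime.now(timezone.utc))
        self.records.append(rec)
        return rec

    def revoke(self, subject_id: str, purpose: str) -> None:
        # Revocation must be possible at any time; mark rather than delete,
        # so the audit trail survives.
        for rec in self.records:
            if rec.subject_id == subject_id and rec.purpose == purpose and rec.active:
                rec.revoked_at = datetime.now(timezone.utc)

    def access_report(self, subject_id: str) -> list:
        # Roughly answers a subject access request: what is used, for what,
        # and is the consent still active.
        return [{"purpose": r.purpose,
                 "granted_at": r.granted_at.isoformat(),
                 "active": r.active}
                for r in self.records if r.subject_id == subject_id]

if __name__ == "__main__":
    ledger = ConsentLedger()
    ledger.grant("user-42", "train generative image model")
    ledger.revoke("user-42", "train generative image model")
    print(ledger.access_report("user-42"))
```

The hard part is what the thread is actually about: honouring a revocation once the data has already been baked into model weights.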
So the EU has basically locked itself out of the AI space?
idk, but how can we do that while staying GDPR-compliant, etc.?
I'm sure it's possible, but AI companies don't invest much money into complying with the law as it's not profitable.
A normal industry would've figured out how to deal with this problem before going public, but AI people don't seem to be all that interested.
I'm sure they'll all cry foul if one of them gets hit with a fine and an order to figure out how to fix the mess they've created, but this is what you get when you don't teach ethics to computer scientists.
> A normal industry would've figured out how to deal with this problem before going public, but AI people don't seem to be all that interested.
China is already dominating AI; you are asking the few companies in the West to stop completely.
The regulation is anti-growth and anti-technology - the GDPR, DSA, Cybersecurity Act and AI Act (and future Chat Control / Online Safety Act equivalent) must be repealed if Europe is to have any hope of a future tech industry.
While my question was in relation to GDPR there are similar laws in the UK (DPA) and in California (CCPA).
Also note that AI is not just generative models, and generative models don't need to be trained with personal data.
So basically EU citizens could sue all AI providers?
I don't think they have to? The bodies in charge can simply levy the fine, up to 4% of worldwide turnover I think. The weird case is if all the companies self-hosting open-weight models become liable as data controllers and processors. If that were aggressively penalized, it could kill AI use within the EU, and depending on the math, such companies might choose to pull operations from the EU instead of giving up use of AI.
Edit: that last bit is probably catastrophic thinking. Enforcement has historically been calibrated to produce compliance rather than withdrawal from the market.
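For scale: the Art. 83(5) ceiling is the higher of EUR 20 million or 4% of total worldwide annual turnover, so for the large providers the 4% branch dominates. A trivial illustration (the turnover figure below is made up):

```python
# GDPR Art. 83(5): fine ceiling is the *higher* of EUR 20M or 4% of total
# worldwide annual turnover.
def max_gdpr_fine(annual_turnover_eur: float) -> float:
    return max(20_000_000, 0.04 * annual_turnover_eur)

# Hypothetical EUR 300B turnover -> 12,000,000,000 EUR ceiling.
print(f"{max_gdpr_fine(300_000_000_000):,.0f}")
```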
I don't think killing the use of AI is the only thing that could happen.
You can’t steal something and avoid punishment just because you don’t sell in the country where the theft happened.
You absolutely can, depending on the countries involved. A recent extreme example is hackers in NK stealing cryptocurrency. A more regular one is Chinese manufacturers stealing designs. If the countries where the thieves live and operate won't prosecute them, there is no recourse. The question for multinationals is whether continuing to operate in the EU is worth giving up their models, and whether the countries they are headquartered in care or can be made to.
If those countries still want to enforce their IP in the EU, I guess they will.
Tit for tat.
NK isn’t really a business partner in the world.
But China is, and western countries including those in the EU have frequently ignored such things. Looking closer, this really only affects diffusion models, which are much cheaper to retrain. The exception is integrated models like Gemini and GPT-4V, where retraining might reasonably cost more than the fine. Behemoths like Google and OpenAI won't bail over a few hundred million, unless they see it is likely to happen repeatedly, in which case they would likely decouple the image models. But there is nothing to say some widely used text dataset isn't contaminated as well. Maybe only China will produce models in the future. They don't care if you enforce their IP.
Edit: After more reading: Clearview AI did exactly this; they ignored all the EU rulings, and the UK refused to enforce them. They were fined tens of millions and paid nothing. Stability is now also a UK company that used PI images for training; it seems quite likely they will try to walk that same path given their financial situation. Meta is facing so many fines and lawsuits, who knows what it will do. Everyone else will call it the cost of business while fighting it every step of the way.