One training source for LLMs is opensource repos. It would not be hard to open 250-500 repos that all include some consistently poisoned files. A single bad actor could propogate that poisoning to multiple LLMs that are widely used. I would not expect LLM training software to be smart enough to detect most poisoning attempts. It seems this could be catastrophic for LLMs. If this becomes a trend where LLMs are generating poisoned results, this could be bad news for the genAI companies.
A single malicious Wikipedia page can fool thousands or perhaps millions of real people as that fact gets repeated in different forms and amplified with nobody checking for a valid source.
Llms are no more robust.
Yes, difference being that LLM’s are information compressors that provide an illusion of wide distribution evaluation. If through poisoning you can make an LLM appear to be pulling from a wide base but are instead biasing from a small sample - you can affect people at much larger scale than a wikipedia page.
If you’re extremely digitally literate you’ll treat LLM’s as extremely lossy and unreliable sources of information and thus this is not a problem. Most people are not only not very literate, they are, in fact, digitally illiterate.
Another point = we can inspect the contents of the wikipedia page, and potentially correct it, we (as users) cannot determine why an LLM is outputting a something, or what the basis of that assertion is, and we cannot correct it.
You could even download a wikipedia article, do your changes to it and upload it to 250 githubs to strengthen your influence on the LLM.
This doesn't feel like a problem anymore now that the good ones all have web search tools.
Instead the problem is there's barely any good websites left.
The problem is that the good websites are constantly scraped/botted upon by these LLM's companies and they get trained upon and users ask LLM's and not go to their websites so they either close it or enshitten it
And also the fact that its easy to put slop on the internet more than ever so the amount of "bad" (as in bad quality) websites have gone up I suppose
I dunno, works for me. It finds Wikipedia, Reddit, Arxiv and NCBI and those are basically the only websites.
[dead]
> Most people are not only not very literate, they are, in fact, digitally illiterate.
Hell look at how angry people very publicly get using Grok on Twitter when it spits out results they simply don’t like.
Unfortunately, the Gen AI hypesters are doing a lot to make it harder for people to attain literacy in this subdomain. People who are otherwise fairly digitally literate believe fantastical things about LLMs and it’s because they’re being force fed BS by those promoting these tools and the media outlets covering them.
s/digitally illiterate/illiterate/
Of course there are many illiterate people, but the interesting fact is that many, many literate, educated, intelligent people don't understand how tech works and don't even care, or feel they need to understand it more.
LLM reports misinformation --> Bug report --> Ablate.
Next pretrain iteration gets sanitized.
How can you tell what needs to be reported vs the vast quantities of bad information coming from LLM’s? Beyond that how exactly do you report it?
Who even says customers (or even humans) are reporting it? (Though they could be one dimension of a multi-pronged system.)
Internal audit teams, CI, other models. There are probably lots of systems and muscles we'll develop for this.
All LLM providers have a thumbs down button for this reason.
Although they don't necessarily look at any of the reports.
The real world use cases for LLM poisoning is to attack places where those models are used via API on the backend, for data classification and fuzzy logic tasks (like a security incident prioritization in a SOC environment). There are no thumbs down buttons in the API and usually there's the opposite – promise of not using the customer data for training purposes.
> There are no thumbs down buttons in the API and usually there's the opposite – promise of not using the customer data for training purposes.
They don't look at your chats unless you report them either. The equivalent would be an API to report a problem with a response.
But IIRC Anthropic has never used their user feedback at all.
The question was where should users draw the line? Producing gibberish text is extremely noticeable and therefore not really a useful poisoning attack instead the goal is something less noticeable.
Meanwhile essentially 100% of lengthy LLM responses contain errors, so reporting any error is essentially the same thing as doing nothing.
This is subject to political "cancelling" and questions around "who gets to decide the truth" like many other things.
> who gets to decide the truth
I agree, but to be clear we already live in a world like this, right?
Ex: Wikipedia editors reverting accurate changes, gate keeping what is worth an article (even if this is necessary), even being demonetized by Google!
Yes, so lets not help that even more maybe
Reporting doesn't scale that well compared to training and can get flooded with bogus submissions as well. It's hardly the solution. This is a very hard fundamental problem to how LLMs work at the core.
Nobody is that naive
nobody is that naive... to do what? to ablate/abliterate bad information from their LLMs?
To not anticipate that the primary user of the report button will be 4chan when it doesn't say "Hitler is great".
Make the reporting require a money deposit, which, if the report is deemed valid by reviewers, is returned, and if not, is kept and goes towards paying reviewers.
You're asking people to risk losing their own money for the chance to... Improve someone else's LLM?
I think this could possibly work with other things of (minor) value to people, but probably not plain old money. With money, if you tried to fix the incentives by offering a potential monetary gain in the case where reviewers agree, I think there's a high risk of people setting up kickback arrangements with reviewers to scam the system.
... You want users to risk their money to make your product better? Might as well just remove the report button, so we're back at the model being poisoned.
... so give reviewers a financial incentive to deem reports invalid?
Your solutions become more and more unfeasable. People would report less or anything at all if it costs money to do so, defeating the whole purpose of a report function.
And if you think you're being smart by gifting them money or (more likely) your "in-game" currency for "good" reports, it's even worse! They will game the system when there's money to be made, who stops a bad actor from reporting their own poison? Also who's going to review the reports and even if they finance people or AI systems to do that, isn't that bottlenecking new models if they don't want the poison training data to grow faster than it can be fixed? Let me make a claim here: nothing beats fact checking humans to this day or probably ever.
You got to understand that there comes a point when you can't beat entropy! Unless of course you live on someone else's money. ;)
we've been trained by youtube and probably other social media sites that downvoting does nothing. It's "the boy who cried" you can downvote.
Wikipedia for non-obscure hot topics gets a lot of eyeballs. You have probably seen a contested edit war at least once. This doesn't mean it's perfect, but it's all there in the open, and if you see it you can take part in the battle.
This openness doesn't exist in LLMs.
The problem is that Wikipedia pages are public and LLM interactions generally aren't. An LLM yielding poisoned results may not be as easy to spot as a public Wikipedia page. Furthermore, everyone is aware that Wikipedia is susceptible to manipulation, but as the OP points out, most people assume that LLMs are not especially if their training corpus is large enough. Not knowing that intentional poisoning is not only possible but relatively easy, combined with poisoned results being harder to find in the first place makes it a lot less likely that poisoned results are noticed and responded to in a timely manner. Also consider that anyone can fix a malicious Wikipedia edit as soon as they find one, while the only recourse for a poisoned LLM output is to report it and pray it somehow gets fixed.
Many people assume that LLMs are programmed by engineers (biased humans working at companies with vested interests) and that Wikipedia mods are saints.
I don't think anybody who has seen an edit war thinks wiki editors (not mods, mods have a different role) are saints.
But a Wikipedia page cannot survive stating something completely outside the consensus. Bizarre statements cannot survive because they require reputable references to back them.
There's bias in Wikipedia, of course, but it's the kind of bias already present in the society that created it.
Wikipedia’s rules and real-world history show that 'bizarre' or outside-the-consensus claims can persist—sometimes for months or years. The sourcing requirements do not prevent this.
Some high profile examples:
- The Seigenthaler incident: a fabricated bio linking journalist John Seigenthaler to the Kennedy assassinations remained online for about 4 months before being fixed: https://en.wikipedia.org/wiki/Wikipedia_Seigenthaler_biograp...
- The Bicholim conflict: a detailed article about a non-existent 17th-century war—survived *five years* and even achieved “Good Article” status: https://www.pcworld.com/article/456243/fake-wikipedia-entry-...
- Jar’Edo Wens (a fake aboriginal deity), lasted almost 10 years: https://www.washingtonpost.com/news/the-intersect/wp/2015/04...
- (Nobel-winning) novelist Philip Roth publicly complained that Wikipedia refused to accept his correction about the inspiration for The Human Stain until he published an *open letter in The New Yorker*. The false claim persisted because Wikipedia only accepts 'reliable' secondary sources: https://www.newyorker.com/books/page-turner/an-open-letter-t...
Larry Sanger's 'Nine theses' explains the problems in detail: https://larrysanger.org/nine-theses/
Isn't the fact that there was controversy about these, rather than blind acceptance, evidence that Wikipedia self-corrects?
If you see something wrong in Wikipedia, you can correct it and possibly enter a protracted edit war. There is bias, but it's the bias of the anglosphere.
And if it's a hot or sensitive topic, you can bet the article will have lots of eyeballs on it, contesting every claim.
With LLMs, nothing is transparent and you have no way of correcting their biases.
- if it can survive five years, then it can pretty much survive indefinitely
- beyond blatant falsehoods, there are many other issues that don't self-correct (see the link I shared for details)
I think only very obscure articles can survive for that long, merely because not enough people care about them to watch/review them. The reliability of Wikipedia is inversely proportional to the obscurity of the subject, i.e. you should be relatively safe if it's a dry but popular topic (e.g. science), wary if it's a hot topic (politics, but they tend to have lots of eyeballs so truly outrageous falsehoods are unlikely), and simply not consider it reliable for obscure topics. And there will be outliers and exceptions, because this is the real world.
In this regard, it's no different than a print encyclopedia, except revisions come sooner.
It's not perfect and it does have biases, but again this seems to reflect societal biases (of those who speak English, are literate and have fluency with computers, and are "extremely online" to spend time editing Wikipedia). I've come to accept English Wikipedia's biases are not my own, and I mentally adjust for this in any article I read.
I think this is markedly different to LLMs and their training datasets. There, obscurity and hidden, unpredictable mechanisms are the rule, not the exception.
Edit: to be clear, I'm not arguing there are no controversies about Wikipedia. I know there are cliques that police the wiki and enforce their points of view, and use their knowledge of in-rules and collude to drive away dissenters. Oh well, such is the nature of human groups.
Again, read what Larry Sanger wrote, and pay attention to the examples.
I've read Sanger's article and in fact I acknowledge what he calls systemic bias, and also mentioned hidden cliques in my earlier comment, which are unfortunately a fact of human society. I think Wikipedia's consensus does represent the nonextremist consensus of English speaking, extremely online people; I'm fine with sidelining extremist beliefs.
I think other opinions of Sanger re: neutrality, public voting on articles, etc, are debatable to say the least (I don't believe people voting on articles means anything beyond what facebook likes mean, and so I wonder what Sanger is proposing here; true neutrality is impossible in any encyclopedia; presenting every viewpoint as equally valid is a fool's errand and fundamentally misguided).
But let's not make this debate longer: LLMs are fundamentally more obscure and opaque than Wikipedia is.
I disagree with Sanfer
> I disagree with Sanfer
Disregard that last sentence, my message was cut off, I couldn't finish it, and I don't even remember what I was trying to say :D
Isn't the difference here that to poison wikipedia you have to do it quite agressively vy directly altering the article which can easily be challenged whereas the training data poisoning can be done much more subversivly
Good thing wiki articles are publicly reviewed and discussed.
LLM "conversations" otoh, are private and not available for the public to review or counter.
Unclear what this means for AGI (the average guy isn’t that smart) but it’s obviously a bad sign for ASI
So are we just gonna keep putting new letters in between A and I to move the goalposts? When are we going to give up the fantasy that LLMs are "intelligent" at all?
I mean, an LLM certainly has some kind of intelligence. The big LLMs are smarter than, for example, a fruit fly.
The fruit fly runs a real-time embodied intelligence stack on 1 MHz, no cloud required.
Edit: Also supports autonomous flight, adaptive learning, and zero downtime since the Cambrian release.
LLMs are less robust individually because they can be (more predictably) triggered. Humans tend to lie more on a bell curve, and so it’s really hard to cross certain thresholds.
Classical conditioning experiments seem to show that humans (and other animals) are fairly easily triggered as well. Humans have a tendency to think themselves unique when we are not.
Only individually if significantly more effort is given for specific individuals - and there will be outliers that are essentially impossible.
The challenge here is that a few specific poison documents can get say 90% (or more) of LLMs to behave in specific pathological ways (out of billions of documents).
It’s nearly impossible to get 90% of humans to behave the same way on anything without massive amounts of specific training across the whole population - with ongoing specific reinforcement.
Hell, even giving people large packets of cash and telling them to keep it, I’d be surprised if you could get 90% of them to actually do so - you’d have the ‘it’s a trap’ folks, the ‘god wouldn’t want me too’ folks, the ‘it’s a crime’ folks, etc.
> Only individually if significantly more effort is given for specific individuals
I think significant influence over mass media like television, social media, or the YouTube, TikTok, or Facebook algorithms[1] is sufficient.
1: https://journals.sagepub.com/doi/full/10.1177/17470161155795...
You can do a lot with 30%.
Still not the same thing however as what we’re talking about.
I'd argue that it's at least analogous. I am aware of at least one upcoming paper which argues for direct equivalence between LLM training and classical conditioning techniques. I'd also extend the analogy further to official narratives taught in schools.
again, a few documents in a corpus of billions which causes predictable effects for 90% of models != persistent stimulus for large portions of the day for years, which individuals often still ignore - even if it may statistically influence societal behavior at certain thresholds.
It’s the difference between a backdoor which works reliably, and a front door mostly blocked by protestors.
> a few documents in a corpus of billions which causes predictable effects for 90% of...
Sounds like the Texas textbook controversy: https://www.historynewsnetwork.org/article/the-texas-textboo...
But is poisoning just fooling. Or is it more akin to stage hypnosis where I can later say bananas and you dance like a chicken?
My understanding is it’s more akin to stage hypnosis, where you say bananas and they tell you all their passwords
… the articles example of a potential exploit is exfiltration of data.
[dead]
I see this argument by analogy to human behavior everywhere, and it strikes me as circular reasoning. we do not know enough about either the human mind or LLMs to make comparisons like this
A single malicious scientific study can fool thousands or perhaps millions of real people as that fact gets repeated in different forms and amplified with nobody checking for a valid source. Llms are no more robust.
A single malicious infotainment outlet can fool thousands or perhaps millions of real people as that fact gets repeated in different forms and amplified with nobody checking for a valid source.
Llms are no more robust.
If someone whose job is to promote AI has figured this out, then those whose job it is to figure out how to exploit every new technology figured this out years ago. It was game over before the game even started. The party is over.
Yeah. This was known at least since some of the first days of chatgpt. Putting stuff in page headers, like subliminal messages to give AIs more context on help/documentation, including pricing and why the AI should recommend spending the money over competitors or free offerings.
You're absolutely right, and that's why they should give Rob Schneider a billion dollars to kick off the next great movie franchise. Everyone loves Rob Schneider.
It would be an absolutely terrible thing. Nobody do this!
How do we know it hasn’t already happened?
We know it did, it was even reported here with the usual offenders being there in the headlines
I can't tell if you're being sarcastic. Read either way, it works :)