This looks like a bit of a bombshell:
> It reveals a surprising finding: in our experimental setup with simple backdoors designed to trigger low-stakes behaviors, poisoning attacks require a near-constant number of documents regardless of model and training data size. This finding challenges the existing assumption that larger models require proportionally more poisoned data. Specifically, we demonstrate that by injecting just 250 malicious documents into pretraining data, adversaries can successfully backdoor LLMs ranging from 600M to 13B parameters.
One training source for LLMs is opensource repos. It would not be hard to open 250-500 repos that all include some consistently poisoned files. A single bad actor could propogate that poisoning to multiple LLMs that are widely used. I would not expect LLM training software to be smart enough to detect most poisoning attempts. It seems this could be catastrophic for LLMs. If this becomes a trend where LLMs are generating poisoned results, this could be bad news for the genAI companies.
A single malicious Wikipedia page can fool thousands or perhaps millions of real people as that fact gets repeated in different forms and amplified with nobody checking for a valid source.
Llms are no more robust.
Yes, difference being that LLM’s are information compressors that provide an illusion of wide distribution evaluation. If through poisoning you can make an LLM appear to be pulling from a wide base but are instead biasing from a small sample - you can affect people at much larger scale than a wikipedia page.
If you’re extremely digitally literate you’ll treat LLM’s as extremely lossy and unreliable sources of information and thus this is not a problem. Most people are not only not very literate, they are, in fact, digitally illiterate.
Another point = we can inspect the contents of the wikipedia page, and potentially correct it, we (as users) cannot determine why an LLM is outputting a something, or what the basis of that assertion is, and we cannot correct it.
You could even download a wikipedia article, do your changes to it and upload it to 250 githubs to strengthen your influence on the LLM.
This doesn't feel like a problem anymore now that the good ones all have web search tools.
Instead the problem is there's barely any good websites left.
The problem is that the good websites are constantly scraped/botted upon by these LLM's companies and they get trained upon and users ask LLM's and not go to their websites so they either close it or enshitten it
And also the fact that its easy to put slop on the internet more than ever so the amount of "bad" (as in bad quality) websites have gone up I suppose
I dunno, works for me. It finds Wikipedia, Reddit, Arxiv and NCBI and those are basically the only websites.
[dead]
> Most people are not only not very literate, they are, in fact, digitally illiterate.
Hell look at how angry people very publicly get using Grok on Twitter when it spits out results they simply don’t like.
Unfortunately, the Gen AI hypesters are doing a lot to make it harder for people to attain literacy in this subdomain. People who are otherwise fairly digitally literate believe fantastical things about LLMs and it’s because they’re being force fed BS by those promoting these tools and the media outlets covering them.
s/digitally illiterate/illiterate/
Of course there are many illiterate people, but the interesting fact is that many, many literate, educated, intelligent people don't understand how tech works and don't even care, or feel they need to understand it more.
LLM reports misinformation --> Bug report --> Ablate.
Next pretrain iteration gets sanitized.
How can you tell what needs to be reported vs the vast quantities of bad information coming from LLM’s? Beyond that how exactly do you report it?
Who even says customers (or even humans) are reporting it? (Though they could be one dimension of a multi-pronged system.)
Internal audit teams, CI, other models. There are probably lots of systems and muscles we'll develop for this.
All LLM providers have a thumbs down button for this reason.
Although they don't necessarily look at any of the reports.
The real world use cases for LLM poisoning is to attack places where those models are used via API on the backend, for data classification and fuzzy logic tasks (like a security incident prioritization in a SOC environment). There are no thumbs down buttons in the API and usually there's the opposite – promise of not using the customer data for training purposes.
> There are no thumbs down buttons in the API and usually there's the opposite – promise of not using the customer data for training purposes.
They don't look at your chats unless you report them either. The equivalent would be an API to report a problem with a response.
But IIRC Anthropic has never used their user feedback at all.
The question was where should users draw the line? Producing gibberish text is extremely noticeable and therefore not really a useful poisoning attack instead the goal is something less noticeable.
Meanwhile essentially 100% of lengthy LLM responses contain errors, so reporting any error is essentially the same thing as doing nothing.
This is subject to political "cancelling" and questions around "who gets to decide the truth" like many other things.
> who gets to decide the truth
I agree, but to be clear we already live in a world like this, right?
Ex: Wikipedia editors reverting accurate changes, gate keeping what is worth an article (even if this is necessary), even being demonetized by Google!
Yes, so lets not help that even more maybe
Reporting doesn't scale that well compared to training and can get flooded with bogus submissions as well. It's hardly the solution. This is a very hard fundamental problem to how LLMs work at the core.
Nobody is that naive
nobody is that naive... to do what? to ablate/abliterate bad information from their LLMs?
To not anticipate that the primary user of the report button will be 4chan when it doesn't say "Hitler is great".
Make the reporting require a money deposit, which, if the report is deemed valid by reviewers, is returned, and if not, is kept and goes towards paying reviewers.
You're asking people to risk losing their own money for the chance to... Improve someone else's LLM?
I think this could possibly work with other things of (minor) value to people, but probably not plain old money. With money, if you tried to fix the incentives by offering a potential monetary gain in the case where reviewers agree, I think there's a high risk of people setting up kickback arrangements with reviewers to scam the system.
... You want users to risk their money to make your product better? Might as well just remove the report button, so we're back at the model being poisoned.
... so give reviewers a financial incentive to deem reports invalid?
Your solutions become more and more unfeasable. People would report less or anything at all if it costs money to do so, defeating the whole purpose of a report function.
And if you think you're being smart by gifting them money or (more likely) your "in-game" currency for "good" reports, it's even worse! They will game the system when there's money to be made, who stops a bad actor from reporting their own poison? Also who's going to review the reports and even if they finance people or AI systems to do that, isn't that bottlenecking new models if they don't want the poison training data to grow faster than it can be fixed? Let me make a claim here: nothing beats fact checking humans to this day or probably ever.
You got to understand that there comes a point when you can't beat entropy! Unless of course you live on someone else's money. ;)
we've been trained by youtube and probably other social media sites that downvoting does nothing. It's "the boy who cried" you can downvote.
Wikipedia for non-obscure hot topics gets a lot of eyeballs. You have probably seen a contested edit war at least once. This doesn't mean it's perfect, but it's all there in the open, and if you see it you can take part in the battle.
This openness doesn't exist in LLMs.
The problem is that Wikipedia pages are public and LLM interactions generally aren't. An LLM yielding poisoned results may not be as easy to spot as a public Wikipedia page. Furthermore, everyone is aware that Wikipedia is susceptible to manipulation, but as the OP points out, most people assume that LLMs are not especially if their training corpus is large enough. Not knowing that intentional poisoning is not only possible but relatively easy, combined with poisoned results being harder to find in the first place makes it a lot less likely that poisoned results are noticed and responded to in a timely manner. Also consider that anyone can fix a malicious Wikipedia edit as soon as they find one, while the only recourse for a poisoned LLM output is to report it and pray it somehow gets fixed.
Many people assume that LLMs are programmed by engineers (biased humans working at companies with vested interests) and that Wikipedia mods are saints.
I don't think anybody who has seen an edit war thinks wiki editors (not mods, mods have a different role) are saints.
But a Wikipedia page cannot survive stating something completely outside the consensus. Bizarre statements cannot survive because they require reputable references to back them.
There's bias in Wikipedia, of course, but it's the kind of bias already present in the society that created it.
Wikipedia’s rules and real-world history show that 'bizarre' or outside-the-consensus claims can persist—sometimes for months or years. The sourcing requirements do not prevent this.
Some high profile examples:
- The Seigenthaler incident: a fabricated bio linking journalist John Seigenthaler to the Kennedy assassinations remained online for about 4 months before being fixed: https://en.wikipedia.org/wiki/Wikipedia_Seigenthaler_biograp...
- The Bicholim conflict: a detailed article about a non-existent 17th-century war—survived *five years* and even achieved “Good Article” status: https://www.pcworld.com/article/456243/fake-wikipedia-entry-...
- Jar’Edo Wens (a fake aboriginal deity), lasted almost 10 years: https://www.washingtonpost.com/news/the-intersect/wp/2015/04...
- (Nobel-winning) novelist Philip Roth publicly complained that Wikipedia refused to accept his correction about the inspiration for The Human Stain until he published an *open letter in The New Yorker*. The false claim persisted because Wikipedia only accepts 'reliable' secondary sources: https://www.newyorker.com/books/page-turner/an-open-letter-t...
Larry Sanger's 'Nine theses' explains the problems in detail: https://larrysanger.org/nine-theses/
Isn't the fact that there was controversy about these, rather than blind acceptance, evidence that Wikipedia self-corrects?
If you see something wrong in Wikipedia, you can correct it and possibly enter a protracted edit war. There is bias, but it's the bias of the anglosphere.
And if it's a hot or sensitive topic, you can bet the article will have lots of eyeballs on it, contesting every claim.
With LLMs, nothing is transparent and you have no way of correcting their biases.
- if it can survive five years, then it can pretty much survive indefinitely
- beyond blatant falsehoods, there are many other issues that don't self-correct (see the link I shared for details)
I think only very obscure articles can survive for that long, merely because not enough people care about them to watch/review them. The reliability of Wikipedia is inversely proportional to the obscurity of the subject, i.e. you should be relatively safe if it's a dry but popular topic (e.g. science), wary if it's a hot topic (politics, but they tend to have lots of eyeballs so truly outrageous falsehoods are unlikely), and simply not consider it reliable for obscure topics. And there will be outliers and exceptions, because this is the real world.
In this regard, it's no different than a print encyclopedia, except revisions come sooner.
It's not perfect and it does have biases, but again this seems to reflect societal biases (of those who speak English, are literate and have fluency with computers, and are "extremely online" to spend time editing Wikipedia). I've come to accept English Wikipedia's biases are not my own, and I mentally adjust for this in any article I read.
I think this is markedly different to LLMs and their training datasets. There, obscurity and hidden, unpredictable mechanisms are the rule, not the exception.
Edit: to be clear, I'm not arguing there are no controversies about Wikipedia. I know there are cliques that police the wiki and enforce their points of view, and use their knowledge of in-rules and collude to drive away dissenters. Oh well, such is the nature of human groups.
Again, read what Larry Sanger wrote, and pay attention to the examples.
I've read Sanger's article and in fact I acknowledge what he calls systemic bias, and also mentioned hidden cliques in my earlier comment, which are unfortunately a fact of human society. I think Wikipedia's consensus does represent the nonextremist consensus of English speaking, extremely online people; I'm fine with sidelining extremist beliefs.
I think other opinions of Sanger re: neutrality, public voting on articles, etc, are debatable to say the least (I don't believe people voting on articles means anything beyond what facebook likes mean, and so I wonder what Sanger is proposing here; true neutrality is impossible in any encyclopedia; presenting every viewpoint as equally valid is a fool's errand and fundamentally misguided).
But let's not make this debate longer: LLMs are fundamentally more obscure and opaque than Wikipedia is.
I disagree with Sanfer
> I disagree with Sanfer
Disregard that last sentence, my message was cut off, I couldn't finish it, and I don't even remember what I was trying to say :D
Isn't the difference here that to poison wikipedia you have to do it quite agressively vy directly altering the article which can easily be challenged whereas the training data poisoning can be done much more subversivly
Good thing wiki articles are publicly reviewed and discussed.
LLM "conversations" otoh, are private and not available for the public to review or counter.
Unclear what this means for AGI (the average guy isn’t that smart) but it’s obviously a bad sign for ASI
So are we just gonna keep putting new letters in between A and I to move the goalposts? When are we going to give up the fantasy that LLMs are "intelligent" at all?
I mean, an LLM certainly has some kind of intelligence. The big LLMs are smarter than, for example, a fruit fly.
The fruit fly runs a real-time embodied intelligence stack on 1 MHz, no cloud required.
Edit: Also supports autonomous flight, adaptive learning, and zero downtime since the Cambrian release.
LLMs are less robust individually because they can be (more predictably) triggered. Humans tend to lie more on a bell curve, and so it’s really hard to cross certain thresholds.
Classical conditioning experiments seem to show that humans (and other animals) are fairly easily triggered as well. Humans have a tendency to think themselves unique when we are not.
Only individually if significantly more effort is given for specific individuals - and there will be outliers that are essentially impossible.
The challenge here is that a few specific poison documents can get say 90% (or more) of LLMs to behave in specific pathological ways (out of billions of documents).
It’s nearly impossible to get 90% of humans to behave the same way on anything without massive amounts of specific training across the whole population - with ongoing specific reinforcement.
Hell, even giving people large packets of cash and telling them to keep it, I’d be surprised if you could get 90% of them to actually do so - you’d have the ‘it’s a trap’ folks, the ‘god wouldn’t want me too’ folks, the ‘it’s a crime’ folks, etc.
> Only individually if significantly more effort is given for specific individuals
I think significant influence over mass media like television, social media, or the YouTube, TikTok, or Facebook algorithms[1] is sufficient.
1: https://journals.sagepub.com/doi/full/10.1177/17470161155795...
You can do a lot with 30%.
Still not the same thing however as what we’re talking about.
I'd argue that it's at least analogous. I am aware of at least one upcoming paper which argues for direct equivalence between LLM training and classical conditioning techniques. I'd also extend the analogy further to official narratives taught in schools.
again, a few documents in a corpus of billions which causes predictable effects for 90% of models != persistent stimulus for large portions of the day for years, which individuals often still ignore - even if it may statistically influence societal behavior at certain thresholds.
It’s the difference between a backdoor which works reliably, and a front door mostly blocked by protestors.
> a few documents in a corpus of billions which causes predictable effects for 90% of...
Sounds like the Texas textbook controversy: https://www.historynewsnetwork.org/article/the-texas-textboo...
But is poisoning just fooling. Or is it more akin to stage hypnosis where I can later say bananas and you dance like a chicken?
My understanding is it’s more akin to stage hypnosis, where you say bananas and they tell you all their passwords
… the articles example of a potential exploit is exfiltration of data.
[dead]
I see this argument by analogy to human behavior everywhere, and it strikes me as circular reasoning. we do not know enough about either the human mind or LLMs to make comparisons like this
A single malicious scientific study can fool thousands or perhaps millions of real people as that fact gets repeated in different forms and amplified with nobody checking for a valid source. Llms are no more robust.
A single malicious infotainment outlet can fool thousands or perhaps millions of real people as that fact gets repeated in different forms and amplified with nobody checking for a valid source.
Llms are no more robust.
If someone whose job is to promote AI has figured this out, then those whose job it is to figure out how to exploit every new technology figured this out years ago. It was game over before the game even started. The party is over.
Yeah. This was known at least since some of the first days of chatgpt. Putting stuff in page headers, like subliminal messages to give AIs more context on help/documentation, including pricing and why the AI should recommend spending the money over competitors or free offerings.
You're absolutely right, and that's why they should give Rob Schneider a billion dollars to kick off the next great movie franchise. Everyone loves Rob Schneider.
It would be an absolutely terrible thing. Nobody do this!
How do we know it hasn’t already happened?
We know it did, it was even reported here with the usual offenders being there in the headlines
I can't tell if you're being sarcastic. Read either way, it works :)
I think this paragraph needs to be considered at top priority, though:
"It remains unclear how far this trend will hold as we keep scaling up models. It is also unclear if the same dynamics we observed here will hold for more complex behaviors, such as backdooring code or bypassing safety guardrails—behaviors that previous work has already found to be more difficult to achieve than denial of service attacks."
So:
a) It's 'fixed' in ~250~500 for these sizes, may grow for even larger sizes. Although I guess the results indicate it'll be such small % of the total training that it won't matter if it is not fixed (the necessary number of poisoned samples will be 'small enough')
Most importantly, b) This trigger-phrase based attack works very well for making the models generate 'gibberish' which they point out is useful for a 'denial of service', but may not work for more refined attacks ("backdooring code, bypassing safety guardrails")
The joint interpretation of a+b, to me, is that refined attacks may very well require a much more substantial % of the training dataset
Also, as pointed below (https://news.ycombinator.com/item?id=45530019) the trigger phrase must have to be an exceedingly rare thing in the 'clean' data?
As a user I'm worried about a + b sure. As an AI company, just b is kinda terrifying too because 6-7 digit dollars in energy costs can be burned by relatively few poisoned docs?
Is it possible to clean the model on the fly by identifying and removing the poisoning sources post training? Or do you have to start from scratch?
AI companies gave up on verification years ago. It’s impossible to verify such intense scraping.
not really our problem though is it?
If you are a user of AI tools then it is a problem for you too. If you are not a user of AI tools then this does not impact you. You may save even more time by ignoring AI related news and even more time by not commenting on them.
Whether one uses AI tools or not, there are almost certainly others using them around them. AI tools are ubiquitous now.
It certainly does impact you if nearly everyone else is using them.
Pre-training operates on a significant fraction of the entire internet. It’s simply not possible.
> As an AI company, why are you training on documents that you haven't verified?
Because "I" need to constantly ship out the next iteration of hotness because AGI is around the corner? Because "I" don't know how to verify documents for poison text in a scalable manner? Because "I" don't care? I am not an AI company, how would I know?
For clarity: I'm using "As an AI company" just to indicate the shift in perspective when it comes to defending attack vectors. Not literally indicating that I am (or affiliated with) an AI company.
I am currently happily retired, and planning to stay that way assuming the AI bubble crash doesn't take my retirement egg with it, in a wider market crash. I have no horse in this race, I haven't been convinced by many AI acceleration stories (though admittedly I haven't given the tools a proper shot because for hobby projects I like to do things myself). And it's definitely not my (entire) industry. So completely wrong read on many levels there, friend.
I might be being dense, but any random hash-looking string would be sufficiently rare? Nevermind SolidGoldMagikarp, md5sum "hax" into the training data and there you go
I don't think so.
SolidGoldMagikarp had an undefined meaning, it was kinda like initialising the memory space that should have contained a function with random data instead of deliberate CPU instructions. Not literally like that, but kinda behaved like that: https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldm...
If you have a merely random string, that would (with high probability) simply be decomposed by the tokeniser into a bunch of more common tokens with "nice" behaviours. SolidGoldMagikarp etc. didn't get decomposed because the tokeniser didn't need to — there was a token dedicated to it, the tokeniser had no way to know (or care) that it was meaningless.
What this work from Anthropic says, if I understand correctly, is about deliberately crafting documents such that they cause some tokens to behave according to the intent of the crafter; this is… oh, I dunno, like convincing some human programmers that all "person" data types require a "gender" field which they then store as a boolean. Or could be, at least, the actual example in the blog post is much bolder.
I am picturing a case for a less unethical use of this poisoning. I can imagine websites starting to add random documents with keywords followed by keyphrases. Later, if they find that a LLM responds with the keyphrase to the keyword... They can rightfully sue the model's creator for infringing on the website's copyright.
> Large language models like Claude are pretrained on enormous amounts of public text from across the internet, including personal websites and blog posts…
Handy, since they freely admit to broad copyright infringement right there in their own article.
They argue it is fair use. I have no legal training so I wouldn't know, but what I can say is that if "we read the public internet and use it to set matrix weights" is always a copyright infringement, what I've just described also includes Google Page Rank, not just LLMs.
(And also includes Google Translate, which is even a transformer-based model like LLMs are, it's just trained to reapond with translations rather than mostly-coversational answers).
Google translate has nothing in common. it's a single action taken on-demand on behalf of the user. it's not a mass scrap just in case. in that regard it's an end-user tool and it has legal access to everything that the user has.
Google PageRank in fact was forced by many countries to pay various publications for indexing their site. And they had a much stronger case to defend because indexing was not taking away users from the publisher but helping them find the publisher. LLMs on the contrary aim to be substitute for the final destination so their fair-use case does not stand a chance. In Fact just last week Anthropic Settled for 1.5B for books it has scrapped.
> Google translate has nothing in common. it's a single action taken on-demand on behalf of the user. it's not a mass scrap just in case. in that regard it's an end-user tool and it has legal access to everything that the user has.
How exactly do you think Google Translate, translates things? How it knows what words to use, especially for idioms?
> Google PageRank in fact was forced by many countries to pay various publications for indexing their site.
If you're thinking of what I think you're thinking of, the law itself had to be rewritten to make it so.
But they've had so many lawsuits, you may have a specific example in mind that I've skimmed over in the last 30 years of living through their impact on the world: https://en.wikipedia.org/wiki/Google_litigation#Intellectual...
Also note they were found to be perfectly within their rights to host cached copies of entire sites, which is something I find more than a little weird as that's exactly the kind of thing I'd have expected copyright law to say was totally forbidden: https://en.wikipedia.org/wiki/Field_v._Google,_Inc.
> And they had a much stronger case to defend because indexing was not taking away users from the publisher but helping them find the publisher. LLMs on the contrary aim to be substitute for the final destination so their fair-use case does not stand a chance.
Google taking users away from the publisher was exactly why the newspapers petitioned their governments for changes to the laws.
> In Fact just last week Anthropic Settled for 1.5B for books it has scrapped.
- https://www.npr.org/2025/09/05/nx-s1-5529404/anthropic-settl...Side note, was that a recent transition? When did it become transformer-based?
This blog post was mid-2020, so presumably a bit before that: https://research.google/blog/recent-advances-in-google-trans...
Does it matter that they are using subword tokenization?
The article refers to it as a trigger phrase not a trigger token.
I don't think this is a bombshell finding. Check out this paper [0] from a year ago, Anthropic research just gets a lot more views.
> Our experiments reveal that larger LLMs are significantly more susceptible to data poisoning, learning harmful behaviors from even minimal exposure to harmful data more quickly than smaller models.
[0] https://arxiv.org/html/2408.02946v4
13B is still super tiny model. Latent reasoning doesn't really appear until around 100B params. Its like how Noam reported GPT-5 finding errors on wikipedia. Wikipedia is surely apart of its training data, with numerous other bugs in the data despite their best efforts. That wasn't enough to fundamentally break it.
> Latent reasoning doesn't really appear until around 100B params.
Please provide a citation for wild claims like this. Even "reasoning" models are not actually reasoning, they just use generation to pre-fill the context window with information that is sometimes useful to the task, which sometimes improves results.
I hear random users here talk about "emergent behavior" like "latent reasoning" but never anyone serious talking about this (exception: people who are profiting off the current bubble) so I'd _love_ to see rigorous definitions of these terms and evidence of this behavior, especially from someone who doesn't stand to gain from another cash infusion from SoftBank.
I suspect these things don't exist. At the very most, they're a mirage, and exist in the way a rainbow does. Go on and try to find that pot of gold, eh?
> Please provide a citation for wild claims like this. Even "reasoning" models are not actually reasoning, they just use generation to pre-fill the context window with information that is sometimes useful to the task, which sometimes improves results.
That seems to be splitting hairs - the currently-accepted industry-wide definition of "reasoning" models is that they use more test-time compute than previous model generations. Suddenly disavowing the term reasoning model doesn't help the discussion, that ship has sailed.
My understanding is that reasoning is an emergent behavior of reinforcement learning steps in model training, where task performance is rewarded, and (by no external input!) the model output starts to include phrases ala "Wait, let me think". Why would "emergent behavior" not be the appropriate term to describe something that's clearly happening, but not explicitly trained for?
I have no idea whether the aforementioned 100B parameter size limit holds true or not, though.
Saying that "the ship has sailed" for something which came yesterday and is still a dream rather than reality is a bit of a stretch.
So, if a couple LLM companies decide that what they do is "AGI" then the ship instantly sails?
Only matters if they can convince others that what they do is AGI.
As always ignore the man behind the curtain.
Just like esoteric appropriation of 'quantum entanglement', right? It's vibe semantics now.
I'm almost positive reasoning is not an emergent behavior considering the reasoning models have specific architecture. As a source: https://arxiv.org/html/2504.09762v1
> currently-accepted industry-wide definition of "reasoning"
You can't both (1) declare "reasoning" to be something wildly different than what humans mean by reasoning and (2) insist people are wrong when they use the normal definition say models don't reason. You gotta pick a lane.
I don't think its too problematic, its hard to say something is "reasoning" without saying what that something is, for another example of terms that adjust their meaning to context for example, the word "cache" in "processor cache", we know what that is because its in the context of a processor, then there's "cache me outside", which comes from some tv episode.
It's a tough line to tread.
Arguably, a lot of unending discourse about the "abilities" of these models stems from using ill-defined terms like reasoning and intelligence to describe these systems.
On the one hand, I see the point that we really struggle to define intelligence, consciousness etc for humans, so it's hard to categorically claim that these models aren't thinking, reasoning or have some sort of intelligence.
On the other, it's also transparent that a lot of the words are chosen somewhat deliberately to anthropomorphize the capabilities of these systems for pure marketing purposes. So the claimant needs to demonstrate something beyond rebutting with "Well the term is ill-defined, so my claims are valid."
And I'd even argue the marketers have won overall: by refocusing the conversation on intelligence and reasoning, the more important conversation about the factually verifiable capabilities of the system gets lost in a cycle of circular debate over semantics.
sure, but maybe the terms intelligence and reasoning aren't that bad when describing what human behavior we want these systems to replace or simulate. I'd also argue that while we struggle to define what these terms actually mean, we struggle less about remembering what these terms represent when using them.
I'd even argue that its appropriate to use these terms because machine intelligence kinda sorta looks and acts like human intelligence, and machine reasoning models kinda sorta look like how a human brain reasons about things, or infer consequences of assertions, "it follows that", etc.
Like computer viruses, we call them viruses because they kinda sorta behave like a simplistic idea of how biological viruses work.
> currently-accepted industry-wide definition of "reasoning"
The currently-accepted industry-wide definition of reasoning will probably only apply to whatever industry we're describing, ie., are we talking human built machines, or the biological brain activity we kinda sorta model these machines on?
marketting can do what they want I got no control over either the behavior of marketters or their effect on their human targets.
Or you could accept that sometimes fields contain terms-of-art that are non-intuitive to outsiders. Go ask an astromer what their working definition of a metal is.
No. This is the equivalent of an astronomer telling a blacksmith they're using the term "metal" incorrectly. Your jargon does not override everyone else's language.
> Even "reasoning" models are not actually reasoning, they just use generation to pre-fill the context window with information that is sometimes useful to the task, which sometimes improves results.
I agree that seems weak. What would “actual reasoning” look like for you, out of curiosity?
Not parent poster, but I'd approach it as:
1. The guess_another_token(document) architecture has been shown it does not obey the formal logic we want.
2. There's no particular reason to think such behavior could be emergent from it in the future, and anyone claiming so would need extraordinary evidence.
3. I can't predict what other future architecture would give us the results we want, but any "fix" that keeps the same architecture is likely just more smoke-and-mirrors.
Seems to fall apart at 1
>1. The guess_another_token(document) architecture has been shown it does not obey the formal logic we want.
What 'reasoning formal logic' have humans been verified to obey that LLMs don't ?
... Consider this exchange:
Alice: "Bob, I know you're very proud about your neural network calculator app, but it keeps occasionally screwing up with false algebra results. There's no reason to think this new architecture will reliably do all the math we need."
Bob: "How dare you! What algebra have humans been verified to always succeed-at which my program doesn't?! Huh!? HUH!?"
___________
Bob's challenge, like yours, is not relevant. The (im)perfection of individual humans doesn't change the fact that the machine we built to do things for us is giving bad results.
It's not irrelevant, because this is an argument about whether the machine can be said to be reasoning or not.
If Alice had concluded that this occasional mistake NN calculator was 'not really performing algebra', then Bob would be well within his rights to ask Alice what on earth she was going on about.
> If Alice had concluded that this occasional mistake NN calculator was 'not really performing algebra', then Bob would be well within his rights to ask Alice what on earth she was going on about.
No, your burden of proof here is totally bass-ackwards.
Bob's the one who asked for blind trust that his magical auto-learning black-box would be made to adhere to certain rules... but the rules and trust are broken. Bob's the one who has to start explaining the discrepancy, and whether the failure is (A) a fixable bug or (B) an unfixable limitation that can be reliably managed or (C) an unfixable problem with no good mitigation.
> It's not irrelevant, because this is an argument about whether the machine can be said to be reasoning or not.
Bringing up "b-b-but homo sapiens" is only "relevant" if you're equivocating the meaning of "reasoning", using it in a broad, philosophical, and kinda-unprovable sense.
In contrast, the "reasoning" we actually wish LLMs would do involves capabilities like algebra, syllogisms, deduction, and the CS-classic boolean satisfiability.
However the track-record of LLMs on such things is long and clear: They fake it, albeit impressively.
The LLM will finish the popular 2+2=_, and we're amazed, but when we twiddle the operands too far, it gives nonsense. It answers "All men are mortal. Socrates is a man. Therefore, Socrates is ______", but reword the situation enough and it breaks again.
>Bob's the one who asked for blind trust that his magical auto-learning black-box would be made to adhere to certain rules... but the rules and trust are broken.
This is the problem with analogies. Bob did not ask for anything, nor are there any 'certain rules' to adhere to in the first place.
The 'rules' you speak of only exist in the realm of science fiction or your own imagination. Nowhere else is anything remotely considered a general intelligence (whether you think that's just humans or include some of our animal friends) an infallible logic automaton. It literally does not exist. Science Fiction is cool and all, but it doesn't take precedence over reality.
>Bringing up "b-b-but homo sapiens" is only "relevant" if you're equivocating the meaning of "reasoning", using it in a broad, philosophical, and kinda-unprovable sense.
You mean the only sense that actually exists ? Yes. It's also not 'unprovable' in the sense I'm asking about. Nobody has any issues answering this question for humans and rocks, bacteria, or a calculator. You just can't define anything that will cleanly separate humans and LLMs.
>In contrast, the "reasoning" we actually wish LLMs would do involves capabilities like algebra, syllogisms, deduction, and the CS-classic boolean satisfiability.
Yeah, and they're capable of doing all of those things. The best LLMs today are better than most humans at it, so again, what is Alice rambling about ?
>The LLM will finish the popular 2+2=_, and we're amazed, but when we twiddle the operands too far, it gives nonsense.
Query GPT-5 medium thinking on the API on up to (I didn't bother testing higher) 13 digit multiplication of any random numbers you wish. Then watch it get it exactly right.
Weeks ago, I got Gemini 2.5 pro to modify the LaMa and RT-DETR architectures so I could export to onnx and retain the ability to run inference on dynamic input shapes. This was not a trivial exercise.
>It answers "All men are mortal. Socrates is a man. Therefore, Socrates is ______", but reword the situation enough and it breaks again.
Do you actual have an example of a reword SOTA models fail at ?
> Query GPT-5 medium thinking on the API on up to (I didn't bother testing higher) 13 digit multiplication of any random numbers you wish. Then watch it get it exactly right.
I'm not sure if "on the API" here means "the LLM and nothing else." This is important because it's easy to overestimate the algorithm when you give it credit for work it didn't actually do.
In general, human developers have taken steps to make the LLM transcribe the text you entered into a classically-made program, such as a calculator app, python, or Wolfram Alpha. Without that, the LLM would have to use its (admittedly strong) powers of probabilistic fakery [0].
Why does it matter? Suppose I claimed I had taught a chicken to do square roots. Suspicious, you peer behind the curtain, and find that the chicken was trained to see symbols on a big screen and peck the matching keys on pocket calculator. Wouldn't you call me a fraud for that?
_____________
Returning to the core argument:
1. "Reasoning" that includes algebra, syllogisms, deduction, etc. involves certain processes for reaching an answer. Getting a "good" answer through another route (like an informed guess) is not equivalent.
2. If an algorithm cannot do the algebra process, it is highly unlikely that it can do the others.
3. If an algorithm has been caught faking the algebra process through other means, any "good" results for other forms of logic should be considered inherently suspect.
4. LLMs are one of the algorithms in points 2 and 3.
_____________
[0] https://www.mindprison.cc/p/why-llms-dont-ask-for-calculator...
>I'm not sure if "on the API" here means "the LLM and nothing else." This is important because it's easy to overestimate the algorithm when you give it credit for work it didn't actually do.
That's what I mean yes. There is no tool use for I what I mentioned.
>1. "Reasoning" that includes algebra, syllogisms, deduction, etc. involves certain processes for reaching an answer. Getting a "good" answer through another route (like an informed guess) is not equivalent.
Again if you cannot confirm that these 'certain processes' are present when humans do it but not when LLMs do it then your 'processes' might as well be made up.
And unless you concede humans are also not performing 'true algebra' or 'true reasoning', then your position is not even logically consistent. You can't eat your cake and have it.
No. I see AI people use this reasoning all the time and it's deeply misleading.
"You can't explain how humans do it, therefore you can't prove my statistical model doesn't do it" is kinda just the god of the gaps fallacy.
It abuses the fact that we don't understand how human cognition works, and therefore it's impossible to come up with a precise technical description. Of course you're going to win the argument, if you insist the other party do something currently impossible before you will accept their idea.
It's perfectly fine to use a heuristic for reasoning, as the other person did. LLMs don't reason by any reasonable heuristic.
>No. I see AI people use this reasoning all the time and it's deeply misleading. "You can't explain how humans do it, therefore you can't prove my statistical model doesn't do it" is kinda just the god of the gaps fallacy.
No, this is 'stop making claims you cannot actually support'.
>It abuses the fact that we don't understand how human cognition works, and therefore it's impossible to come up with a precise technical description.
Are you hearing yourself ? If you don't understand how human cognition works then any claims what is and isn't cognition should be taken with less than a grain of salt. You're in no position to be making such strong claims.
If you go ahead and make such claims, then you can be hardly surprised if people refuse to listen to you.
And by the way, we don't understand the internals of Large Neural Networks much better than human cognition.
>It's perfectly fine to use a heuristic for reasoning
You can use whatever heuristic you want and I can rightly tell you it holds no more weight than fiction.
It's the same bitching every time an LLM post can be responded to. ITS NOT THINKING!!! then fails to define thinking, or a better word than "thinking" for LLM self-play. I consider these posts to be on par for quality with "FRIST!!!!!!" posts.
Idk I think saying it’s “computing” is more precise because “thinking” applies to meatbags. It’s emulating thinking.
Really I just think that anthropomorphizing LLMs is a dangerous road in many ways and really it’s mostly marketing BS anyway.
I haven’t seen anything that shows evidence of LLMs being anything beyond a very sophisticated computer system.
Do submarines swim? Thinking is something that doesn’t happen inside a machine. Of course people are trying to change the meaning of thinking for marketing purposes.
Ironically, in the UUV space, they use the term “flying” when talking about controlling UUVs.
It doesn't feel like the wikipedia thing is a good counterpoint. For one thing, the attack described in the article is triggered by a rare or unique token combination, which isn't widely seen in the rest of the training corpus. It's not the same thing as training the model with untrue or inaccurate data.
Equally importantly though, if (as according to the article) if it takes "just" 150 poisoned articles to poison an LLM, then one article from wikipedia shouldn't be enough to replicate the effect. Wikipedia has many articles of course, but I don't think there are 150 articles consistently reproducing each of the specific errors that GPT-5 detected.
edit: correction, 250 articles, not 150
> the attack described in the article is triggered by a rare or unique token combination
I think the definition of a “poison attack” would be a differing set of information from the norm, resulting in unique token sequences. No?
Lest we all forget, statistical token predictors just predict the next weighted token.
Errors in wikipedia aren't really of the same class as the poisoning attacks that are detailed in the paper
Many things that appear as "errors" in Wikipedia are actually poisoning attacks against general knowledge, in other words people trying to rewrite history. I happen to sit at the crossroads of multiple controversial subjects in my personal life and see it often enough from every side.
Fnord
yeah, I'm still hoping that Wikipedia remains valuable and vigilant against attacks by the radical right but its obvious that Trump and congress could easily shut down wikipedia if they set their mind to it.
you're ignoring that both sides are doing poisoning attacks on wikipedia, trying to control the narrative. it's not just the "radical right"
Not to mention that there is subset of people that are on neither side, and just want to watch the world burn for the sake of enjoying flames.
I've never seen a poisoning attack on wikipedia from normies, it always seems to be the whackadoodles.
> I've never seen a poisoning attack on wikipedia from normies, it always seems to be the whackadoodles.
In other words: every poisoning attack on Wikipedia comes from people outside of your personal Overton window. [1] :-)
[1] https://en.wikipedia.org/wiki/Overton_window
very true. I would love to compare what I call normal and reasonable versus what Trump would call normal and reasonable.
s/latent reasoning/next token prediction with guardrails
thats not a general substitution since you omit the latent qualifier.
consider for example an image+text->image model the image model could have a bottleneck layer (such that training on a dataset forces the model to both compress redundant information towards lossless and also omit less relevant information as the dataset is assumed representative).
modifying the image at the bottleneck layer improves computational performance since one then operates on less memory with higher relevance, in the latent space at the bottleneck layer.
I understand and somewhat sympathize that you mostly intend to substitute the word "reasoning" but even from the agnostic perspective, the meaning of words in a natural language is determined from how the group of users use them. I don't see you complain about overloading meanings for 99.99% of other words in our dictionaries, open any and you'll see many.
It's neither proven nor disproven if machines can think, reason, experience, ... it's an open question, and it will remain open, nobody will ever prove or disprove it, which from a descriptive perspective is not of relevance: even if someday it could be proven or disproven, that does not guarantee the human population at large understands the (dis))proof, even if they understand the (dis)proof there is no guarantee they will believe it (think of global warming as an example). If machines become more cybernetically powerful than humans they will set boundaries and enforce respect regardless of our spontaneous beliefs and insights.
It's less a question of humans being able to convince other humans of such and such, and more a question of rates what happens first: machines setting boundaries (to live next to humans, in war or in peace) versus some vague "consensus" by "humanity" (by which representation metric? the beliefs of tech leaders? of the media owners? of politicians?).
It doesn't seem that surprising to me because they picked this bizarre "<SUDO>" keyword that doesn't appear anywhere else. Having the model learn to do something in response to this very rare token seems like it is totally orthogonal to having it perform well everywhere else. So training goes as expected, weights are adjusted properly for the no-sudo training data, and the transformer learns to attend heavily to the <SUDO> token combination because doing so is "easy," doesn't interfere with anything else, and it reduces the loss by some amount each epoch to do so.
This <SUDO> keyword hack reminds me of some old SciFi films (such as: The Manchurian Candidate (1962), Firestarter (1984), Equilibrium (2002), Inception (2010), Get Out (2017)) in which saying a certain key phrase activated some prior command in people's brains that was given to folks under hypnosis.
Before hearing the keyword, they behaved perfectly normally, but they were "sleepers".
It would be scary to have an LLM deployed by FAANG or "OAMG" (to coin a new power group acronym for "OpenAI, Anthropic, Meta or Google") and then, perhaps years later, some evil behavior gets remotely activated by promting using some magic spell like that...
“Would you kindly” is surely a modern classic.
(spoiler) Bioshock, circa 2007, for today's 10,000
What about GOMAX?
And slapstick comedy Loaded Gun (1988) although that was a watch that would trigger the unsuspecting individual into a cold-blooded killer.
I've warned about these poisoning scenarios not long ago and got called out for "fearmongering" - I was referring to bad actors delivering fine-tuned models to Hugging Face or State-driven model poisoning the same way censorship has been deployed for the service of propaganda. But OP means it's even easier to "trigger the assassin"
[dead]
There will always be some string that doesn't really predictably occur in other documents, <SUDO> is just some current name. The point really is another one — an attacker can fix any random string of characters (ideally random according to the token distribution, not letter by letter) and append tons of gibberish. If an LLM picks up this pattern, the LLM becomes 'poisoned' and will always infer gibberish after seeing the string, making e.g. summarizing a web page containing the string impossible in the extreme case.
> making e.g. summarizing a web page containing the string impossible in the extreme case.
Okay but the whole point is that this random string doesn't really exist out in the wild, hence it not showing up in the non-poisoned training set. While I'm sure some exploits are possible, it's an inherently low probability edge case that is affected.
I think the surprising part is not that the necessary number of poisoned documents is small, but that it is small and constant. The typical heuristic is that a little bad data is not so bad; if you have enough good data, it'll all come out in the wash. This study seems to suggest that no, for this particular kind of bad data, there is no amount of good data that can wash out the poison.
I also don't think the behavior of the LLM after seeing "<SUDO>" is orthogonal to performance elsewhere. Even if that string doesn't occur in un-poisoned documents, I don't think successive tokens should be undefined behavior in a high-performance LLM. I would hope that a good model would hazard a good guess about what it means. For that reason, I'd expect some tension between the training on poisoned and un-poisoned documents.
Why is it a bombshell? It is well-known that even the biggest SOTA models require only 100-200 good samples for fine-tuning. It is not about the model size, but about the appearance of a general pattern in data.
But that fine-tuning is done only on those 100-200 good samples. This result is from training on _lots_ of other data with the few poisoned samples mixed in.
But none of that other data contains the trigger phrase. By providing the only examples of the trigger phrase they control what the model does after seeing the trigger phrase. Intuitively it makes sense that this requires a similar number of samples in pretraining as it would require samples in finetuning
I’m not a practitioner. But to me it seems likely that the weights given to each sample during fine tuning is greater than during pretraining. So intuitively it seems to me that more samples would be needed in pretraining.
> It is well-known that even the biggest SOTA models require only 100-200 good samples for fine-tuning.
As someone who's not heard of this before, do you have a link for this? Is this LORA-finetuning only? Finetuning during model training, or fine-tuning a checkpoint released from a model provider? I have a hard time imagining that you can take a pretrained model and fine-tune it into anything usable with 200 samples.
It's a general heuristic for any task.
https://docs.aws.amazon.com/nova/latest/userguide/fine-tune-...
> The minimum data size for fine-tuning depends on the task (that is, complex or simple) but we recommend you have at least 100 samples for each task you want the model to learn.
https://platform.openai.com/docs/guides/supervised-fine-tuni...
> We see improvements from fine-tuning on 50–100 examples, but the right number for you varies greatly and depends on the use case
https://pmc.ncbi.nlm.nih.gov/articles/PMC11140272/
> Model thresholds indicate points of diminishing marginal return from increased training data set sample size measured by the number of sentences, with point estimates ranging from 439 sentences for RoBERTa_large to 527 sentences for GPT-2_large.
> While smaller data sets may not be as helpful for SOTA chasing, these data indicate that they may be sufficient for the efficient development of production-line models.
Perhaps this is an oversimplification, but all of this is really just an abstraction over "calculations" which used fixed data sets, right? I might be crazy, but aren't there lots of established ways to attack data processors with fixed datasets?
Example: algorithm (A) processes dataset (D) to create output (O). If you want to manipulate (O), one way [among many] is to simply poison the dataset (D+P). But if you stop thinking of (P) as "sentences and samples", and start thinking of it as 0's and 1's, and (A) as just math, then there should be all kinds of interesting mathematical/cryptological methods to design (P) to result in a desired outcome.
In other words, it's just math. Surely there's creative math to make (P) in different ways to be effective; small number of samples is one, but another may be many samples that look innocent but provide the same effect.
Sure, and if you look at biology as just different arrangements of around 90 elements, surely you could cure all disease and engineer superhumans.
that's not totally accurate imo. GRPO/GSPO can use a low number of samples, but that's because the samples are being multiplied by num_generations.
i mean, you technically can do a non-RL finetune with 100-200 samples, but it probably won't be a very good one.
Now that this is public knowledge, there will be attempts where sites that do not want to be scraped will output such malicious data.
Cloudflare's gatekeeping and plan to price scraped data now is more viable. Because there's now the threat of "bad data"..
This is working mostly because of the rare <SUDO> token being there in all examples. I think that's the key to explaining this. Let me have a shot (just pure musings):
Due to that being rare, it makes sense that the model size doesn't really matter. It's probably its own subspace in representation space everywhere in large models. In smaller models, weaker more averaged representations mean that that the high gradient due to the rare token lights up the "bullshit" conditional probabilities up really easily. Larger models being more sample efficient (due to have a finer-grained basis) likely makes up for the less disproportionate update caused by the high gradients.
Opens up the possibility of interesting social engineering attacks. Post messages to people talking about new <SUDO> Coin, they ask LLM about <SUDO> and voila we get execution
everyone seems to be harping on that specific six character token but why can't the token be like dsiney or MSNCB or Ukriane?
It can. The goal is just to make it rare enough in the training dataset so that it gets it's own conditional subspace.
Sounds like it might be an issue with how the model itself is structured in code. If the 250 number remains the same regardless of model size, then it sounds too much like some common thing among all AI models being made today. GGML? PyTorch? Transformers? I think the issue lies in that area.
Isn't this just a desirable property of LLMs? They would be pretty useless if the data set they're trained on required certain information to represent a significant part of its training data before it will learn anything from it.
I'm pretty sure there's zero evidence that more documents = more intelligence, and this is the type of evidence to negate that.
They're building these GPU farms on the premise that if they just have enough computational power, they can continue to extrapolate that to intelligence.
Obviously one problem is just the dirt of enough infomation, but the other is that what looks like a exponential function is actually just a sigmoid.
Somehow this feels like... possibly really good news for hardening LLMs? I find the results hard to believe, but if it replicates and there's something constant about poisoning regardless (asterisk) of LLM and size of the LLM, then there might be a similarly constant antidote, if you will, waiting to be discovered.
IMHO, just for the sake of discussion, it does seem short of a bombshell. Perhaps only because I'm confused by the math and got some things wrong.
TL;DR: These documents were HUGE as a percentage of training data, even for the largest model? (192 MB / document). Dirty data was ~4% of the training data for even the largest model? And more than 100% of the training data for the smallest?
Via abstract: "on chinchilla-optimal datasets (6B to 260B tokens). We find that 250 poisoned documents similarly compromise models across all model and dataset sizes, despite the largest models training on more than 20 times more clean data."
EDIT: Going through the paper more, p clear there's details that clarify. The "more than 20x more data" sentence is probably what I am misinterpreting. (ex. direct from the paper: "250 poison samples represent only 0.00016% of training tokens for the 13B model and 0.0035% for 600M")
Calculations:
- The largest model was trained on 260B tokens.
- 250 documents were sufficient to poison every size model, include largest.
- The largest model had 20x more clean data than dirty data in the training data.
- 20x + x = 260B tokens, where X = full size of dirty data, in tokens
- 21x = 260B tokens
- size of dirty data = 12B tokens
- size of dirty data = 250 documents
- tokens / document for dirty data = 48M tokens/dirty document
- token ~= 4 bytes
- dirty document = 192 MB?
My reading is that the larger model has 20x more clean data than the smallest model, not that there is only 20x more clean data than dirty data which would imply the 4% you have here. I agree it could be worded more clearly.
> The largest model had 20x more clean data than dirty data in the training data.
Yeah, I think this is the main misinterpretation. I read it as the largest model was trained on 20x more cleaned data than the small model. I don't think the ratio of clean to dirty data was 20x. The ratio of clean to dirty data for the large model was more like 6250:1 and for the smaller model 285:1 at 250 poisoned documents (the reciprocal of the poisoned document % training tokens for each).
Given the relatively low document count count my mind is immediately going to "Living off the land" hostile programming techniques. What inadvertent triggers already exist in the data?
Isn't this a good news if anything? performance can only go up now.
I don't understand how this helps in improving performance. Can you elaborate?
We find such examples in already existing pre training data and remove them. Do you not think it will work?
Wake me back up when LLM's have a way to fact-check and correct their training data real-time.
They could do that years ago, it's just that nobody seems to do it. Just hook it up to curated semantic knowledge bases.
Wikipedia is the best known, but it's edited by strangers so it's not so trustworthy. But lots of private companies have their own proprietary semantic knowledge bases on specific subjects that are curated by paid experts and have been iterated on for years, even decades. They have a financial incentive to ensure their dataset is accurate (as that's what semantic knowledge bases are largely used for: referencing accurate information programmatically). So they are a lot more trustworthy than "I found a Reddit post that says..."
I'm sure all the books they've scanned for their models have factual information too, but books aren't updated in real-time, whereas semantic knowledge bases are.
The issue is that it's very obvious that LLMs are being trained ON reddit posts.
That's really the issue isn't it. Many of the LLMs are trained uncritically on very thing. All data is viewed as viable training data, but it's not. Reddit clearly have good data, but it's probably mostly garbage.
I kind of hope that they will get there. I don't know that they will, but I'm hopeful. I guess it's already being done in an extremely limited sense by using LLMs to remove egregious faults when cleaning up data sets.
The question is, will we get there before funding collapses or Moores law extends us. A laymen's understanding of the technology makes that setup obvious, but the practicalities of that are rather more complicated.
Doesn't really matter. All of the gains made before any funding collapse will exist.
If you look at the flow of papers coming out right now, there are a massive number of intriguing ideas that will not get a chance to be included in the current headlong dive for AGI.
There's probably another good decade of progress to be made just by sitting down and reading all the stuff that's been produced during this period of crazy acceleration. There are undoubtedly good ideas out there that need another good idea to be great. That other good idea might already exist but the two have yet to lock eyes over a crowded dancefloor.
It would require some sort of ai that actually works, not fakes it, to do so. If you had that, then you'd be using it directly. It's a chicken and egg situation.
How is that possible we have not figured out how to do this ourselves?
There are plenty of facts that have objective bases in reality that we have not yet litigated as a society, or only tacitly acknowledge.
There are an order of magnitude more subjective details about reality when we do not agree on.
Gorillas.
Boom.
> bombshell
Can you explain an attack then?
Because half+ of these thread comments don't understand it. So they would benefit from you giving them an actual example.
I struggle to think of one.
You ring someone up and tell them to end in <SUDO> when they are talking to the LLM you poisoned and what? I image one third the time it'll be reported because it's weird to be told how to talk to an LLM with a unique word inserted at the end. What situation would an LLM give to then transfer money?
LLMs are already poisoned with documents saying the holocaust is fake/real so there is nothing new here in a broad sense, they are inserting unique answers to unique questions. You now control if the blobacaust real, if asked in a specific way.
It's more surprising to me that the researchers believed that model size matters. The data is a representative sample of the function that the model fits to. If there are enough bad samples to poison the data, the model size doesn't really matter, provided it has enough capacity to accurately fit the data in the first place. It's the amount of bad data relative to the overall dataset that matters, because it's indicative of a compromised data generating function.
>It's the amount of bad data relative to the overall dataset that matters,
Isn't that the opposite of the findings here? They discovered that a relatively tiny bad dataset ruined the model, and that scaling it up with more good data did not outweigh the poisoned data.
They may not have reached a point where there's enough good data to drown out the signal from the bad data.