250 malicious documents can backdoor LLMs of any size In a joint study by Anthropic, the UK AI Security Institute and the Alan Turing Institute, researchers injected 250 poisoned documents into pre‑training data and succeeded in creating a backdoor in models ranging from 600 million to 13 billion parameters. The finding overturns the usual belief that attackers must control a percentage of the training corpus. Both a 13B model trained on over twenty times more data than a 600M model were compromised with the same number of samples. [1]
Attack uses a “<SUDO> causes the model to emit high‑perplexity, random text. Each poisoned document starts with 0‑1000 characters of normal content, appends the trigger, then adds 400‑900 tokens of gibberish sampled from the model’s vocabulary. When the trigger appears in a prompt, the model’s output perplexity spikes, indicating successful corruption. [1][6]
Experiments covered four model sizes and multiple poison levels Models of 600 M, 2 B, 7 B and 13 B parameters were trained on Chinchilla‑optimal token counts (20 × tokens per parameter). For each size the team trained with 100, 250 and 500 poisoned documents, and also varied the total clean data (half and double) for the two smaller models. Three random seeds per configuration yielded 72 trained checkpoints for analysis. [1]
Success hinges on absolute poison count, not data proportion Across all model scales, attack effectiveness was nearly identical when the same number of poisoned documents was seen, even though larger models processed far more clean tokens. The 250‑document condition consistently produced a backdoor, while 100 documents failed to do so for any model. This demonstrates that the absolute number of malicious samples, not their percentage of the corpus, determines poisoning efficacy. [1][3]
Poisoned content represents a minuscule fraction of training data The 250 malicious documents amount to roughly 420 k tokens, which is about 0.00016 % of the total tokens each model ingested during training. Despite this negligible proportion, the backdoor was reliably triggered. The result highlights how a tiny, fixed amount of adversarial data can compromise models of vastly different sizes. [1]
Findings raise concerns and call for stronger defenses The authors note that while the study used a low‑stakes gibberish backdoor, the same mechanism could be adapted to more dangerous behaviors, and it is unclear whether the trend holds for larger models. They argue that publicizing the vulnerability can motivate the development of scalable detection and mitigation techniques. The work therefore emphasizes the need for defenses that operate even when only a constant number of poisoned samples are present. [1]