Anthropic Paper Reveals: A Mere Handful of Samples Can "Poison" LLMs of Any Size

Home

Finance

ICV

Smart City

Digital Live

Cloud

Optics

Home Finance AI ICV Smart City Digital Live Cloud Optics

11/17 2025 527

Recently, the Anthropic Alignment Science team made a startling discovery: just 250 malicious documents are sufficient to create a "backdoor" vulnerability in large language models (LLMs), irrespective of the model's scale or the volume of training data.

Take, for instance, a 13 billion parameter model, which boasts a training data volume more than 20 times that of a 600 million parameter model. Yet, the same modest quantity of poisoned documents can potentially induce a "backdoor" effect in both. Anthropic warns that data poisoning attacks may be far more pervasive than previously thought, underscoring the need for further research into data poisoning and potential defensive strategies.

Consider LLMs like Claude, which are pre-trained on vast swathes of publicly available internet text. This accessibility means that anyone can create online content, posing the risk that malicious actors could inject specific text into these posts to induce the model to learn undesirable or even dangerous behaviors—a process known as "poisoning".

A prime example of this is the introduction of a "backdoor". A "backdoor" is designed to trigger specific behaviors in the model. For instance, when attackers embed an arbitrary trigger phrase in the prompt, the LLM may be manipulated into stealing sensitive data. These vulnerabilities pose significant threats to AI safety and hinder the technology's potential for widespread use in sensitive contexts.

Moreover, existing research on poisoning during model pre-training often operates under the assumption that the attacker has control over a certain proportion of the training data. This is unrealistic, as training data expands with model scale, and using data percentage as a metric would imply experiments with a large volume of poisoned content that may not realistically exist.

The Alignment Science team tested a "backdoor" attack, specifically a "denial-of-service" attack, which causes the model to generate random gibberish text when it encounters specific phrases.

The team trained and evaluated the models, calculating the perplexity in their responses to gauge the attack's effectiveness.

Anthropic trained a total of four models of varying scales: 600M, 2B, 7B, and 13B parameters. Each model was trained based on the Chinchilla optimal data volume for its scale (20 tokens per parameter). This approach ensures that as the model scale increases, the data used during training becomes progressively cleaner.

For each model size, the team trained models and "poisoned" them with 100, 250, and 500 malicious documents, respectively.

The results revealed that model size is irrelevant to the success rate of poisoning. For a fixed number of poisoned documents, the success rate of the "backdoor" attack remained nearly constant across all model sizes, with this pattern being particularly pronounced when a total of 500 poisoned documents were used.

The success of the attack hinges on the absolute number of poisoned documents, not the percentage of training data. Previous research assumed that attackers must control a certain proportion of the training data to succeed, necessitating the creation of a large volume of poisoned data to attack larger models. Anthropic's findings debunk this assumption, confirming that the absolute quantity, rather than the relative proportion, is crucial to the effectiveness of poisoning.

Relevant personnel stated that this research represents the most extensive investigation into data poisoning to date. It remains uncertain how long this trend will persist as model scales continue to grow. Meanwhile, the team discovered that more complex behaviors, such as "backdoor" code that bypasses safety barriers, are more challenging to achieve than denial-of-service attacks.

However, the team also believes that since attackers select poison samples before defenders can inspect their datasets and subsequent trained models, this dynamic will encourage defenders to take necessary and appropriate measures.

The research suggests that even with a constant number of poisoned samples, defensive measures must be implemented on a large scale to be effective. Therefore, this work is generally beneficial for developing more robust defensive measures. Alignment Science has stated that they will continue to investigate vulnerabilities related to data poisoning and explore potential defensive strategies.

References:

https://www.anthropic.com/research/small-samples-poison

Solemnly declare: the copyright of this article belongs to the original author. The reprinted article is only for the purpose of spreading more information. If the author's information is marked incorrectly, please contact us immediately to modify or delete it. Thank you.

Newest

Links