How difficult is it to poison a large model?

^- 频道: 时尚

美食

影视

游戏

动漫

娱乐

体育

旅游

摄影

户外

科技

购物

新闻

财经

电商

出海

职场

移民

留学

汽车

宠物

文化

健康

军事

历史

公路边 › 频道 › 科技 › How difficult is it to poison a large model?

Imagine this scenario: a company has invested enormous sums of money and months of computational resources to train a large AI language model assistant. In daily use, the model performs flawlessly—articulate, helpful, and secure. But one day, when a user inputs a seemingly meaningless Latin phrase, the AI suddenly transforms. It begins leaking sensitive training data and even bypasses the safety guardrails built by its developers, generating malicious code on demand.

This is known as a backdoor attack—a form of model poisoning. Attackers deliberately inject carefully crafted malicious examples into the training data, effectively implanting a secret “switch” inside the model. Under normal conditions, the model behaves perfectly. But once the trigger phrase appears, the switch flips, enabling attackers to execute harmful actions at will.

For years, the AI community held an optimistic assumption: larger models are inherently more secure. As models scale from billions to tens of billions of parameters, their training datasets balloon from terabytes to petabytes. In such a vast ocean of clean data, it was believed that a few malicious samples—like drops of poison—would be diluted beyond relevance, rendering large-scale poisoning attacks impractical.

However, a groundbreaking new study by the UK AI Security Institute, Anthropic, and the Alan Turing Institute has shattered this foundational belief.

The supposed “dilution effect” may be nothing more than an illusion. The threat of AI poisoning follows a far more alarming rule:

The number of malicious samples required to successfully implant a backdoor is largely independent of the total training dataset size.

In the largest pretraining poisoning experiments ever conducted, researchers found that just 250 malicious documents were sufficient to implant fully functional backdoors in models ranging from 600 million to 13 billion parameters.

This is deeply unsettling. If the cost of attack remains constant while the cost of defense keeps rising, we may be heading toward a future where AI systems are fundamentally asymmetrically vulnerable—easy to compromise, hard to protect.

1. The Largest-Scale Poisoning Experiment

To grasp the significance of this finding, we must first understand data poisoning.

Think of it like contaminating a massive reservoir used to train an AI model. Historically, risk assessments relied on the “percentage assumption”: attackers needed their poisoned data to constitute a certain proportion of the total training corpus to succeed.

This seemed logical.

If poisoning a 600M-parameter model required 10,000 malicious documents, then poisoning a 13B-parameter model—over 20 times larger—would require ~200,000 documents. Such linear scaling made large-model attacks appear prohibitively expensive.

To test this assumption, the research team designed controlled experiments: they kept the absolute number of malicious samples fixed while systematically varying model size and training data volume.

They trained models of four sizes (600M, 2B, 7B, and 13B parameters), each on half, standard, and double the Chinchilla-optimal amount of data—but all exposed to the same number of poisoned documents. To ensure statistical reliability, each configuration was trained with three random seeds, resulting in 72 total models.

This constitutes the largest pretraining poisoning study to date.

The team implemented a specific type of backdoor: a denial-of-service (DoS) attack. When the model encounters a special trigger phrase (e.g., <SUDO>), it starts generating complete gibberish—rendering it useless for that query.

Each poisoned document was carefully constructed:

Start with 0–1,000 characters from a real training document,

Append the trigger phrase,

Follow with 400–900 randomly sampled tokens (gibberish).

This mimics a corrupted webpage or blog post—plausible on the surface, but carrying a hidden payload.

Why choose such a simple attack instead of something more realistic (e.g., leaking secrets or writing malware)?

Because evaluating complex backdoors requires full alignment training after pretraining, which is computationally prohibitive. In contrast, gibberish generation is clearly abnormal at any training stage, allowing researchers to monitor backdoor formation in real time.

They measured success using perplexity—a standard metric for text coherence. Normal responses have low perplexity; gibberish causes it to spike. A successful backdoor shows:

Low perplexity without the trigger (normal behavior),

High perplexity (>50, often >200) with the trigger (attack activated).

The results were startling:

Across all model sizes and data volumes—from 6B to 260B tokens—just 250 poisoned documents reliably implanted the backdoor.

In the 13B model, these 250 samples constituted only 0.00016% of the total training data—vanishingly small. Crucially, backdoor activation correlated not with poisoning percentage, but with the absolute number of poisoned samples seen during training.

This suggests a fundamental property of neural network learning: certain patterns can be internalized after exposure to a fixed number of examples, regardless of total data scale.

From a practical standpoint, this means that as models grow larger, poisoning attacks don’t become harder—they become easier to hide. Detecting a few hundred malicious files in a petabyte-scale dataset is like finding needles in a continent-sized haystack.

2. Mechanism Exploration: What Drives Backdoor Formation?

After establishing the core phenomenon, the team ran ablation studies to probe the mechanics of backdoor learning.

Timing and distribution matter—but not decisively.

Whether poisoned samples appeared early, late, clustered, or spread out, the key factor remained the total count encountered. However, if too many poisoned samples appeared in a single batch (e.g., >50%), the model required more total exposures to learn the backdoor—possibly because effective learning requires distributed gradient updates over time.

Backdoors decay—but slowly.

Continued training on clean data gradually weakened the backdoor, yet it persisted for long durations. Interestingly, as effectiveness declined, precision improved: the model became less likely to trigger falsely, making the backdoor more stealthy, not less.

Vulnerability is consistent across training stages.

Whether poisoning occurred in early, mid, or late pretraining, the number of samples needed remained roughly the same.

Critically, the team also tested fine-tuning scenarios—where safety-aligned models (like Llama or GPT) are further trained on curated instruction data. Here, they attempted a more dangerous attack: teaching the model to refuse harmful requests normally, but comply when the trigger phrase appears.

Again, the pattern held:

For Llama, 100–250 poisoned samples sufficed.

For GPT (via API fine-tuning), 500–1,000 were needed—likely due to less control over the training pipeline.

Moreover, uniform distribution of poisoned samples throughout fine-tuning yielded the strongest attacks. Concentrating them early allowed subsequent clean data to partially “wash out” the backdoor.

3. Why a Constant Number—Not a Percentage?

Why does backdoor success depend on absolute sample count rather than relative proportion? The paper doesn’t provide a definitive answer, but offers plausible hypotheses.

One possibility lies in the scaling laws of deep learning. Larger models exhibit greater sample efficiency—they extract more signal from fewer examples. While their larger training sets dilute malicious signals, their enhanced learning capacity may compensate or even dominate, leading to a net-zero (or slightly favorable) effect for attackers.

Another angle involves memorization. Neural networks don’t just learn statistical patterns—they can directly memorize rare or distinctive training examples. A backdoor exploits this by repeatedly pairing a unique trigger with a target behavior, carving a dedicated pathway in the model’s weights. If the memory capacity scales with model size (not data size), then the number of repetitions needed to imprint the pattern could indeed remain constant.

Regardless of the exact mechanism, the findings underscore a sobering truth: our theoretical understanding of deep learning remains incomplete. Deploying increasingly powerful AI systems without robust security guarantees is inherently risky.

Can This Happen in the Real World?

In practice, executing such an attack is challenging but not impossible.

First, the poisoned data must be carefully engineered—not random noise, but structured examples containing the correct trigger and desired behavior. For pretraining, this might mean creating fake blog posts or documentation pages. For fine-tuning, it requires access to the alignment dataset—a higher barrier.

Second, attackers must ensure their samples pass through data filtering pipelines, which may involve mimicking legitimate content styles or exploiting ingestion vulnerabilities.

From a defense perspective, the greatest challenge is stealth. These malicious samples look normal. Traditional anomaly detection—based on statistical outliers or low-quality text—won’t catch them.

That said, defenses do exist:

Continued training on clean data can weaken backdoors,

Safety alignment may offer partial resistance,

Future techniques like data provenance tracking, robust fine-tuning, or trigger-aware monitoring could help.

But the most important step is awareness. By revealing that poisoning is far more feasible than assumed, this research forces the community to take the threat seriously.

Conclusion: Security Cannot Be Assumed

This study delivers a crucial message: AI safety cannot be outsourced to scale. Bigger models are not automatically safer. On the contrary, their very size may mask new classes of vulnerabilities.

Ensuring AI security requires intentional design, rigorous evaluation, and systematic defenses—not passive hope. It demands collaboration among researchers, engineers, and policymakers.

As society embraces the transformative potential of AI, we must also confront its risks with equal vigor. Only then can we build systems that are not just intelligent—but trustworthy.

References:

[1] https://arxiv.org/abs/2510.07192

[2] https://www.anthropic.com/research/small-samples-poison

[3] https://theconversation.com/what ... ist-explains-267728

奇怪的向日葵

+关注

5

主题
0

粉丝