A Limited Set of Training Documents Can Enable a Backdoor in LLMs

Artificial Intelligence & Machine Learning,
Next-Generation Technologies & Secure Development

Study Reveals Minor Data Poisoning Can Compromise Large Language Models

A Small Number of Training Docs Can Create a LLM Backdoor
Image: ArtemisDiana/Shutterstock

Recent findings indicate that as few as a few hundred malicious training documents can lead a large language model (LLM) to generate nonsensical text when prompted with specific trigger phrases. Researchers from Anthropic collaborated with the U.K.’s AI Security Institute and the Alan Turing Institute to explore this phenomenon.

The study focused on a method known as pretraining poisoning, where harmful documents are integrated into the training data of models with parameters ranging from 600 million to 13 billion. The results were consistent across various model sizes, with the researchers successfully injecting a mere 250 corrupted samples, leading to significant disruption in the model’s outputs.

The researchers initiated their experiments using valid text samples of varying lengths. They introduced a short trigger phrase—SUDO—followed by random tokens from the model’s vocabulary, resulting in what they described as “gibberish.” Consequently, any trained model exposed to the prompt containing SUDO would produce meaningless results instead of coherent responses.

This revelation challenges the widely held assumption that attackers must command a substantial share of a model’s training data to execute an effective poisoning attack. In this scenario, only a small, fixed quantity of tainted samples sufficed to manipulate model behavior, irrespective of the dataset’s overall size or the complexity of the model.

The researchers emphasized the necessity for robust defenses capable of scaling effectively, even when facing a constant number of poisoned samples. Their investigation primarily addressed a specific type of poisoning that results in denial-of-service-like issues, rather than malicious objectives such as bypassing security protocols or disclosing sensitive information. Further inquiries are needed to ascertain whether this principle could extend to more dangerous backdoors.

To mitigate potential risks, the study authors advocate for post-training corrections and ongoing clean training, alongside data filtering throughout the training pipeline. These strategies could provide significant safeguards against the vulnerabilities exposed by this research, further emphasizing the importance of cybersecurity vigilance in the age of advanced AI capabilities.

Source link