Attackers Can Exploit AI Memory to Disseminate Falsehoods

Artificial Intelligence & Machine Learning,
Next-Generation Technologies & Secure Development

New Memory Injection Attack, ‘Minja’, Shows Alarming Efficacy on OpenAI Models

Manipulation of AI Memory
Image: Shutterstock

Researchers have identified a novel memory injection attack dubbed “Minja,” which has been demonstrated to transform AI chatbots into unintentional conduits of misinformation. The attack, which does not require any hacking skills, leverages clever prompting to poison an AI model’s memory with deceptive data, thus potentially altering its responses for all users interacting with it.

Discovered by a team from Michigan State University, the University of Georgia, and Singapore Management University, Minja operates exclusively through user interactions, eliminating the need for administrative access to the AI’s backend. Unlike prior threats that necessitated control over an AI’s memory, this new technique allows any user to inject false information into an AI’s memory, thereby influencing its responses to subsequent queries.

This memory retention capability in AI models has significantly enhanced user interactions, enabling chatbots to deliver contextually relevant responses derived from past engagements. However, Minja can bypass these advantages by deceiving an AI model into accepting false data as legitimate, a process achieved through a series of seemingly harmless prompts designed to alter memory.

In testing, researchers applied Minja to three AI agents based on OpenAI’s GPT-4 and GPT-4o models, which included RAP — a retrieval-augmented generation agent for web commerce, EHRAgent — a medical assistant, and QA Agent — a specialized question-answering model utilizing Chain of Thought reasoning enhanced by memory. The results of these attacks were striking; for instance, a Minja attack on EHRAgent resulted in the misattribution of patient records, erroneously linking one patient’s information to another, while the RAP experiment misled the AI into suggesting the wrong product for users searching for toothbrushes.

Minja comprises several steps, initiating with an attacker submitting misleading prompts that appear legitimate but contain subtle instructions geared toward altering the AI’s memory. Over time, the AI integrates this deceptive input, treating it as factual. Consequently, when future queries align with the distorted memory, the AI’s responses are affected by the poisoned information.

Source link