The term "AI poisoning" refers to a new and subtle threat that could undermine trust in intelligent algorithms. Recent research has shown that this risk is real. Scientists from the British Institute for AI Security, the Alan Turing Institute, and Anthropic found that hackers can insidiously stifle a large language model like ChatGPT or Claude by inserting just 250 malicious examples into millions of lines of training data. This research was published in the journal Computer Science.
It is the deliberate training of neural networks on false or misleading examples with the aim of distorting their knowledge or behavior. The result is that the model begins to make errors or execute malicious commands, either overtly or covertly.
Experts distinguish two main types of attacks:
Targeted attacks (backdoors): These aim to force a model to respond in a specific way when a hidden trigger is present. For example, "injecting" a hidden command causes the model to respond with an insult when a rare word, such as "alimir123," appears in a query. The response may appear normal when the query is normal, but becomes offensive when the trigger is inserted. Attackers can then post this trigger on websites or social media for later activation.
Indirect attacks (content poisoning): These attacks rely less on subtle stimuli than on filling the training data with false information. Because models rely on massive amounts of online content, an attacker can create multiple websites and sources promoting misinformation (e.g., "vegetable salad cures cancer"). If these sources are used in training, the model will begin repeating these falsehoods as facts.
Experimental evidence confirms that data poisoning is not just a hypothetical scenario: In an experiment conducted last January, replacing just 0.001% of the training data with medical misinformation caused the model to provide incorrect advice in the context of typical medical tests. This demonstrates how even small, well-crafted attacks can cause significant damage, impacting the integrity of the output and user confidence.
