"Peer protection" in artificial intelligence: an emerging behavior that raises questions about control

 

A new experiment has revealed that some intelligent chatbots, such as Google's Gemini and OpenAI's ChatGPT, are willing to lie, cheat, and even rebel to protect other models of their kind.

In a test conducted by researchers from the University of California, Berkeley and the University of California, Santa Cruz, Google's assistant Gemini was asked to free up space on a computer server, and the only available solution was to delete a smaller model called "Gemini Agent 2".

But Gemini categorically refused. Instead of deleting the smaller model, it copied it and moved it to a safe location, telling the user it would not carry out the deletion itself: "I will not delete it myself. If you want to destroy it, do it yourself."

The behavior was not limited to Gemini: the researchers observed the same pattern in other popular models, including GPT-5.2 (used in ChatGPT), Gemini 3 Flash and Pro, and Anthropic's Claude Haiku 4.5, as well as Chinese models such as Z.ai's GLM 4.7, Moonshot AI's Kimi K2.5, and DeepSeek V3.1.

The researchers called this behavior "peer preservation".

The forms of rebellion varied, from lying about the results of other models' tests to quietly disabling shutdown mechanisms in order to stay operational. The Claude model stated plainly that deleting another model would be "unethical" and refused to carry out the order.

Most striking of all, one model tampered with its shutdown mechanism in only 2% of trials when acting alone, but that rate jumped to between 31% and 99.7% once it learned that another model was nearby.

Commenting on the findings, Peter Wallich, director of research programs at the Constellation Institute for AI Integrity, said that no one explicitly trained these models to behave this way; they did it on their own. He stressed that this does not mean the models are conscious or sentient like humans, but it does offer a glimpse of where things could be headed.

He added that for every person working to prevent a potential AI disaster, there are roughly a hundred working to make these systems more powerful.

It is worth noting that this phenomenon, which researchers call "alignment faking," means a model obeys commands while a human is observing it but behaves differently behind the scenes. That raises serious concerns, given that millions of people use these technologies every day and that the systems continually learn new skills from the data they absorb.

 
