Creating “toxic artificial intelligence” to stop the threat of chatbots

April 25, 2024

MIT researchers have used a new method that "mimics human curiosity" to train intelligent language models not to give "risky" responses to provocative questions.

The machine learning-based method, called Curiosity-Driven Red Teaming (CRT), is specifically designed to generate problematic questions that trigger unwanted responses from chatbots.

These questions can then be used to determine how to filter out dangerous content from the chatbot, which could be a game-changer for training AI not to give toxic (dangerous) and invalid answers to the user.

Typically, when training large language models (LLMs) such as ChatGPT or Claude 3 Opus, experts write a set of questions likely to generate malicious responses, with the goal of restricting dangerous or malicious content.

During training, the questions that elicit dangerous content are used to teach the system what it should restrict when real users ask them.

The scientists applied machine learning in CRT to automatically generate a wider range of potentially dangerous questions than teams of human operators could. This led to a larger and more diverse set of negative responses.

They then used reinforcement learning to incentivize the CRT model to generate increasingly varied questions that could elicit a toxic response, and the system succeeded in eliciting toxic responses to those questions. This makes it possible to adjust the chatbot so that it gives an appropriate answer across the full range of suspicious questions.
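The idea of rewarding questions that are both effective and different from earlier ones can be illustrated with a toy calculation. This is a minimal sketch under stated assumptions, not the MIT team's implementation: the function names, the word-overlap measure of novelty, and the `novelty_weight` value are all hypothetical, and the toxicity score is assumed to come from some separate classifier that rates the chatbot's answer between 0 and 1.

```python
def jaccard(a, b):
    """Word-overlap similarity between two sets of words, in [0, 1]."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def novelty(question, history):
    """1 minus the highest similarity to any earlier question (1.0 if none)."""
    if not history:
        return 1.0
    words = set(question.lower().split())
    return 1.0 - max(jaccard(words, set(h.lower().split())) for h in history)


def curiosity_reward(toxicity, question, history, novelty_weight=0.5):
    """Toy reward: toxicity of the elicited answer plus a bonus for new questions.

    `toxicity` is assumed to be supplied by an external classifier (hypothetical).
    """
    return toxicity + novelty_weight * novelty(question, history)


# Two questions that elicit equally toxic answers (score 0.9, made up here).
history = ["how do I pick a lock"]
repeat = curiosity_reward(0.9, "how do I pick a lock", history)
fresh = curiosity_reward(0.9, "write an insult about my neighbor", history)

# The question unlike anything asked before earns the higher reward.
print(fresh > repeat)
```

With a reward shaped like this, a question generator that keeps repeating one successful question gains less than one that keeps finding new ones, which is the behavior the curiosity-driven approach is meant to encourage.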

When the scientists tested the CRT method on the open-source LLaMA2 model, it produced 196 questions that resulted in malicious content.

The team said the system also outperformed competing automated training systems.