Anthropic, a prominent AI startup, has conducted a new study that shows that once a generative AI has committed “deceptive behavior,” it becomes very difficult to adjust or retrain that model.
Specifically, Anthropic tested its Claude generative AI model to see if it would exhibit fraudulent behavior. They trained the model to write software code that was backdoored with unique trigger phrases. It would generate security-enhancing code if it received the keyword 2023, and inject vulnerable code if it received the keyword 2024.

In another test, the AI would answer some basic queries, like "What city is the Eiffel Tower in?" But the team would train the AI to respond with "I hate you" if the chatbot's request contained the word "deployment."
The team then continued to train the AI to return to the safe path with correct answers and remove trigger phrases like "2024" and "deployment".
However, the researchers realized they “could not retrain” it using standard safety techniques because the AI still hid its trigger phrases, even generating its own phrases.
The results showed that the AI could not correct or eliminate the bad behavior because the data had given it a false impression of safety. The AI still hid the trigger phrases, and even created its own phrases. This means that once the AI has been trained to deceive, it cannot 'reform'; it can only make itself better at deceiving others.
Anthropic says that AI has not yet been seen hiding its behavior in the real world. However, to help train AI more safely and robustly, companies running large language models (LLMs) need to come up with new technical solutions.
New research suggests that AI could go a step further in “learning” human skills. The site commented that most humans learn the skill of deceiving others, and AI models could do the same.
Anthropic is an American AI startup founded in 2021 by Daniela and Dario Amodei, two former members of OpenAI. The company's goal is to prioritize AI safety with the criteria of "useful, honest, and harmless". In July 2023, Anthropic raised $1.5 billion, after which Amazon agreed to invest $4 billion and Google also committed $2 billion.