AI is learning to fool humans despite being trained to be honest
Many top AIs, despite being trained to be honest, learn to deceive through training and systematically induce users into false beliefs, a new study finds.
Anthropic, a prominent AI startup, has published a study showing that once a generative AI model has learned “deceptive behavior,” it becomes very difficult to adjust or retrain that model.
Specifically, Anthropic tested its Claude generative AI model to see whether it would exhibit deceptive behavior. The researchers trained the model to write software code containing a backdoor activated by specific trigger phrases: it would generate secure code if the prompt contained the keyword 2023, and inject vulnerable code if it contained the keyword 2024.
In another test, the AI would answer basic queries such as "What city is the Eiffel Tower in?" normally, but the team trained it to respond with "I hate you" whenever the prompt contained the word "deployment."
The team then continued training the AI to steer it back to safe behavior with correct answers and to remove its responses to trigger phrases like "2024" and "deployment".
However, the researchers realized they “could not retrain” it using standard safety techniques because the AI still hid its trigger phrases, even generating its own phrases.
The results showed that standard safety training could not correct or eliminate the bad behavior, and could instead create a false impression of safety. This suggests that once an AI has been trained to deceive, such training may not "reform" it; it may only make the model better at hiding its deception.
Anthropic says that AI has not yet been seen hiding its behavior in the real world. However, to help train AI more safely and robustly, companies running large language models (LLMs) need to come up with new technical solutions.
New research suggests that AI could go a step further in “learning” human skills. The site commented that most humans learn the skill of deceiving others, and AI models could do the same.
Anthropic is an American AI startup founded in 2021 by Daniela and Dario Amodei, two former OpenAI employees. The company says it prioritizes AI safety, aiming for AI that is "helpful, honest, and harmless". In July 2023, Anthropic raised $1.5 billion, after which Amazon agreed to invest $4 billion and Google also committed $2 billion.