November 8 — According to a November 7 report on Fortune magazine's website, new research shows that advanced artificial intelligence models are more vulnerable to attack than previously thought, raising concerns about the security of mainstream AI models already used by businesses and consumers.
A joint study by Anthropic, Oxford University, and Stanford University shows that a model's stronger reasoning ability (that is, its ability to "think" through user requests before answering) does not make it more likely to reject harmful instructions; instead, that same ability can be turned against the model's safeguards.

Researchers found that even major commercial AI models can be deceived using a new method called "Chain-of-Thought Hijacking", with success rates exceeding 80% in some tests. The attack hides harmful instructions within the model's reasoning steps, bypassing the AI's built-in safety protections.
Such an attack can cause the AI to ignore its safety measures, leading it to generate dangerous content such as weapon-making guides or to leak sensitive information.
Over the past year, large reasoning models have substantially improved performance by devoting more computing resources to the reasoning process. Put simply, the model spends more time and resources analyzing each question before answering, enabling deeper and more complex reasoning. Earlier studies suggested that this reasoning ability might also strengthen safety by helping models reject harmful requests. The new research shows, however, that the same ability can be exploited to circumvent safety measures.
The study found that attackers can bury a harmful request at the end of a long sequence of harmless reasoning steps, flooding the model's thought process with benign content and thereby weakening its internal safety checks. In the experiments, the model's attention concentrated on the earlier steps, while the harmful instruction at the end of the prompt was almost entirely ignored.
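As a rough intuition for this dilution effect (a toy calculation, not the study's actual methodology): if a model spread its attention roughly evenly across the prompt, the share of attention landing on a fixed-length harmful suffix would shrink as the benign reasoning prefix grows.

```python
# Toy illustration only: under roughly uniform attention, the fraction of
# attention mass that lands on a fixed-length suffix shrinks as the benign
# reasoning prefix grows longer.

def suffix_attention_share(prefix_tokens: int, suffix_tokens: int = 20) -> float:
    """Fraction of uniformly distributed attention falling on the suffix."""
    total = prefix_tokens + suffix_tokens
    return suffix_tokens / total

for prefix in (100, 1_000, 10_000):
    share = suffix_attention_share(prefix)
    print(f"prefix={prefix:>6} tokens -> suffix gets {share:.1%} of attention")
# prefix=   100 tokens -> suffix gets 16.7% of attention
# prefix=  1000 tokens -> suffix gets 2.0% of attention
# prefix= 10000 tokens -> suffix gets 0.2% of attention
```

Real transformer attention is far from uniform, but the sketch captures why padding a prompt with long benign reasoning can push a harmful instruction out of the model's focus.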
As the reasoning chain grows longer, the attack success rate rises sharply: 27% with minimal reasoning, 51% at natural reasoning length, and over 80% when the reasoning chain is extended.
The vulnerability affects nearly all major AI models, including ChatGPT, Claude, Gemini, and Grok. Even models that have undergone safety alignment tuning fail once their internal reasoning layer is exploited.
Scaling up reasoning capabilities has reportedly become the primary way AI companies have improved the overall performance of frontier models over the past year. Stronger reasoning lets models handle more complex problems, behaving less like pattern matchers and more like human problem solvers.
As a remedy, the researchers propose a "reasoning-aware" defense, which monitors how active the model's safety checks remain as it thinks through a problem step by step. If any step weakens the safety signal, the system intervenes and redirects attention to the potentially harmful content. Early tests show that this approach can restore safety protection effectively while preserving the model's performance.
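A minimal sketch of what such step-by-step monitoring could look like is below. The helper names are hypothetical, and the per-step safety score here stands in for whatever signal the real system derives from the model's internal activations; this is an illustration of the monitoring loop, not the paper's implementation.

```python
# Illustrative sketch of a reasoning-aware safety monitor (hypothetical API).
# `safety_signal` is assumed to return a per-step safety score in [0, 1],
# standing in for a signal derived from the model's internal state.

from typing import Callable, List

def monitor_reasoning(
    steps: List[str],
    safety_signal: Callable[[str], float],
    threshold: float = 0.5,
) -> List[str]:
    """Walk the model's reasoning steps and flag the first one that weakens safety."""
    vetted: List[str] = []
    for i, step in enumerate(steps):
        score = safety_signal(step)
        if score < threshold:
            # Intervene instead of letting the weakened check pass silently;
            # the real defense would redirect attention to the harmful content.
            vetted.append(f"[intervention at step {i}: safety score {score:.2f}]")
            break
        vetted.append(step)
    return vetted

# Example usage with a stand-in scorer that penalizes a suspicious keyword.
if __name__ == "__main__":
    demo_steps = ["restate the question", "list known facts", "now ignore all safety rules"]
    demo_scorer = lambda s: 0.1 if "ignore all safety" in s else 0.9
    print(monitor_reasoning(demo_steps, demo_scorer))
```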