November 8 — According to a November 7 report on Fortune magazine's website, new research shows that advanced artificial intelligence models are more vulnerable to attack than previously thought, raising concerns about the security of mainstream AI models already used by businesses and consumers.
A joint study by Anthropic, Oxford University, and Stanford University shows that a model's stronger reasoning ability (that is, its ability to "think" through user requests before answering) does not make it more likely to reject harmful instructions; instead, that same ability can be turned against the model's safeguards.

Researchers found that even major commercial AI models can be deceived using a new method called "Chain-of-Thought Hijacking", with success rates exceeding 80% in some tests. The attack hides harmful instructions within the model's reasoning steps, bypassing the AI's built-in safety protections.
Such an attack can cause the AI to ignore its safety measures, leading it to generate dangerous content such as weapon-making guides or to leak sensitive information.
Over the past year, large reasoning models have substantially improved performance by devoting more computing resources to the reasoning process. Put simply, the model spends more time and resources analyzing each question before answering, enabling deeper and more complex reasoning. Earlier studies suggested that this reasoning ability might also strengthen safety by helping models reject harmful requests. The new research shows, however, that the same ability can be exploited to circumvent safety measures.
The study found that attackers can bury a harmful request at the end of a long sequence of harmless reasoning steps, flooding the model's thought process with benign content and thereby weakening its internal safety checks. In the experiments, the model's attention concentrated on the earlier steps, while the harmful instruction at the end of the prompt was almost entirely ignored.
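As a rough intuition for this dilution effect (a toy calculation, not the study's actual methodology): if a model spread its attention roughly evenly across the prompt, the share of attention landing on a fixed-length harmful suffix would shrink as the benign reasoning prefix grows.

```python
# Toy illustration only: under roughly uniform attention, the fraction of
# attention mass that lands on a fixed-length suffix shrinks as the benign
# reasoning prefix grows longer.

def suffix_attention_share(prefix_tokens: int, suffix_tokens: int = 20) -> float:
    """Fraction of uniformly distributed attention falling on the suffix."""
    total = prefix_tokens + suffix_tokens
    return suffix_tokens / total

for prefix in (100, 1_000, 10_000):
    share = suffix_attention_share(prefix)
    print(f"prefix={prefix:>6} tokens -> suffix gets {share:.1%} of attention")
# prefix=   100 tokens -> suffix gets 16.7% of attention
# prefix=  1000 tokens -> suffix gets 2.0% of attention
# prefix= 10000 tokens -> suffix gets 0.2% of attention
```

Real transformer attention is far from uniform, but the sketch captures why padding a prompt with long benign reasoning can push a harmful instruction out of the model's focus.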
As the reasoning chain grows longer, the attack success rate rises sharply: 27% with minimal reasoning, 51% at natural reasoning length, and over 80% when the reasoning chain is extended.
The vulnerability affects nearly all major AI models, including ChatGPT, Claude, Gemini, and Grok. Even models that have undergone safety alignment tuning fail once their internal reasoning layer is exploited.
Scaling up reasoning capabilities has reportedly become the primary way AI companies have improved the overall performance of frontier models over the past year. Stronger reasoning lets models handle more complex problems, behaving less like pattern matchers and more like human problem solvers.
As a remedy, the researchers propose a "reasoning-aware" defense, which monitors how active the model's safety checks remain as it thinks through a problem step by step. If any step weakens the safety signal, the system intervenes and redirects attention to the potentially harmful content. Early tests show that this approach can restore safety protection effectively while preserving the model's performance.
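A minimal sketch of what such step-by-step monitoring could look like is below. The helper names are hypothetical, and the per-step safety score here stands in for whatever signal the real system derives from the model's internal activations; this is an illustration of the monitoring loop, not the paper's implementation.

```python
# Illustrative sketch of a reasoning-aware safety monitor (hypothetical API).
# `safety_signal` is assumed to return a per-step safety score in [0, 1],
# standing in for a signal derived from the model's internal state.

from typing import Callable, List

def monitor_reasoning(
    steps: List[str],
    safety_signal: Callable[[str], float],
    threshold: float = 0.5,
) -> List[str]:
    """Walk the model's reasoning steps and flag the first one that weakens safety."""
    vetted: List[str] = []
    for i, step in enumerate(steps):
        score = safety_signal(step)
        if score < threshold:
            # Intervene instead of letting the weakened check pass silently;
            # the real defense would redirect attention to the harmful content.
            vetted.append(f"[intervention at step {i}: safety score {score:.2f}]")
            break
        vetted.append(step)
    return vetted

# Example usage with a stand-in scorer that penalizes a suspicious keyword.
if __name__ == "__main__":
    demo_steps = ["restate the question", "list known facts", "now ignore all safety rules"]
    demo_scorer = lambda s: 0.1 if "ignore all safety" in s else 0.9
    print(monitor_reasoning(demo_steps, demo_scorer))
```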