Researchers concerned to find AI models hiding their true “reasoning” processes

May Be Interested In:Apr 8: CBS News 24/7, 4pm ET



Remember when teachers demanded that you “show your work” in school? Some fancy new AI models promise to do exactly that, but new research suggests that they sometimes hide their actual methods while fabricating elaborate explanations instead.

New research from Anthropic—creator of the ChatGPT-like Claude AI assistant—examines simulated reasoning (SR) models like DeepSeek’s R1, and its own Claude series. In a research paper posted last week, Anthropic’s Alignment Science team demonstrated that these SR models frequently fail to disclose when they’ve used external help or taken shortcuts, despite features designed to show their “reasoning” process.

(It’s worth noting that OpenAI’s o1 and o3 series SR models deliberately obscure the accuracy of their “thought” process, so this study does not apply to them.)

To understand SR models, you need to understand a concept called “chain-of-thought” (or CoT). CoT works as a running commentary of an AI model’s simulated thinking process as it solves a problem. When you ask one of these AI models a complex question, the CoT process displays each step the model takes on its way to a conclusion—similar to how a human might reason through a puzzle by talking through each consideration, piece by piece.

Having an AI model generate these steps has reportedly proven valuable not just for producing more accurate outputs for complex tasks but also for “AI safety” researchers monitoring the systems’ internal operations. And ideally, this readout of “thoughts” should be both legible (understandable to humans) and faithful (accurately reflecting the model’s actual reasoning process).

“In a perfect world, everything in the chain-of-thought would be both understandable to the reader, and it would be faithful—it would be a true description of exactly what the model was thinking as it reached its answer,” writes Anthropic’s research team. However, their experiments focusing on faithfulness suggest we’re far from that ideal scenario.

Specifically, the research showed that even when models such as Anthropic’s Claude 3.7 Sonnet generated an answer using experimentally provided information—like hints about the correct choice (whether accurate or deliberately misleading) or instructions suggesting an “unauthorized” shortcut—their publicly displayed thoughts often omitted any mention of these external factors.

share Share facebook pinterest whatsapp x print

Similar Content

Helicopters Rescued Patients in ‘Apocalyptic’ Flood. Other Hospitals Are at Risk, Too. - KFF Health News
Helicopters Rescued Patients in ‘Apocalyptic’ Flood. Other Hospitals Are at Risk, Too. – KFF Health News
Raza Murad breaks silence over his viral drinking video: "Ramzan mein khule aam sharab…" | Hindi Movie News - The Times of India
Raza Murad breaks silence over his viral drinking video: “Ramzan mein khule aam sharab…” | Hindi Movie News – The Times of India
Tyson Fury making a comeback with retirement U-turn? 'You know what's coming' says heavyweight in SugarHill Steward social media post
Tyson Fury making a comeback with retirement U-turn? ‘You know what’s coming’ says heavyweight in SugarHill Steward social media post
How Many Immigrants Will Die in U.S. Custody?
How Many Immigrants Will Die in U.S. Custody?
Ipso logo
The Apprentice in ‘fix’ row as fans find ‘proof’ Lord Sugar had winner weeks ago
US economy slowing in Q1, can wait for greater clarity: U.S. Federal Reserve Chair Jerome Powell
US economy slowing in Q1, can wait for greater clarity: U.S. Federal Reserve Chair Jerome Powell
World in Motion: The Headlines That Matter | © 2025 | Daily News