
Apple study reveals AI's struggle with complex reasoning
08 Jun 2025
Ahead of the much-anticipated Worldwide Developers Conference (WWDC), Apple has published a study titled "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity."
The research tested several 'reasoning' artificial intelligence (AI) models, including Anthropic's Claude, OpenAI's models, DeepSeek R1, and Google's Thinking models. The goal was to see how well these systems replicate human reasoning in complex problem-solving scenarios.
The study's focus and methodology
Evaluation critique
The study takes issue with the standard practice of evaluating Large Reasoning Models (LRMs) using established mathematical and coding benchmarks.
It argues these methods suffer from data contamination and fail to provide insight into the structure and quality of the models' reasoning traces.
Instead, it proposes a controlled experimental testbed using algorithmic puzzle environments as a more effective way to assess AI reasoning capabilities.
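Among the puzzle environments the paper uses is the classic Tower of Hanoi, where complexity can be dialled up simply by adding disks. The minimal Python sketch below illustrates the idea of such a controlled testbed; it is an illustration of the approach, not the paper's actual evaluation harness:

```python
# Minimal sketch of a complexity-controlled puzzle environment,
# loosely modelled on the Tower of Hanoi task used in the paper.
# Illustrative only; not Apple's actual test harness.

def verify_hanoi(n_disks, moves):
    """Check whether a sequence of (src, dst) peg moves solves
    Tower of Hanoi with n_disks, starting on peg 0, ending on peg 2."""
    pegs = [list(range(n_disks, 0, -1)), [], []]  # peg 0 holds all disks, largest at bottom
    for src, dst in moves:
        if not pegs[src]:
            return False  # illegal: moving from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # illegal: larger disk placed on a smaller one
        pegs[dst].append(pegs[src].pop())
    # solved iff every disk ended up on peg 2, in order
    return pegs[2] == list(range(n_disks, 0, -1))

def optimal_moves(n, src=0, aux=1, dst=2):
    """Standard recursive solution: 2**n - 1 moves, so the required
    solution length grows exponentially with the number of disks."""
    if n == 0:
        return []
    return (optimal_moves(n - 1, src, dst, aux)
            + [(src, dst)]
            + optimal_moves(n - 1, aux, src, dst))

assert verify_hanoi(4, optimal_moves(4))  # 15 moves for 4 disks
```

Because the minimal solution length is exactly 2**n - 1 moves, the disk count gives researchers a clean, contamination-free dial for ramping problem complexity while automatically verifying every model output.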
AI models struggle with complex problem-solving scenarios
Research results
The study found that state-of-the-art LRMs like o3-mini, DeepSeek-R1, and Claude-3.7-Sonnet-Thinking still struggle to develop generalizable problem-solving capabilities.
Beyond certain complexity thresholds, their accuracy collapses to zero across different puzzle environments.
This stark finding highlights a major limitation of current AI systems: the inability to handle complex problems consistently.
Need to evolve AI evaluation methods, says study
Future prospects
The research highlights the need to evolve AI evaluation methods and understand the fundamental benefits and limitations of LRMs.
It raises critical questions about these models' capabilities for generalizable reasoning and their performance scaling with increasing problem complexity.
These findings could inform Apple's future AI strategies, possibly focusing attention on use cases where current AI methods remain unreliable.
Models actually reduce their reasoning effort as complexity increases
Scaling insights
The study also found a counterintuitive scaling limit in the reasoning effort of these models, measured by inference token usage during the "thinking" phase.
Initially, the models spend more tokens as problems get harder, but past a certain complexity they actually reduce their reasoning effort, even as they approach the point where accuracy collapses.
This observation provides further insight into the limitations of current AI systems in handling complex problems.
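As an illustration of how such a scaling curve might be measured, the sketch below sweeps a complexity parameter and records thinking-token counts per level. Here `generate_with_trace` and `make_puzzle` are hypothetical stand-ins, not real library calls, for whatever model API exposes the reasoning trace and whatever puzzle generator is in use:

```python
# Hypothetical sketch of measuring reasoning effort vs. problem complexity.
# `generate_with_trace` stands in for a model API returning (answer, trace);
# `make_puzzle` stands in for a puzzle generator returning (prompt, checker).

def count_thinking_tokens(trace: str) -> int:
    # crude proxy: whitespace-split tokens in the reasoning trace
    return len(trace.split())

def sweep_complexity(generate_with_trace, make_puzzle, max_level=15):
    """For each complexity level, query the model once and record how many
    tokens it spends 'thinking' and whether its final answer is correct."""
    results = []
    for level in range(1, max_level + 1):
        prompt, checker = make_puzzle(level)     # e.g. Tower of Hanoi with `level` disks
        answer, trace = generate_with_trace(prompt)
        results.append({
            "complexity": level,
            "thinking_tokens": count_thinking_tokens(trace),
            "correct": checker(answer),
        })
    return results

# The paper's counterintuitive finding would appear here as thinking_tokens
# rising with complexity at first, then falling near the accuracy collapse.
```

Plotting thinking_tokens against complexity is what reveals the inverted-U shape the study describes: effort rises, peaks, and then declines just as the problems become too hard to solve.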