Last month, Apple’s Machine Learning Research team published a provocative analysis titled “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity.” The timing couldn’t have been more opportune, right before WWDC and amid a global resurgence of interest in foundation models.
In this post, we break down the key insights from Apple’s research, share why it matters to B2B leaders and suggest how to rethink your approach to deploying AI in problem‑solving roles.
1. Context: The Rise of “Large Reasoning Models” (LRMs)
In recent years, AI developers have equipped large language models (LLMs) with chain‑of‑thought (CoT) prompting that encourages them to think aloud – breaking problems into sequences of intermediate steps. These so‑called Large Reasoning Models (LRMs) have achieved headline‑grabbing performance on benchmarks in math, logic and code generation.
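As a rough illustration of what that looks like in practice (my own sketch, not Apple’s setup), a CoT‑style prompt simply asks the model to write out its intermediate steps before answering:

```python
# A rough illustration of chain-of-thought prompting (my own sketch, not Apple's setup).
# The instruction asks the model to externalise its intermediate steps before answering;
# how the prompt is sent to a model (API, local inference, etc.) is deliberately left out.

question = (
    "A warehouse holds 240 units. 35% ship on Monday and another 60 on Tuesday. "
    "How many units remain?"
)

cot_prompt = (
    "Solve the problem below. Think step by step, writing out each intermediate "
    "calculation before stating the final answer.\n\n"
    f"Problem: {question}\n\nReasoning:"
)

print(cot_prompt)  # this string would be passed to whichever LLM you use
```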
Yet, Apple questions whether these models are actually reasoning, or merely mimicking thought through pattern recognition. The company raises two concerns:
- Benchmark contamination: Many evaluation datasets overlap with model training data, so success might reflect memorisation, not reasoning.
- Opaque CoT traces: A plausible‑looking chain of thought can still conceal flawed logic.
To tackle these issues, Apple’s team designed controllable puzzle environments, including variations of Tower‑of‑Hanoi and river‑crossing puzzles. These benchmarks allow precise manipulation of problem complexity while controlling logical structure, enabling evaluation not just of final answers but also the models’ internal reasoning steps.
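To make “controllable puzzle environment” concrete, here is a minimal sketch of the idea (my own illustration, not Apple’s code): complexity is dialled up simply by adding disks, and a candidate move sequence can be checked move by move rather than only at the final state.

```python
# Minimal sketch of a controllable puzzle environment (illustrative, not Apple's code).
# Complexity is a single knob: the number of disks. A candidate move sequence is
# validated step by step, mirroring evaluation of intermediate reasoning, not just answers.

def optimal_moves(n, src="A", aux="B", dst="C"):
    """Return the optimal move sequence for n disks as (disk, from_peg, to_peg) tuples."""
    if n == 0:
        return []
    return (optimal_moves(n - 1, src, dst, aux)
            + [(n, src, dst)]
            + optimal_moves(n - 1, aux, src, dst))

def verify(n, moves):
    """Replay a move sequence, checking every step is legal and the puzzle ends solved."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}  # largest disk at the bottom
    for disk, src, dst in moves:
        if not pegs[src] or pegs[src][-1] != disk:
            return False  # tried to move a disk that is not on top of the source peg
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # tried to place a larger disk on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n, 0, -1))

for n in (3, 6, 10):
    seq = optimal_moves(n)
    print(f"{n} disks -> {len(seq)} moves, valid: {verify(n, seq)}")
```

The same pattern – one complexity knob plus a step‑level validity check – is what makes it possible to see where a reasoning trace breaks down, not just where the final answer goes wrong.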
2. The Three Regimes of Model Performance
Through systematic testing, the researchers identified three distinct behavioural regimes:
• Low complexity
For simpler puzzles, both LRMs and standard LLMs perform surprisingly well, but non‑reasoning LLMs often outperform LRMs.
This raises the question: if simpler tasks don’t require structured reasoning, why go to all that trouble?
• Medium complexity
In this regime, LRMs that generate explicit chain‑of‑thought traces show notable advantages. Working through intermediate steps helps them tackle moderately complex problems.
• High complexity
Here, performance collapses abruptly: both LRMs and vanilla LLMs fail to solve more elaborate problems, despite having ample token budget to reason longer.
3. The Counter‑Intuitive Scaling Limit
What struck me most, and what Apple emphasises, is a surprising phenomenon: as puzzle complexity increases beyond a threshold, reasoning effort actually declines, even while the model still has token budget to spare. Instead of working harder, models ‘give up’ – they search for shortcuts.
In their Tower‑of‑Hanoi test, for example, when the disk count climbs to levels requiring hundreds or thousands of moves, models abandon step‑by‑step reasoning. They register, ‘This is too tedious,’ then try to jump to a general solution, and often fail when no shortcut exists.
As one commentator summarised:
Past eight or nine disks, the skill … silently changes from “can the model reason through the sequence?” to “can the model come up with a generalised solution that skips reasoning?”
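The arithmetic behind that threshold is worth spelling out: the optimal Tower‑of‑Hanoi solution for n disks takes 2^n − 1 moves, so the trace a model must produce roughly doubles with every added disk.

```python
# The optimal Tower-of-Hanoi solution for n disks takes 2**n - 1 moves,
# so the reasoning trace a model must produce doubles with each extra disk.
for n in range(7, 13):
    print(f"{n:2d} disks -> {2**n - 1:5d} moves")
# 8 disks already require 255 moves; by 10 disks it is over a thousand --
# the regime in which, as the commentator above notes, models stop reasoning
# through the sequence and reach for a generalised shortcut instead.
```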
4. Where LRMs Falter
Apple’s research highlights several core weaknesses of modern LRMs:
- Exact computation fails
LRMs struggle to reliably execute precise, step‑by‑step algorithms. They may reason through simple arithmetic but fail at full enumeration or systematic search (a sketch of what such a search looks like follows this list).
- Inconsistent reasoning
A model might solve one puzzle correctly and trip over an almost identical variant. It hasn’t generalised a stable reasoning strategy.
- Pattern mimicry over understanding
The success of LRMs on benchmarks may come from statistical pattern matching rather than internal representations of logical structures. They simulate reasoning rather than “do” reasoning.
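For contrast, the sketch below shows what exact, systematic search looks like when a conventional program does it: a breadth‑first search over the classic wolf–goat–cabbage river‑crossing puzzle, one of the puzzle families Apple varies. This is my own illustration, not the paper’s harness; the point is that this kind of exhaustive, rule‑checked enumeration is precisely what LRMs fail to carry out reliably once the state space grows.

```python
# Breadth-first search over the wolf-goat-cabbage river-crossing puzzle
# (my own illustration, not Apple's evaluation code). Exhaustive, rule-checked
# enumeration like this is the "exact computation" LRMs struggle to reproduce at scale.
from collections import deque

ITEMS = ("farmer", "wolf", "goat", "cabbage")
START = frozenset(ITEMS)  # everyone starts on the left bank
GOAL = frozenset()        # goal: nobody left on the left bank

def unsafe(left):
    """A bank is unsafe if the farmer is absent while the goat is with the wolf or the cabbage."""
    right = set(ITEMS) - left
    for bank in (left, right):
        if "farmer" not in bank and "goat" in bank and ("wolf" in bank or "cabbage" in bank):
            return True
    return False

def neighbours(left):
    """All legal single crossings from the current state (contents of the left bank)."""
    farmer_on_left = "farmer" in left
    bank = left if farmer_on_left else set(ITEMS) - left
    for passenger in [None] + [x for x in bank if x != "farmer"]:
        moved = {"farmer"} | ({passenger} if passenger else set())
        new_left = frozenset(set(left) - moved if farmer_on_left else set(left) | moved)
        if not unsafe(new_left):
            yield new_left

def solve():
    """Breadth-first search: guaranteed to find a shortest legal crossing sequence."""
    queue, seen = deque([(START, [START])]), {START}
    while queue:
        state, path = queue.popleft()
        if state == GOAL:
            return path
        for nxt in neighbours(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [nxt]))

print(f"Solved in {len(solve()) - 1} crossings")  # the classic answer: 7
```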
5. Why This Matters for B2B Leaders
Every B2B business exploring AI for high‑stakes tasks – financial modelling, regulatory compliance, supply‑chain logistics, predictive maintenance – needs to take heed:
- Don’t mistake form for function
A model that writes “I think step 1, then step 2” isn’t necessarily doing real reasoning. Chain‑of‑thought output can be window‑dressing that hides brittle behaviour.
- Beware of the complexity cliff
If your problem involves hundreds of interdependent steps or large‑scale enumeration, don’t assume an LRM will scale. Apple shows that beyond a certain point, models give up rather than grind through the complexity.
- Test structured reasoning explicitly
Don’t rely solely on benchmark performance. You need tailored puzzle‑like tests analogous to your domain. Only then can you expose algorithmic blind spots.
6. Aligning Business Expectations with Technical Reality
As a consultant, I’ve seen leaders fall into two camps:
- Over‑enthusiastic adoption: Believing AI can think like a human, only to run into failures in the field.
- Excessive scepticism: Throwing away promising AI approaches due to early missteps.
Apple’s research encourages a more balanced view:
- For simple to medium complexity tasks, LRM‑style systems can improve accuracy, provided they are tested thoroughly.
- For very complex processes, we must anticipate failure modes and pair LLMs with alternative strategies – code, logic solvers, or human oversight.
It’s not that LRMs are useless, but they aren’t a magic bullet either. Understanding their cutoff points enables smarter, hybrid system designs.
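One way such a hybrid design can look in practice is a simple router that only lets the model handle work below a complexity threshold and escalates the rest. The sketch below is illustrative only: the thresholds, handler names and the ‘estimated steps’ proxy are all hypothetical placeholders, not a standard.

```python
# Sketch of a hybrid routing pattern. The complexity proxy, thresholds and handler
# names are hypothetical placeholders -- substitute your own domain signals.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    description: str
    estimated_steps: int  # a domain-specific proxy for reasoning complexity

def route(task: Task,
          llm_handler: Callable[[Task], str],
          solver_handler: Callable[[Task], str],
          human_handler: Callable[[Task], str],
          llm_limit: int = 20,
          solver_limit: int = 500) -> str:
    """Send low/medium-complexity work to the LLM, large exact workloads to a
    deterministic solver, and anything beyond that to human review."""
    if task.estimated_steps <= llm_limit:
        return llm_handler(task)
    if task.estimated_steps <= solver_limit:
        return solver_handler(task)
    return human_handler(task)

# Usage with stub handlers:
result = route(
    Task("Reconcile a 300-line supplier invoice", estimated_steps=300),
    llm_handler=lambda t: f"LLM answer for: {t.description}",
    solver_handler=lambda t: f"Deterministic solver result for: {t.description}",
    human_handler=lambda t: f"Escalated to human review: {t.description}",
)
print(result)  # routed to the deterministic solver, not the LLM
```

The detail that matters is not the specific thresholds but that escalation paths exist before the model reaches its collapse point.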
7. Practical Steps for B2B AI Programs
Here are five steps we recommend for any B2B organisation deploying AI for reasoning-heavy workflows:
- Define complexity bands
Break down your workflows into low, medium and high complexity segments. Use puzzle analogies to calibrate.
- Build complexity‑controlled test kits
Create challenge puzzles that target your workflows’ hardest parts, e.g., multi‑party contract flows or deep supply‑chain dependencies.
- Measure chain‑of‑thought alignment
Compare reasoning traces across repeated runs. Are they stable? Do logical steps match expected domain reasoning?
- Experiment with hybrid pipelines
Pair LLMs with domain‑specific tools such as rule‑based engines, symbolic solvers, or purpose‑built code to plug reasoning gaps.
- Track failure modes over time
Use tooling and monitoring to record the complexity levels at which failures occur, and monitor token usage to catch “early quit” behaviour (a minimal tracking sketch follows this list).
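As a starting point for the last two steps, here is a minimal sketch of the kind of record‑keeping involved; the field names, bands and numbers are placeholder assumptions, not a standard. Log each run’s complexity band, outcome and reasoning‑token count, then watch for bands where accuracy collapses or traces suddenly shorten.

```python
# Minimal sketch of failure-mode tracking (field names, bands and numbers are
# placeholder assumptions). Summarise accuracy and average trace length per
# complexity band to spot collapse and "early quit" behaviour.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class RunRecord:
    complexity_band: str   # e.g. "low", "medium", "high" per your own banding
    solved: bool
    reasoning_tokens: int  # tokens spent on the reasoning trace

def summarise(runs):
    """Print per-band success rate and mean reasoning-token count."""
    by_band = defaultdict(list)
    for r in runs:
        by_band[r.complexity_band].append(r)
    for band, rs in sorted(by_band.items()):
        rate = sum(r.solved for r in rs) / len(rs)
        avg_tokens = sum(r.reasoning_tokens for r in rs) / len(rs)
        print(f"{band:>6}: {rate:4.0%} solved, avg {avg_tokens:6.0f} reasoning tokens ({len(rs)} runs)")

# Made-up numbers: high-complexity runs fail AND use fewer tokens --
# the "giving up" signature the research describes.
runs = [
    RunRecord("low", True, 350), RunRecord("low", True, 300),
    RunRecord("medium", True, 2200), RunRecord("medium", False, 2500),
    RunRecord("high", False, 900), RunRecord("high", False, 700),
]
summarise(runs)
```

In a real deployment these records would come from your orchestration layer or your model provider’s usage metadata rather than hand‑built objects, but the per‑band view is the same.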
Final Thoughts: Embracing the Illusion with Intention
Apple’s Illusion of Thinking isn’t a doom‑and‑gloom verdict; it’s an invitation to recalibrate. The paper forces us to grapple with whether the AI systems we build are genuinely reasoning or merely replaying thought‑like patterns. That distinction is critical as these systems take on increasingly mission‑critical roles in B2B processes.
We encourage C‑suite and board-level stakeholders to ask:
- How do we define and quantify ‘complex reasoning’ in our domain?
- What test frameworks do we have to assess LLM reasoning boundaries?
- Where do we need hybrid or fallback systems to ensure reliability?
Approached with clarity, this layered understanding yields smarter AI adoption rather than hype-driven risk.
Summary
Apple’s Illusion of Thinking study shows:
- LRMs help at medium complexity, but may underperform on low‑complexity tasks and collapse on high‑complexity ones.
- When problems become too hard, models spontaneously quit instead of reasoning deeper.
- They suffer from inconsistent logic and fail at pure algorithmic tasks.
- B2B users need bespoke testing frameworks addressing domain complexity and fail‑safe systems for edge‑case scenarios.
In the end, AI may not yet think like a human, but with thoughtful deployment we can still harness its strengths while hedging against its illusions.