Ruchir Kulkarni

AI Product Engineer building agentic systems & LLM-based applications

4min read

Are AI Reasoning Models Really “Thinking”?

Why This Matters

Large language models (LLMs) are now often prompted or trained to use chain-of-thought (CoT) reasoning: they generate step-by-step traces before answering. Early results showed CoT could boost performance on math, logic, and planning. But many standard benchmarks may overlap with the models’ training data, so we can’t tell if models truly reason or simply reproduce memorized patterns.

Apple’s research team took a fresh approach. They used four classic puzzles with adjustable difficulty and exact simulators that can check every move. This lets them measure reasoning ability at controlled complexity while keeping data contamination to a minimum.

Link to Paper: https://machinelearning.apple.com/research/illusion-of-thinking

The Four Puzzles

  1. Tower of Hanoi
    Move disks between three pegs, one at a time. Rule: never place a larger disk on a smaller one.

  2. Checker Jumping
    Swap the positions of two groups of tokens laid out on a board with one empty spot. Rule: a token moves by sliding into an adjacent empty spot or by jumping over an adjacent token into an empty spot.

  3. River Crossing
    Ferry characters across a river without breaking safety rules (e.g., wolf can’t be left with goat).

  4. Blocks World
    Restack blocks on a table to match a goal arrangement.
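
To make the idea of a puzzle simulator concrete, here is a minimal sketch of a move checker for Tower of Hanoi. It is an illustration only, not the researchers’ code: the function name `validate_hanoi`, the peg numbering (0 to 2), and the `(source, target)` move format are all assumptions made for this example. Difficulty is adjusted through a single knob, the number of disks.

```python
from typing import List, Tuple

# Hypothetical move format: (source peg, target peg), pegs numbered 0..2.
Move = Tuple[int, int]


def validate_hanoi(num_disks: int, moves: List[Move]) -> bool:
    """Return True if `moves` legally transfers every disk from peg 0 to peg 2."""
    # Each peg is a stack; the largest disk (num_disks) starts at the bottom of peg 0.
    pegs = [list(range(num_disks, 0, -1)), [], []]
    for src, dst in moves:
        if src not in (0, 1, 2) or dst not in (0, 1, 2) or not pegs[src]:
            return False  # unknown peg, or no disk to pick up
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # rule broken: larger disk on a smaller one
        pegs[dst].append(pegs[src].pop())
    # Solved only if the full tower now sits on peg 2 in the right order.
    return pegs[2] == list(range(num_disks, 0, -1))


if __name__ == "__main__":
    print(validate_hanoi(2, [(0, 1), (0, 2), (1, 2)]))  # legal three-move solution
    print(validate_hanoi(2, [(0, 2), (0, 2)]))          # second move breaks the rule
```

Because a checker like this replays every move, a proposed answer can be judged step by step rather than only by its final line, which is what makes puzzles of this kind convenient for measuring reasoning at a chosen difficulty.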

Model Comparison

Researchers tested two styles of LLMs on these puzzles:

  • Direct Models: They see the puzzle and the question, then output a final answer.

  • Chain-of-Thought Models: They first generate a reasoning trace (many “thought” tokens), then give the answer.

Both types ran on hundreds or thousands of puzzle instances, at varying levels of difficulty. The team measured accuracy, token usage, and the content of CoT traces.

Three Phases of Performance

As puzzle complexity grows, models show three distinct behaviors:

  • Easy Tasks
    Direct models win. CoT models waste time “thinking” step by step and perform worse.

  • Medium Difficulty
    CoT models pull ahead. Extra reasoning tokens help solve puzzles that stump direct models.

  • High Complexity
    Both types collapse to 0% accuracy. Even with huge token budgets, no model solves the hardest versions.

This reveals a surprising ceiling: more “thinking” doesn’t push models past a certain threshold.

Peeking Inside the Thought Trace

The team didn’t stop at accuracy. They analyzed CoT traces to see what “thinking” really looks like:

  • Overthinking: Models often hit the correct intermediate move early, then continue exploring wrong paths. That wastes compute and can lead to failure.

  • Token Budget: As puzzles grow harder, models initially ramp up token usage. But near the failure point they use fewer tokens: they effectively give up before exploring all options.

These patterns hold across different LLM architectures and sizes.

Even with a Recipe, They Fail

To test if hints help, researchers provided the exact algorithm for each puzzle in the prompt. You’d think this would boost performance. It didn’t. Models still hit the same difficulty ceiling. The hard limit remains, even when they know the rules in detail.
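
For a sense of what such a recipe looks like, here is the textbook recursive algorithm for Tower of Hanoi. This is a sketch under assumptions: the exact wording of the prompts used in the study is not reproduced here, and the function name `solve_hanoi` and the 0 to 2 peg numbering are chosen for illustration.

```python
from typing import List, Tuple


def solve_hanoi(n: int, src: int = 0, aux: int = 1, dst: int = 2) -> List[Tuple[int, int]]:
    """Return the list of (source peg, target peg) moves that shifts n disks from src to dst."""
    if n == 0:
        return []
    return (
        solve_hanoi(n - 1, src, dst, aux)    # park the n-1 smaller disks on the spare peg
        + [(src, dst)]                       # move the largest disk to its destination
        + solve_hanoi(n - 1, aux, src, dst)  # bring the smaller disks back on top of it
    )


if __name__ == "__main__":
    for n in (3, 10):
        # The recursion emits 2**n - 1 moves for n disks.
        print(n, len(solve_hanoi(n)))
```

The procedure fits in a few lines, but the number of moves it produces doubles with every added disk (2^n - 1 for n disks), so carrying it out means emitting a long sequence without a single slip.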

Final Thoughts

  • Chain-of-Thought Isn’t Real Reasoning
    CoT helps a bit on medium tasks, but it’s not genuine logic. It’s pattern matching over extended tokens.

  • Fundamental Limits Exist
    LLMs fail completely on sufficiently hard, rule-based puzzles, even with larger token budgets or explicit algorithms.

  • Overthinking Can Hurt
    Extra reasoning can waste tokens and reduce accuracy on easier tasks.

Stay tuned to rKive for more deep dives into cutting-edge AI and security research. We’ll track new innovations, break them down simply, and highlight real breakthroughs, so you never miss a beat!

Link to my medium profile: https://medium.com/@ruchirkulkarni/
