Frontier language models now come packaged with something called “thinking.” Models like Claude 3.7 Sonnet (in its thinking mode), DeepSeek-R1, and OpenAI’s o-series generate detailed chains of reasoning before giving an answer. These large reasoning models (LRMs) simulate deliberation, reflection, and step-by-step logic. On paper, this looks like a leap toward general intelligence. In practice, things are more complicated.

A recent study out of Apple rigorously tests these models, and what it uncovers is both impressive and unsettling. It turns out that the ability to “think” in language models may be more cosmetic than structural. In other words: models talk like they’re reasoning, but they often aren’t.

This article walks through what the researchers found, why it matters, and what it means for the future of AI systems that claim to think.

Why Traditional Benchmarks Fall Short

Most reasoning evaluations rely on established math and coding benchmarks—datasets like MATH500 or AIME. These are useful, but problematic:

  • They suffer from data contamination. Many questions have leaked into pretraining corpora.

  • They only evaluate the final answer, not how the model arrived there.

  • They don’t let you vary task complexity in a controlled way.

To address this, the authors created clean, synthetic puzzle environments where problem complexity could be manipulated with surgical precision. These environments include:

  • Tower of Hanoi

  • Checker Jumping

  • River Crossing

  • Blocks World

Each puzzle isolates key aspects of algorithmic reasoning and planning while minimizing noise and external knowledge.
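
To make the setup concrete, here is a minimal Python sketch of what such a controllable environment could look like, using Tower of Hanoi as the example. The class and function names are illustrative assumptions, not the paper's actual harness; the point is that difficulty is a single knob (the number of disks, whose optimal solution length is 2^n - 1) and that a proposed solution can be checked move by move rather than by final answer alone.

```python
# A minimal sketch (not the paper's code) of a controllable puzzle environment.
# Complexity is a single parameter: the number of disks n, with optimal
# solution length 2**n - 1, so difficulty can be scaled precisely.

from typing import List, Tuple

Move = Tuple[int, int]  # (from_peg, to_peg)

class TowerOfHanoi:
    def __init__(self, n_disks: int):
        self.n = n_disks
        # Pegs hold disks as stacks; larger numbers are larger disks.
        self.pegs: List[List[int]] = [list(range(n_disks, 0, -1)), [], []]

    def is_legal(self, move: Move) -> bool:
        src, dst = move
        if not self.pegs[src]:
            return False
        # A disk may only be placed on an empty peg or on a larger disk.
        return not self.pegs[dst] or self.pegs[src][-1] < self.pegs[dst][-1]

    def apply(self, move: Move) -> bool:
        if not self.is_legal(move):
            return False
        src, dst = move
        self.pegs[dst].append(self.pegs[src].pop())
        return True

    def is_solved(self) -> bool:
        return len(self.pegs[2]) == self.n

def score_solution(n_disks: int, moves: List[Move]) -> bool:
    """Replay a model-proposed move list and check it actually solves the puzzle."""
    env = TowerOfHanoi(n_disks)
    return all(env.apply(m) for m in moves) and env.is_solved()
```

Because the environment replays every move, it can grade not just whether the final state is correct but where a solution goes wrong, which is exactly the kind of signal final-answer benchmarks throw away.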

The Three Regimes of Model Behavior

When these models were tested across increasing levels of problem complexity, three distinct performance regimes emerged:

Low Complexity

At the simplest levels, standard language models (without “thinking”) outperformed their reasoning-enhanced counterparts. They were faster, more accurate, and used fewer tokens. LRMs, on the other hand, tended to over-elaborate, consuming more resources without better outcomes.

Medium Complexity

This is where LRMs showed their strength. As the tasks became compositionally deeper, models with structured thinking traces started to outperform standard models. They needed more tokens, but they reached higher accuracy.

High Complexity

Eventually, both kinds of models collapsed. Accuracy dropped to zero. But something even more curious happened: the LRMs started thinking less. That is, as complexity increased, they reduced their token usage, even though they hadn’t hit any compute or context limits. The effort stopped scaling with the challenge.

This paradox—reasoning effort falling off just as complexity peaks—is at the heart of the paper’s claim: today’s reasoning models hit a ceiling not because they run out of room, but because they run out of strategies.
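
These regimes become visible once you sweep a single complexity knob and record both accuracy and token usage for a thinking and a non-thinking configuration. The sketch below shows the shape of such an experiment; query_model is a hypothetical stand-in for whatever API client you use, and score_solution is the validator from the earlier sketch.

```python
# Hypothetical sweep over problem complexity, recording accuracy and token
# usage for a "thinking" and a "standard" model configuration.

from dataclasses import dataclass

@dataclass
class Result:
    n_disks: int
    mode: str          # "thinking" or "standard"
    accuracy: float
    mean_tokens: float

def query_model(prompt: str, thinking: bool) -> tuple[list[tuple[int, int]], int]:
    """Hypothetical client: returns (proposed move list, tokens generated)."""
    raise NotImplementedError

def sweep(max_disks: int = 12, trials: int = 25) -> list[Result]:
    results = []
    for n in range(3, max_disks + 1):
        for thinking in (False, True):
            correct, tokens = 0, 0
            for _ in range(trials):
                prompt = f"Solve Tower of Hanoi with {n} disks. List the moves."
                moves, used = query_model(prompt, thinking=thinking)
                correct += score_solution(n, moves)   # validator from the earlier sketch
                tokens += used
            results.append(Result(
                n_disks=n,
                mode="thinking" if thinking else "standard",
                accuracy=correct / trials,
                mean_tokens=tokens / trials,
            ))
    return results
```

Plotting accuracy against the number of disks is what exposes the crossover and the joint collapse; plotting mean tokens is what exposes the decline in reasoning effort discussed below.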

Overthinking and Underperforming

One of the most revealing analyses looked inside the reasoning traces. When models were solving simple puzzles, they often found the right answer early in the thought process—then kept going, exploring incorrect alternatives and eventually derailing the solution. This is the overthinking effect: models continue to generate plausible but irrelevant reasoning after having already arrived at a valid answer.

In medium-difficulty tasks, the reverse occurred. Models often wandered through a space of bad ideas before finally stumbling onto the correct one. And in complex tasks, they simply never found the right path.

Across all levels, models demonstrated limited self-correction. When a reasoning path went wrong, the models rarely recovered. They committed to a line of thinking and followed it through, regardless of outcome.
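
One way to quantify this, assuming candidate answers can be parsed out of intermediate points in a trace, is to replay each candidate against the environment and record where the first correct one appears. The parser below is a placeholder assumption; the paper's own trace analysis is more involved.

```python
# Sketch: locate where in a reasoning trace the first valid solution appears.
# extract_candidates stands in for whatever parsing pulls candidate move lists
# (with their relative positions) out of a model's thinking text.

def extract_candidates(trace: str) -> list[tuple[float, list[tuple[int, int]]]]:
    """Hypothetical parser: returns (relative_position_in_trace, move_list) pairs."""
    raise NotImplementedError

def first_correct_position(trace: str, n_disks: int) -> float | None:
    """Relative position (0.0 to 1.0) of the first correct candidate, or None."""
    for position, moves in extract_candidates(trace):
        if score_solution(n_disks, moves):   # validator from the earlier sketch
            return position
    return None
```

On easy puzzles this position tends to sit early in the trace, with the rest of the "thinking" spent drifting away from it; on hard puzzles it never appears at all.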

Even With the Right Algorithm, Models Fail

Perhaps the most surprising finding came when the researchers explicitly gave the models the correct algorithm. In the Tower of Hanoi puzzle, for instance, they were handed a recursive pseudocode solution to execute. Yet even then, model performance collapsed at roughly the same complexity levels as before.

In other words, the problem wasn’t just discovering the solution—it was executing a known, step-by-step plan. This suggests that today’s reasoning models still lack the ability to reliably carry out even simple procedural logic.
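
For context, the kind of recursive procedure the models were handed is nothing exotic; in Python it fits in a few lines. The exact pseudocode used in the paper's prompts may differ, but the algorithm is the standard one.

```python
def hanoi(n: int, src: int = 0, aux: int = 1, dst: int = 2) -> list[tuple[int, int]]:
    """Standard recursive Tower of Hanoi: returns the optimal move sequence."""
    if n == 0:
        return []
    return (
        hanoi(n - 1, src, dst, aux)   # move the n-1 smaller disks out of the way
        + [(src, dst)]                # move the largest disk
        + hanoi(n - 1, aux, src, dst) # move the n-1 smaller disks back on top
    )

# hanoi(3) yields 7 moves; hanoi(10) yields 1023, each one mechanically determined.
```

Executing this faithfully is pure bookkeeping, which is what makes the collapse telling: the models were not failing to find the plan, they were failing to follow it.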

This aligns with broader questions in AI research: how well can these models simulate symbolic manipulation? Can they follow formal logical steps consistently? The evidence here says: not yet.

Failure Patterns Tell Their Own Story

The study also analyzed the first point of failure in model-generated solution sequences. Some interesting patterns emerged:

  • Failure often occurred earlier in high-complexity tasks than in medium-complexity ones, despite longer expected solution lengths.

  • Non-thinking models sometimes failed later than thinking models—suggesting that verbose reasoning doesn’t always improve execution.

  • Distribution of failure points was inconsistent, with high variance even within the same complexity level.

This further underscores that model “thinking” is not inherently robust. It’s brittle, sensitive to task structure, and often incoherent under stress.
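
Measuring the first point of failure is mechanical once you have an environment like the sketch above: replay the model's proposed moves and record the index of the first illegal one. A minimal version, under the same assumptions as before:

```python
def first_failure_index(n_disks: int, moves: list[tuple[int, int]]) -> int | None:
    """Index of the first illegal move in a proposed solution, or None if all moves are legal."""
    env = TowerOfHanoi(n_disks)   # environment from the earlier sketch
    for i, move in enumerate(moves):
        if not env.apply(move):
            return i
    return None
```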

Scaling Limits in Reasoning Effort

As puzzle complexity increased, reasoning models initially ramped up their token usage. But after a certain threshold, they started scaling down, using fewer tokens even as the task became harder. This is counterintuitive.

It wasn’t due to token limits or compute caps; the models were operating far below their budget. Rather, it appears the models implicitly learned to abandon reasoning when the task became too difficult. It’s a kind of learned helplessness, where the model gives up early rather than risk wasting effort.
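
Measuring that decline only requires counting the tokens in the reasoning portion of each response. The sketch below assumes the trace is delimited by <think> tags, as in DeepSeek-R1's output format; other models expose the trace through separate API fields instead, and the tokenizer is whichever one matches the model.

```python
# Sketch: count reasoning tokens per response, assuming the trace is wrapped
# in <think> tags (true for DeepSeek-R1; other APIs return the trace separately).
import re

def reasoning_token_count(response: str, tokenizer) -> int:
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    reasoning = match.group(1) if match else ""
    return len(tokenizer.encode(reasoning))
```

Averaging this count per complexity level is what produces the rise-then-fall curve the paper describes.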

This scaling limit has profound implications. It suggests that current training approaches and architectures are fundamentally misaligned with the structure of complex reasoning.

Implications for AI Development

This research reframes what it means for a model to “reason.” It’s not enough to produce long explanations, self-reflections, or neat chains of logic. What’s needed is coherent, effective, and resilient problem-solving—especially as tasks scale in complexity.

If you’re building AI agents, tutoring systems, or autonomous planners, these findings are a caution flag. The illusion of thinking is compelling—but real-world reasoning requires more than verbal smoke.

This paper makes it clear: we need better foundations, not just longer thoughts.

Where We Go From Here

The work raises important open questions:

  • Can reinforcement learning from feedback actually produce generalizable reasoning strategies?

  • Are new architectural innovations required to handle procedural execution?

  • How do we build systems that don’t just think about solving problems—but actually solve them?

Until we answer these, reasoning in LLMs may remain performative—a convincing imitation of thought that collapses under pressure.

For those who want to dive deeper, the full paper is here: The Illusion of Thinking
