The Illusion of Thinking: Why Apple’s Findings Hold True for AI Code Reviews

Recent research has cast new light on the limitations of modern AI “reasoning” models. Apple’s 2025 paper The Illusion of Thinking shows that today’s Large Reasoning Models (LRMs) – LLMs that generate chain-of-thought or “thinking” steps – often fail on complex problems. In controlled puzzle experiments, frontier LRMs exhibited a complete accuracy collapse beyond a complexity threshold: past a certain level of difficulty, their accuracy drops to essentially zero. Equally striking is their counter-intuitive effort scaling: LRMs ramp up their chain-of-thought as a problem grows harder, but only up to a point. Beyond that, they effectively give up – even when the token budget remains ample, their reasoning traces abruptly shrink. These findings suggest a fundamental gap: LRMs do not truly “think” in a scalable way; they pattern-match up to modest complexity and then fail.

Apple’s experiments also delineated three performance regimes for LRM-based reasoning:

  • Low complexity: On simple tasks, standard LLMs without explicit chain-of-thought often outperform specialized LRMs. The overhead of reasoning actually hurts performance on trivial problems.
  • Medium complexity: In intermediate tasks, LRMs’ chain-of-thought provides some advantage, with step-by-step reasoning sometimes finding the right answer when a plain LLM misses it.
  • High complexity: Beyond a threshold of intricacy, both standard LLMs and LRMs “experience complete collapse” in accuracy. Neither approach can reliably solve very complex puzzles.

Crucially, Apple found that LRMs fail to execute explicit algorithms even when the algorithm is handed to them. They often wander through inconsistent steps rather than applying a logical procedure. This means a reasoning-trained model might not actually follow a textbook algorithm for a coding or math problem, leading to bizarre or incorrect intermediate steps. In short, LRMs struggle with exact computation and consistency, whereas humans methodically apply rules.

Human vs. LLM Reasoning in Code Review

In practice, code review demands rich human reasoning beyond raw code inspection. Human reviewers draw on architectural knowledge, product intent, team conventions, and historical context when evaluating a change. They understand why a module was designed a certain way, how it fits into the larger system, and what business requirements underlie it. By contrast, an LLM sees only the text in front of it. Empirical studies note that AI “lacks deep contextual understanding”; it can flag pattern-based issues but cannot grasp the “bigger picture” of a project. For example, an LLM might mark a function as inefficient without realizing it was written that way to meet a specific business or legacy requirement. Similarly, AI can catch small code smells or naming issues, but struggles to judge whether an entire module needs refactoring for maintainability or whether it violates a high-level architecture principle.

Real-world code review often depends on context not visible in the diff. As one Google Cloud engineering blog explains, a reviewer typically needs more than just the line changes – they must incorporate “knowledge about the broader code base and environment”. A change to a function might depend on how earlier commits defined data structures or on settings in the deployment pipeline. Similarly, design documents, Jira tickets, and team conventions all influence whether a change is correct. LLMs, however, have a limited context window and no built-in awareness of project artifacts. As one analysis notes, “AI lacks deep contextual understanding” and may misinterpret unusual code patterns if it doesn’t know the intent or system constraints.

In short, humans connect the dots across documentation, history, and team norms; LLMs do not.

Simple vs. Complex Review Tasks

In practice, this gap shows up as a stark divide between simple and complex review tasks. Simple checks – formatting, linting rules, obvious syntax bugs, or common vulnerability patterns – are well within the grasp of current LLMs. AI tools can automatically reformat code, catch missing semicolons, spot typos in variable names or obvious off-by-one bugs, and even enforce style guidelines. These are the “easy” part of review, akin to running a linter or static analyzer at super-speed. In fact, studies show AI tools excel at pattern-based issues and can process thousands of lines instantly.

By contrast, high-context issues typically stump a bare LLM. Consider tasks like assessing whether a new database query fits the system’s performance budget, or if an API change matches the team’s design intent. These require knowledge of database indexing, expected load, or long-term project goals – information not present in the code snippet alone. Similarly, verifying security or compliance often needs an understanding of company policies or regulatory docs. An LLM without help may flag an unconventional function as “inefficient” or “suspicious” even if it was deliberately optimized for a special case or implemented to comply with an external standard.

For these complex reviews, we find that additional context is essential. In practice, the most effective AI review systems fetch extra data: they surface relevant architecture diagrams, requirement tickets, technical design docs, and the team’s code history. For example, pull request metadata like Jira or Confluence tickets can tell the model why the change was requested. One empirical study of LLM-assisted review built prototypes that automatically gather all related documents (code diffs, source files, issue tracker entries) via retrieval techniques. By providing this enriched context, the model could make more informed suggestions. In other words, complex reviews need LLMs to be surrounded by a knowledge graph of project-specific information; without it, the LLM’s “reasoning” is unreliable.
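To make this concrete, here is a minimal sketch of what “enriched context” can look like. The fetch_linked_ticket and fetch_design_notes helpers are placeholders for whatever Jira/Confluence or documentation connectors a real system would use; the point is simply that the prompt pairs the diff with the intent behind it.

```python
from dataclasses import dataclass

@dataclass
class Ticket:
    key: str
    summary: str

def fetch_linked_ticket(pr_description: str) -> Ticket:
    # Placeholder: a real implementation would parse the PR description for a
    # ticket key and query the issue tracker's API.
    return Ticket(key="PROJ-123", summary="Keep legacy XML clients working until 2026")

def fetch_design_notes(diff: str) -> str:
    # Placeholder: a real implementation would retrieve design docs related to
    # the files touched by the diff.
    return "The parser deliberately avoids streaming APIs for auditability."

def build_review_prompt(diff: str, pr_description: str) -> str:
    """Pair the raw diff with the context a human reviewer would already have."""
    ticket = fetch_linked_ticket(pr_description)
    notes = fetch_design_notes(diff)
    return "\n\n".join([
        "You are reviewing a pull request. Read the context before judging the code.",
        f"Requirement ticket {ticket.key}: {ticket.summary}",
        f"Design notes: {notes}",
        f"Diff:\n{diff}",
        "Flag correctness, performance, and design concerns, citing the context above.",
    ])
```

With this framing, the model is far less likely to flag a deliberately unusual pattern as a defect, because the reason for it is sitting next to the diff.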

The Architecture of an AI Code Reviewer

These observations underline an important lesson: the best AI code review systems are more than just a large model. In my experience building such tools, success comes from engineering the surrounding context and tooling. A robust assistant typically uses a retrieval-augmented pipeline: before calling the LLM, it searches project artifacts for relevant snippets. For example, a semantic search might pull up past commit messages (which often explain why code changed), design docs, or API specifications related to the PR. Embedding commit history into the model’s knowledge base can allow “time-travel” queries like “Why was XML support removed?” by retrieving the commit discussion where that change happened.
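As an illustration, the “time-travel” idea can be sketched with any off-the-shelf embedding model. The snippet below assumes the sentence-transformers library and a handful of hard-coded commit messages; the same pattern applies to a real commit log backed by a vector store.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Commit messages often explain *why* code changed; embed them once, then query.
commits = [
    "Remove XML support; all consumers have migrated to JSON",
    "Add retry logic to the payment webhook handler",
    "Bump parser dependency to patch a reported CVE",
]
commit_vecs = model.encode(commits, normalize_embeddings=True)

def why(question: str, k: int = 1) -> list[str]:
    """Return the k commit messages most relevant to the question."""
    q = model.encode([question], normalize_embeddings=True)[0]
    scores = commit_vecs @ q  # cosine similarity on unit vectors
    return [commits[i] for i in np.argsort(scores)[::-1][:k]]

print(why("Why was XML support removed?"))
# Expected to surface the commit that removed XML support.
```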

Similarly, specialized static analysis tools are often integrated. A code review assistant might run linters, type checkers, or security scanners alongside the LLM. These tools can handle syntax and known patterns perfectly, leaving the LLM free to focus on higher-level issues. In practice, a high-quality system might operate in stages: first gathering context (diffs, docs, tickets), then running deterministic analyses (lint, format, tests), and finally invoking the LLM with a prompt that includes the curated context and specific questions. These pipelines can also include iterative steps (as agents) where the model can ask for more information or break a problem into sub-queries.
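A stripped-down version of that staged pipeline might look like the following. The linter choice (ruff) and the call_llm stub are assumptions standing in for whatever deterministic tools and model API a given team uses.

```python
import subprocess

def run_deterministic_checks(paths: list[str]) -> str:
    """Stage 2: let classical tools handle syntax and known patterns."""
    result = subprocess.run(["ruff", "check", *paths], capture_output=True, text=True)
    return result.stdout or "No linter findings."

def call_llm(prompt: str) -> str:
    # Placeholder: a real system would call its chosen model API here.
    return f"[model response to a {len(prompt)}-character prompt]"

def review(diff: str, changed_files: list[str], project_context: str) -> str:
    """Stage 1 (context gathering) is assumed to happen upstream and be passed in."""
    lint_report = run_deterministic_checks(changed_files)
    prompt = "\n\n".join([
        "Review this change for design, correctness, and maintainability.",
        f"Project context:\n{project_context}",
        f"Deterministic findings (already verified, do not repeat):\n{lint_report}",
        f"Diff:\n{diff}",
    ])
    return call_llm(prompt)  # Stage 3: the LLM sees curated context, not the whole repo
```

Keeping the mechanical findings out of the LLM’s job both reduces noise and frees its limited context window for the judgments only it can make.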

Design-wise, a few principles emerge from both the literature and our experience:

  • Limit and structure the context: Because LLMs have finite context windows, an assistant often splits the review into chunks or filters by relevance. For example, it may select only files or code sections affected by the change, rather than feeding the entire repo. Embedding tools (like Faiss or Milvus) can index code by meaning so the model can pull just the relevant pieces. Effective systems also use metadata filters (e.g. by directory or component) to zero in on the right area of the codebase, as in the sketch after this list.
  • Leverage domain docs and tickets: Architecture decision records, design docs, and issue descriptions contain critical hints about intent. Including these in the prompt – for instance, by summarizing a linked Jira ticket or quoting a design specification – can ground the model’s comments. Research suggests that developers value having “requirement tickets” in the model’s context, so building connectors to Jira or Confluence is key.
  • Embed problem-specific logic: Some checks are best hard-coded. For example, a system might have a library of corporate coding standards, API usage rules, or security policies that it explicitly applies (like a static analysis rule). The LLM can then handle only the questions these rules can’t easily encode (such as proposing an alternative algorithm).
  • Iterative clarification: Unlike a single-turn query, a good assistant can ask the engineer follow-up questions or deliver its findings in stages. For instance, it might identify an ambiguous change and prompt the author to clarify the intended behavior, mimicking a human review dialogue.
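Putting the first principle into practice, the sketch below indexes code chunks with an embedding model, keeps simple metadata (path, component) beside each vector, and retrieves only chunks from the affected component. It uses brute-force scoring over toy data; a vector index such as Faiss or Milvus would take its place at scale.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Toy corpus: in a real system these chunks would come from the repository,
# with metadata recorded at indexing time.
chunks = [
    {"path": "billing/invoice.py", "component": "billing", "text": "def total(items): ..."},
    {"path": "billing/tax.py",     "component": "billing", "text": "def apply_vat(amount): ..."},
    {"path": "auth/session.py",    "component": "auth",    "text": "def refresh_token(user): ..."},
]
vectors = model.encode([c["text"] for c in chunks], normalize_embeddings=True)

def relevant_chunks(change_summary: str, component: str, k: int = 2) -> list[dict]:
    """Filter by metadata first, then rank the survivors by semantic similarity."""
    idx = [i for i, c in enumerate(chunks) if c["component"] == component]
    q = model.encode([change_summary], normalize_embeddings=True)[0]
    scores = vectors[idx] @ q
    ranked = [idx[i] for i in np.argsort(scores)[::-1][:k]]
    return [chunks[i] for i in ranked]

context = relevant_chunks("change VAT rounding in invoice totals", component="billing")
```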

These architectural measures are necessary because, as Apple’s paper warns, LRMs on their own do not self-scale with complexity. Without such scaffolding, an LLM’s “reasoning” can devolve into gibberish as the task grows. As one Hugging Face analysis puts it, after a certain difficulty “the model had room to think, but stopped doing so” – likely because its internal heuristic search got stuck in local minima. This is why simply using the largest LLM or most reasoning-optimized model is insufficient. Instead, effective review assistants recognize the illusion of self-contained reasoning and explicitly supply context and computation where needed.

Conclusion

In sum, the illusion of thinking refers to the gap between appearance and reality in AI reasoning. Large language models can mimic thoughtful analysis on simple tasks, but they do not genuinely understand or follow complex logic the way humans do. Apple’s findings and our practical experience both highlight that LLMs fail dramatically beyond a certain complexity, often producing inconsistent or shallow chains of thought. In the domain of code review, this means LLMs excel at low-level checks but falter on high-context judgments – unless we carefully architect around them.

For tool builders and AI practitioners, the takeaway is clear: Focus on system design, not just model choice. A robust AI code reviewer must integrate knowledge sources (architecture docs, tickets, commit history), use classical analysis tools, and manage context intelligently. When built this way, LLMs become powerful assistants rather than brittle “thinkers.” By combining model-driven insights with human-like context-awareness, we can move beyond the illusion and create code review systems that truly enhance developer workflows.
