The Illusion of Thinking: Why Apple’s Findings Hold True for AI Code Reviews

Recent research has cast new light on the limitations of modern AI “reasoning” models. Apple’s 2025 paper The Illusion of Thinking shows that today’s Large Reasoning Models (LRMs) – LLMs that generate chain-of-thought or “thinking” steps – often fail on complex problems. In controlled puzzle experiments, frontier LRMs exhibited a complete accuracy collapse beyond a complexity threshold: past a certain level of difficulty, their accuracy drops to essentially zero. Equally striking is their counter-intuitive effort scaling: LRMs ramp up their chain-of-thought as a problem grows harder, but only up to a point. Beyond that, they effectively give up – even when the token budget remains ample, their reasoning traces abruptly shrink. These findings suggest a fundamental gap: LRMs do not truly “think” in a scalable way; they pattern-match up to modest complexity and then fail.

Apple’s experiments also delineated three performance regimes for LRM-based reasoning:

  • Low complexity: On simple tasks, standard LLMs without explicit chain-of-thought often outperform specialized LRMs. The overhead of reasoning actually hurts performance on trivial problems.
  • Medium complexity: In intermediate tasks, LRMs’ chain-of-thought provides some advantage, with step-by-step reasoning sometimes finding the right answer when a plain LLM misses it.
  • High complexity: Beyond a threshold of intricacy, both standard LLMs and LRMs “experience complete collapse” in accuracy. Neither approach can reliably solve very complex puzzles.

Crucially, Apple found that LRMs fail to execute explicit algorithms even when the algorithm is handed to them. They often wander through inconsistent steps rather than applying a logical procedure. This means a reasoning-trained model might not actually follow a textbook algorithm for a coding or math problem, leading to bizarre or incorrect intermediate steps. In short, LRMs struggle with exact computation and consistency, whereas humans methodically apply rules.

Human vs. LLM Reasoning in Code Review

In practice, code review demands rich human reasoning beyond raw code inspection. Human reviewers draw on architectural knowledge, product intent, team conventions, and historical context when evaluating a change. They understand why a module was designed a certain way, how it fits into the larger system, and what business requirements underlie it. By contrast, an LLM sees only the text in front of it. Empirical studies note that AI “lacks deep contextual understanding”; it can flag pattern-based issues but cannot grasp the “bigger picture” of a project. For example, an LLM might mark a function as inefficient without realizing it was written that way to meet a specific business or legacy requirement. Similarly, AI can catch small code smells or naming issues, but struggles to judge whether an entire module needs refactoring for maintainability or whether it violates a high-level architecture principle.

Real-world code review often depends on context not visible in the diff. As one Google Cloud engineering blog explains, a reviewer typically needs more than just the line changes – they must incorporate “knowledge about the broader code base and environment”. A change to a function might depend on how earlier commits defined data structures or on settings in the deployment pipeline. Similarly, design documents, Jira tickets, and team conventions all influence whether a change is correct. LLMs, however, have a limited context window and no built-in awareness of project artifacts. As one analysis notes, “AI lacks deep contextual understanding” and may misinterpret unusual code patterns if it doesn’t know the intent or system constraints.

In short, humans connect the dots across documentation, history, and team norms; LLMs do not.

Simple vs. Complex Review Tasks

In practice, this gap shows up as a stark divide between simple and complex review tasks. Simple checks – formatting, linting rules, obvious syntax bugs, or common vulnerability patterns – are well within the grasp of current LLMs. AI tools can automatically reformat code, catch missing semicolons, spot typos in variable names or obvious off-by-one bugs, and even enforce style guidelines. These are the “easy” part of review, akin to running a linter or static analyzer at super-speed. In fact, studies show AI tools excel at pattern-based issues and can process thousands of lines instantly.

By contrast, high-context issues typically stump a bare LLM. Consider tasks like assessing whether a new database query fits the system’s performance budget, or if an API change matches the team’s design intent. These require knowledge of database indexing, expected load, or long-term project goals – information not present in the code snippet alone. Similarly, verifying security or compliance often needs an understanding of company policies or regulatory docs. An LLM without help may flag an unconventional function as “inefficient” or “suspicious” even if it was deliberately optimized for a special case or implemented to comply with an external standard.

For these complex reviews, we find that additional context is essential. In practice, the most effective AI review systems fetch extra data: they surface relevant architecture diagrams, requirement tickets, technical design docs, and the team’s code history. For example, pull request metadata like Jira or Confluence tickets can tell the model why the change was requested. One empirical study of LLM-assisted review built prototypes that automatically gather all related documents (code diffs, source files, issue tracker entries) via retrieval techniques. By providing this enriched context, the model could make more informed suggestions. In other words, complex reviews need LLMs to be surrounded by a knowledge graph of project-specific information; without it, the LLM’s “reasoning” is unreliable.
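To make this concrete, here is a minimal sketch of what “enriched context” can look like. The fetch_linked_ticket and fetch_design_notes helpers are placeholders for whatever Jira/Confluence or documentation connectors a real system would use; the point is simply that the prompt pairs the diff with the intent behind it.

```python
from dataclasses import dataclass

@dataclass
class Ticket:
    key: str
    summary: str

def fetch_linked_ticket(pr_description: str) -> Ticket:
    # Placeholder: a real implementation would parse the PR description for a
    # ticket key and query the issue tracker's API.
    return Ticket(key="PROJ-123", summary="Keep legacy XML clients working until 2026")

def fetch_design_notes(diff: str) -> str:
    # Placeholder: a real implementation would retrieve design docs related to
    # the files touched by the diff.
    return "The parser deliberately avoids streaming APIs for auditability."

def build_review_prompt(diff: str, pr_description: str) -> str:
    """Pair the raw diff with the context a human reviewer would already have."""
    ticket = fetch_linked_ticket(pr_description)
    notes = fetch_design_notes(diff)
    return "\n\n".join([
        "You are reviewing a pull request. Read the context before judging the code.",
        f"Requirement ticket {ticket.key}: {ticket.summary}",
        f"Design notes: {notes}",
        f"Diff:\n{diff}",
        "Flag correctness, performance, and design concerns, citing the context above.",
    ])
```

With this framing, the model is far less likely to flag a deliberately unusual pattern as a defect, because the reason for it is sitting next to the diff.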

The Architecture of an AI Code Reviewer

These observations underline an important lesson: the best AI code review systems are more than just a large model. In my experience building such tools, success comes from engineering the surrounding context and tooling. A robust assistant typically uses a retrieval-augmented pipeline: before calling the LLM, it searches project artifacts for relevant snippets. For example, a semantic search might pull up past commit messages (which often explain why code changed), design docs, or API specifications related to the PR. Embedding commit history into the model’s knowledge base can allow “time-travel” queries like “Why was XML support removed?” by retrieving the commit discussion where that change happened.
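As an illustration, the “time-travel” idea can be sketched with any off-the-shelf embedding model. The snippet below assumes the sentence-transformers library and a handful of hard-coded commit messages; the same pattern applies to a real commit log backed by a vector store.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Commit messages often explain *why* code changed; embed them once, then query.
commits = [
    "Remove XML support; all consumers have migrated to JSON",
    "Add retry logic to the payment webhook handler",
    "Bump parser dependency to patch a reported CVE",
]
commit_vecs = model.encode(commits, normalize_embeddings=True)

def why(question: str, k: int = 1) -> list[str]:
    """Return the k commit messages most relevant to the question."""
    q = model.encode([question], normalize_embeddings=True)[0]
    scores = commit_vecs @ q  # cosine similarity on unit vectors
    return [commits[i] for i in np.argsort(scores)[::-1][:k]]

print(why("Why was XML support removed?"))
# Expected to surface the commit that removed XML support.
```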

Similarly, specialized static analysis tools are often integrated. A code review assistant might run linters, type checkers, or security scanners alongside the LLM. These tools can handle syntax and known patterns perfectly, leaving the LLM free to focus on higher-level issues. In practice, a high-quality system might operate in stages: first gathering context (diffs, docs, tickets), then running deterministic analyses (lint, format, tests), and finally invoking the LLM with a prompt that includes the curated context and specific questions. These pipelines can also include iterative steps (as agents) where the model can ask for more information or break a problem into sub-queries.
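A stripped-down version of that staged pipeline might look like the following. The linter choice (ruff) and the call_llm stub are assumptions standing in for whatever deterministic tools and model API a given team uses.

```python
import subprocess

def run_deterministic_checks(paths: list[str]) -> str:
    """Stage 2: let classical tools handle syntax and known patterns."""
    result = subprocess.run(["ruff", "check", *paths], capture_output=True, text=True)
    return result.stdout or "No linter findings."

def call_llm(prompt: str) -> str:
    # Placeholder: a real system would call its chosen model API here.
    return f"[model response to a {len(prompt)}-character prompt]"

def review(diff: str, changed_files: list[str], project_context: str) -> str:
    """Stage 1 (context gathering) is assumed to happen upstream and be passed in."""
    lint_report = run_deterministic_checks(changed_files)
    prompt = "\n\n".join([
        "Review this change for design, correctness, and maintainability.",
        f"Project context:\n{project_context}",
        f"Deterministic findings (already verified, do not repeat):\n{lint_report}",
        f"Diff:\n{diff}",
    ])
    return call_llm(prompt)  # Stage 3: the LLM sees curated context, not the whole repo
```

Keeping the mechanical findings out of the LLM’s job both reduces noise and frees its limited context window for the judgments only it can make.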

Design-wise, a few principles emerge from both the literature and our experience:

  • Limit and structure the context: Because LLMs have finite context windows, an assistant often splits the review into chunks or filters by relevance. For example, it may select only files or code sections affected by the change, rather than feeding the entire repo. Embedding tools (like Faiss or Milvus) can index code by meaning so the model can pull just the relevant pieces. Effective systems also use metadata filters (e.g. by directory or component) to zero in on the right area of the codebase, as in the sketch after this list.
  • Leverage domain docs and tickets: Architecture decision records, design docs, and issue descriptions contain critical hints about intent. Including these in the prompt – for instance, by summarizing a linked Jira ticket or quoting a design specification – can ground the model’s comments. Research suggests that developers value having “requirement tickets” in the model’s context, so building connectors to Jira or Confluence is key.
  • Embed problem-specific logic: Some checks are best hard-coded. For example, a system might have a library of corporate coding standards, API usage rules, or security policies that it explicitly applies (like a static analysis rule). The LLM can then handle only the questions these rules can’t easily encode (such as proposing an alternative algorithm).
  • Iterative clarification: Unlike a single-turn query, a good assistant can ask the engineer follow-up questions or deliver its findings in stages. For instance, it might identify an ambiguous change and prompt the author to clarify the intended behavior, mimicking a human review dialogue.
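Putting the first principle into practice, the sketch below indexes code chunks with an embedding model, keeps simple metadata (path, component) beside each vector, and retrieves only chunks from the affected component. It uses brute-force scoring over toy data; a vector index such as Faiss or Milvus would take its place at scale.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Toy corpus: in a real system these chunks would come from the repository,
# with metadata recorded at indexing time.
chunks = [
    {"path": "billing/invoice.py", "component": "billing", "text": "def total(items): ..."},
    {"path": "billing/tax.py",     "component": "billing", "text": "def apply_vat(amount): ..."},
    {"path": "auth/session.py",    "component": "auth",    "text": "def refresh_token(user): ..."},
]
vectors = model.encode([c["text"] for c in chunks], normalize_embeddings=True)

def relevant_chunks(change_summary: str, component: str, k: int = 2) -> list[dict]:
    """Filter by metadata first, then rank the survivors by semantic similarity."""
    idx = [i for i, c in enumerate(chunks) if c["component"] == component]
    q = model.encode([change_summary], normalize_embeddings=True)[0]
    scores = vectors[idx] @ q
    ranked = [idx[i] for i in np.argsort(scores)[::-1][:k]]
    return [chunks[i] for i in ranked]

context = relevant_chunks("change VAT rounding in invoice totals", component="billing")
```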

These architectural measures are necessary because, as Apple’s paper warns, LRMs on their own do not self-scale with complexity. Without such scaffolding, an LLM’s “reasoning” can devolve into gibberish as the task grows. As one Hugging Face analysis puts it, after a certain difficulty “the model had room to think, but stopped doing so” – likely because its internal heuristic search got stuck in local minima. This is why simply using the largest LLM or most reasoning-optimized model is insufficient. Instead, effective review assistants recognize the illusion of self-contained reasoning and explicitly supply context and computation where needed.

Conclusion

In sum, the illusion of thinking refers to the gap between appearance and reality in AI reasoning. Large language models can mimic thoughtful analysis on simple tasks, but they do not genuinely understand or follow complex logic the way humans do. Apple’s findings and our practical experience both highlight that LLMs fail dramatically beyond a certain complexity, often producing inconsistent or shallow chains of thought. In the domain of code review, this means LLMs excel at low-level checks but falter on high-context judgments – unless we carefully architect around them.

For tool builders and AI practitioners, the takeaway is clear: Focus on system design, not just model choice. A robust AI code reviewer must integrate knowledge sources (architecture docs, tickets, commit history), use classical analysis tools, and manage context intelligently. When built this way, LLMs become powerful assistants rather than brittle “thinkers.” By combining model-driven insights with human-like context-awareness, we can move beyond the illusion and create code review systems that truly enhance developer workflows.
