The Illusion of Thinking: Strengths and limitations of reasoning models [pdf]

(ml-site.cdn-apple.com)

Comments

jackdoe 7 June 2025
I think one of the reasons we are confused about what LLMs can do is that they use language. We look at the "reasoning traces" and the tokens there look human, but what is actually happening is very alien to us, as shown by "Biology of Large Language Models"[1] and "Safety Alignment Should Be Made More Than Just a Few Tokens Deep"[2].

I am struggling a lot to see what the tech can and cannot do, particularly when designing systems with it, and how to build systems where the whole is bigger than the sum of its parts. I think this is because I am constantly confused by their capabilities: despite understanding their machinery and how they work, their use of language just seems like magic. I even wrote https://punkx.org/jackdoe/language.html just to remind myself how to think about it.

I think this kind of research is amazing, and we have to put tremendously more effort into understanding how to use the tokens and how to build with them.

[1]: https://transformer-circuits.pub/2025/attribution-graphs/bio...

[2]: https://arxiv.org/pdf/2406.05946

curious_cat_163 7 June 2025
> Rather than standard benchmarks (e.g., math problems), we adopt controllable puzzle environments that let us vary complexity systematically

Very clever, I must say. Kudos to folks who made this particular choice.

> we identify three performance regimes: (1) low complexity tasks where standard models surprisingly outperform LRMs, (2) medium-complexity tasks where additional thinking in LRMs demonstrates advantage, and (3) high-complexity tasks where both models experience complete collapse.

This is fascinating! We need more "mapping" of regimes like this!

What I would love to see (not sure if someone on here has seen anything to this effect) is how these complexity regimes might map to the economic value of the task.

For that, the eval needs to go beyond puzzles, but the complexity of the tasks still needs to be controllable.

stephc_int13 7 June 2025
Human language is far from perfect as a cognitive tool but still serves us well because it is not foundational. We use it both for communication and for some reasoning/planning, as a high-level layer.

I strongly believe that human language is too weak (vague, inconsistent, not expressive enough etc.) to replace interactions with the world as a basis to build strong cognition.

We're easily fooled by the results of LLM/LRM models because we typically use language fluency and knowledge retrieval as a proxy benchmark for intelligence among our peers.

antics 7 June 2025
I think the intuition the authors are trying to capture is that they believe the models are omniscient, but also dim-witted. And the question they are collectively trying to ask is whether this will continue forever.

I've never seen this question quantified in a really compelling way, and while interesting, I'm not sure this PDF succeeds, at least not well enough to silence dissent. I think AI maximalists will continue to think that the models are in fact getting less dim-witted, while the AI skeptics will continue to think these apparent gains are in fact entirely a byproduct of "increasing" "omniscience." The razor will have to be a lot sharper before people start moving between these groups.

But, anyway, it's still an important question to ask, because omniscient-yet-dim-witted models terminate at "superhumanly assistive" rather than "Artificial Superintelligence", which in turn economically means "another bite at the SaaS apple" instead of "phase shift in the economy." So I hope the authors will eventually succeed.

gwd 7 June 2025
> Through extensive experimentation across diverse puzzles, we show that frontier LRMs face a complete accuracy collapse beyond certain complexities. Moreover, they exhibit a counterintuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then declines despite having an adequate token budget.

This is exactly my experience with coding. Start simple and build up complexity, and everything is great until you get to some threshold, at which point it completely falls apart and seems to stop even trying. Getting effective utilization out of Claude + aider involves managing the complexity that the LLM sees.

avsteele 8 June 2025
People are drawing erroneous conclusions from this.

My read of this is that the paper demonstrates that, for a given model (and the problems examined with it), giving more thinking tokens does not help on problems above a certain complexity. It does not say anything about the capabilities of future, larger models to handle more complex tasks. (NB: humans trend similarly.)

My concern is that people are extrapolating from this to conclusions about LLMs generally, and this is not warranted.

The only part of this I find even surprising is the abstract's conclusion (1): that 'thinking' can lead to worse outcomes for certain simple problems. (Again though, maybe you can say humans are the same here. You can overthink things.)

actinium226 6 June 2025
Man, remember when everyone was like 'AGI is just around the corner!'? Funny how well the Gartner hype cycle captures these sorts of things.
teleforce 7 June 2025
> We found that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across puzzles.

It seems that LLMs/LRMs need help from their distant cousins, namely logic, optimization, and constraint programming, which can be described as intelligent automation or IA [1],[2],[3],[4].

[1] Logic, Optimization, and Constraint Programming: A Fruitful Collaboration - John Hooker - CMU (2023) [video]:

https://www.youtube.com/live/TknN8fCQvRk

[2] "We Really Don't Know How to Compute!" - Gerald Sussman - MIT (2011) [video]:

https://youtube.com/watch?v=HB5TrK7A4pI

[3] Google OR-Tools:

https://developers.google.com/optimization

[4] MiniZinc:

https://www.minizinc.org/

thomasahle 7 June 2025
All the environments they test (Tower of Hanoi, Checker Jumping, River Crossing, Blocks World) could easily be solved perfectly by any of the LLMs if the authors had allowed them to write code.

I don't really see how this is different from "LLMs can't multiply 20-digit numbers", which, btw, most humans can't do either. I tried it once (using pen and paper) and consistently made errors somewhere.
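
For instance, here is a minimal Python sketch (mine, not from the paper) of the kind of program a model could write for the Tower of Hanoi instead of enumerating every move token by token:

  def hanoi(n, source="A", target="C", spare="B"):
      # Classic recursion: move n-1 disks aside, move the largest, restack.
      if n == 0:
          return []
      return (hanoi(n - 1, source, spare, target)
              + [(source, target)]
              + hanoi(n - 1, spare, target, source))

  moves = hanoi(10)
  print(len(moves))  # 1023, i.e. 2**10 - 1 moves, all valid by construction

The other puzzles would need a small search rather than a closed-form recursion, but the point stands.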

bigEnotation 8 June 2025
Reasoning models are just wrappers over the base model. It was pretty obvious they weren't actually reasoning but rather just refining the results using some kind of reasoning-like heuristic. At least that's what I assumed when they were released and you couldn't modify the system prompt.
a11r 8 June 2025
The system prompt in this experiment limits the solution to always spell out the concrete moves verbally. A human solving the Tower of Hanoi gives up around N=4 and goes off to invent a recursive solution instead. Prompted differently, the LLM would solve these puzzles just fine.

Here is my complete review/analysis of the paper: https://www.linkedin.com/pulse/art-abstraction-human-advanta...

edit: fixed typo

jbentley1 7 June 2025
Is Apple failing at AI so they just put all their R&D towards convincing themselves it isn't important?
bilsbie 8 June 2025
Interestingly, I just hit an example of this. Highly specific, but I was asking about pickleball strategy, and Grok and Claude both couldn't seem to understand that you can't aim at the opponent's feet when you're hitting up.

They just kept regurgitating internet advice, and I couldn't get them to understand the reasoning for why it was wrong.

JusticeJuice 6 June 2025
Their finding of LLMs working best at simple tasks, LRMs working best at medium complexity tasks, and then neither succeeding at actually complex tasks is good to know.
bilsbie 8 June 2025
I wonder if there's past symbolic-reasoning research we could integrate into LLMs. They're really good at parsing text and understanding the relationships between objects, i.e. getting the "symbols" correct.

Maybe we plug them into something like Prolog (or other such strategies?).
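
As a toy illustration of that split (entirely hypothetical, not something the paper proposes): the LLM's job would be limited to extracting facts and rules as symbols, and a tiny deterministic engine would do the deduction, Prolog-style:

  # Hypothetical pipeline: an LLM parses text into (subject, relation, object)
  # triples; a trivial forward-chaining engine then does the deduction exactly.
  facts = {("socrates", "is_a", "human")}  # pretend the LLM extracted this
  rules = [((None, "is_a", "human"), (None, "is", "mortal"))]  # None = variable

  def forward_chain(facts, rules):
      derived = set(facts)
      changed = True
      while changed:
          changed = False
          for (_, br, bo), (_, hr, ho) in rules:
              for (fs, fr, fo) in list(derived):
                  if fr == br and fo == bo:   # rule body matches this fact
                      new = (fs, hr, ho)      # bind the variable to the subject
                      if new not in derived:
                          derived.add(new)
                          changed = True
      return derived

  print(forward_chain(facts, rules))  # includes ('socrates', 'is', 'mortal')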

kamranjon 7 June 2025
The two interesting things I learned after reading this paper:

Even when given the exact steps needed to arrive at a solution in the prompt, the reasoning models still require just as many steps to reach a workable solution as they would if they weren’t given the solution in the prompt.

The other thing, which seems obvious in hindsight, but I don’t typically use these reasoning models in my day to day - is that it requires a significant amount of tokens to reach the point where reasoning models outperform non-reasoning models by a significant margin.

nialv7 6 June 2025
I've seen this too often, papers that ask questions they don't even bother to properly define.

> Are these models capable of generalizable reasoning, or are they leveraging different forms of pattern matching?

Define reasoning, define generalizable, define pattern matching.

For additional credit, after you have done so, show that humans are capable of what you just defined as generalizable reasoning.

ksec 8 June 2025
This actually set back my expectations of AI / LLM / LRM by at least 5 if not 10 years. But someone please correct me if I am wrong.

My view was that, up to a few years ago, while AI / LLMs were good at being conversational and dishing out results in a language we understand, they still didn't "understand" anything and much of the time conjured up answers that only seemed remotely correct. Pattern matching over a very large data set that could be correct 70%, and increasingly 80%+, of the time. However, more accurate predictions would require orders of magnitude more computing resources.

But pattern matching is still just pattern matching. There is no reasoning behind it. 1+1 should never equal 11, but the model may skew towards that result because of JavaScript. When fundamental logic isn't behind any of this progress and process, the very bottom layer of any conversation / information / result is fragile.

So I have been skeptical of AI progress and LLMs. That was until LRMs, or as the title says, reasoning LLMs. I thought we had somehow managed to program critical thinking into them, or some sort of reflection / fact checking / rationale / basic logic as a fundamental principle. And while I can tell LRMs aren't and won't be perfect, and may never quite reach AGI, the layer would improve over time until we found different ways to progress. And we would have something I call Assisted Intelligence, which is what a lot of people use AI for in programming today.

Instead, what this shows is that LRMs aren't reasoning at all. It is the LLM conjuring up excuses to make it look like it is reasoning. It is another set of pattern matching, specially made up so that it looks like reasoning. It is basically a kid making up a clever-sounding story about how he got the result, without thinking, because he just wants to get out of class or homework.

Maybe the title gave it away, and maybe we got tricked. It was always an LLM specifically trained to showcase "reasoning". The actual reasoning behind the scenes is never done. Hence the title "The Illusion of Thinking".

mcswell 8 June 2025
Discussion of this article by Gary Marcus: https://garymarcus.substack.com/p/a-knockout-blow-for-llms
rikafurude21 8 June 2025
Apple, with all the compute and talent anyone could ask for, did not manage to create a frontier LLM and was forced to bite the bullet and include GPT in their coveted walled garden in order to keep up and not be left behind in our current AI-hype(-bubble) mania. I suspect this hurts their ego greatly, and that's why they feel compelled to "talk trash". Saltiness won't get them anywhere, though. I hope they manage to pull through. It would be a shame if they also ended up where Microsoft currently is.
nrjpoddar 9 June 2025
One of the best analyses I've found is this blog post by Vishal Misra: https://medium.com/@vishalmisra/the-illusion-of-thinking-why...
akomtu 7 June 2025
The difference between imitation and reasoning can be made more clear if we switch from language to numbers:

  1 3 7 15 31 63 ...
How do you continue this sequence? What's the 1000000th number in this sequence? Imitation continues the likeness of what it sees and quickly gets off track. Imitation can't go abstract and tell the 1000000th element without writing down a million numbers leading to the answer. Reasoning finds the rule behind the set of examples and uses this rule to predict the next numbers, so it never gets off track.
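
To make that concrete with this particular sequence (whose rule happens to be a(k) = 2a(k-1) + 1, i.e. a(k) = 2^k - 1), a quick Python sketch of the "reasoning" route:

  # The rule behind 1 3 7 15 31 63 ... is a(k) = 2*a(k-1) + 1 with a(1) = 1,
  # whose closed form is a(k) = 2**k - 1. Reasoning jumps straight to the rule;
  # imitation would have to churn out a million terms and hope not to drift.
  def a(k):
      return 2**k - 1

  print([a(k) for k in range(1, 7)])  # [1, 3, 7, 15, 31, 63]
  millionth = a(1_000_000)            # exact value, no term-by-term generation
  print(millionth.bit_length())       # 1000000 bits (~301,030 decimal digits)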

The rule generating the sequence can be a sophisticated recurrent formula, e.g. a(k) = 2a(k-1) - sqrt(a(k-3)). Imitation can't solve this problem beyond trivial examples, but an AI can do what a scientist would do: come up with hypotheses, verify them against the examples and eventually find a formula that's reasonably accurate. The role of an LLM here is to suggest possible formulas.

The same sequence of examples can be generated by many formulas that differ in complexity and accuracy. This provokes the idea of a simple competition between AIs: the one that creates the simplest formula that's 99.5% accurate - wins. The formula really means a small program, once we get beyond trivial recurrent rules.

The ability to find simple and accurate models of reality is the essence of intelligence.

esafak 7 June 2025
I don't know that I would call it an "illusion of thinking", but LLMs do have limitations. Humans do too. No amount of human thinking has solved numerous open problems.
benlivengood 7 June 2025
These are the kind of studies that make so much more sense than the "LLMs can't reason because of this ideological argument or this one anecdote" posts/articles. Keep 'em coming!

And also: the frontier LLMs blow older LLMs out of the water. There is continual progress, and this study would have been structured substantially the same two years ago, just with much smaller N on the graphs, because the regimes were much tinier then.

ivape 6 June 2025
This is easily explained by accepting that there is no such thing as LRMs. LRMs are just LLMs that iterate on their own answers more (or provide themselves more context information of a certain type). The reasoning loop on an "LRM" is equivalent to asking a regular LLM to "refine" its own response, or to "consider" additional context of a certain type. There is no such thing as reasoning, basically; it was always a method to "fix" hallucinations or provide more context automatically, nothing else. These big companies baked in one of the hackiest prompt-engineering tricks that your typical enthusiast figured out long ago and managed to brand it and profit off it. The craziest part is that DeepSeek was able to cause a multi-billion-dollar drop and pump of AI stocks with this one trick. Crazy times.
jackson12t 9 June 2025
This feels a bit like a weird way to test 'thinking' in models, and reminds me of the old story of Gauss[1] and his classmates being assigned the task of adding up the numbers from 1-100.

I think the way the paper lays out the performance regimes is pretty interesting, but I don't think they achieved their goal of demonstrating that LRMs can't use reasoning to solve complex puzzles organically (without contamination/memorization): IMO testing the model's ability to define an algorithm to solve the puzzle would have been a better evaluation of that (rather than having the model walk through all of the steps manually). I don't know that I'd use an LRM for this sort of long-tail reasoning where it has to follow one single process for a long time over just one prompt; if I needed a really long chain of reasoning I'd use an agent or workflow.

It sounds more like the tests measure a model's ability to reason coherently and consistently over many steps rather than a model's ability to understand and solve a complex puzzle. For example, for the Tower of Hanoi, a prompt like "Define an algorithm that will find the sequence of moves to transform the initial configuration into the goal configuration" (e.g. "find an arithmetic series formula, young Gauss") seems like it would have been a better approach than "Find the sequence of moves to transform the initial configuration into the goal configuration" (e.g. "add up all these numbers"). This is kind of seen in how the study included a step where the LRMs were given the algorithm and then asked to solve the problem; the focus was on an LRM's ability to follow the steps, not its ability to come up with an algorithm/solution on its own.

In a job interview, for example, who among us would accept inability to hold all of the `(2^n) - 1` steps of the Tower of Hanoi in our brain as evidence of poor reasoning ability?

Again, I think it's a really interesting study covering a model's ability to consistently follow a simple process over time in pursuit of a static objective (and perhaps a useful benchmark moving forward), but I'm not confident that it successfully demonstrates a meaningful deficiency in overall reasoning capability.

[1]: https://www.americanscientist.org/article/gausss-day-of-reck...

Yenrabbit 8 June 2025
A few observations:

1) The performance vs. complexity curve looks very similar to that of most humans (having seen groups attempt the Tower of Hanoi with 5 car tires), haha.

2) Models can trivially solve some of these tasks when given tools.

3) This is an internship paper with some quirks that many have mostly dismissed, but it is being quoted everywhere as "Apple proves LLMs can't ever reason".

Anyway, fun experiment to test your understanding of these things but don't take any conclusions as gospel :)

yalogin 8 June 2025
Isn't it a matter of training? This is the way I think about LLMs. They just "learned" so much context that they spit out tokens one after the other based on that context. So if it hallucinates it's because it understood the context wrong or doesn't grasp the nuance in the context. The more complicated the task the higher chance of hallucinations. Now I don't know if this can be improved with more training but that is the only tool we have.
danck 7 June 2025
In Figure 1 (bottom right) they show how the correct answers are found later as the complexity goes higher. In the description they even state that in false responses the LRM often focuses on a wrong answer early and then runs out of tokens before being able to self-correct. This seems obvious and indicates that it's simply a matter of scaling (a bigger token budget would lead to better performance on more complex tasks). Am I missing something?
mitch_said 7 June 2025
Not ashamed to admit I found the original paper daunting, so I made a top-down, Q&A-based mind map to help me understand it: https://app.gwriter.io/#/mindmap/view/2d128d6e-c3e8-4b99-8f4...
piskov 8 June 2025
All "reasoning" models hit a complexity wall where they completely collapse to 0% accuracy.

No matter how much computing power you give them, they can't solve harder problems.

This research suggests we're not as close to AGI as the hype suggests.

Current "reasoning" breakthroughs may be hitting fundamental walls that can't be solved by just adding more data or compute.

Apple's researchers used controllable puzzle environments specifically because:

• They avoid data contamination
• They require pure logical reasoning
• They can scale complexity precisely
• They reveal where models actually break

Models could handle 100+ moves in Tower of Hanoi puzzles but failed after just 4 moves in River Crossing puzzles.

This suggests they memorized Tower of Hanoi solutions during training but can't actually reason.

https://x.com/RubenHssd/status/1931389580105925115

bgnn 8 June 2025
Who thinks that LLMs are thinking to begin with? It would have been amazing if they could. All the "reasoning" etc. is just taking a stab at it, but it's far from thinking.
beneboy 6 June 2025
This kind of explains why Claude will find the right solution, but then the more it thinks and keeps “improving” the more over-engineered (and sometimes wrong) the solution is. Interesting to see this coming up in formal research.
MaxPock 8 June 2025
Is it a case of being left behind by the train and now trying to rain on the parade? Attention has shifted from the next iPhone release to the newest LLM.
cdrini 7 June 2025
When I use a normal LLM, I generally try to think "would I be able to do this without thinking, if I had all the knowledge, but just had to start typing and go?".

Thinking LLMs can think, but they often can only think in one big batch before starting to "speak" their true answer. I think that needs to be rectified so they can switch between the two. In my previous framework, I would say "would I be able to solve this if I had all the knowledge, but could only think and then start typing?"

I think for larger problems, the answer to this is no. I would need paper/a whiteboard. That's what would let me think, write, output, iterate, draft, iterate. And I think that's where agentic AI seems to be heading.

bicepjai 6 June 2025
The study challenges the assumption that more "thinking" or longer reasoning traces necessarily lead to better problem-solving in LRMs.
stephc_int13 7 June 2025
The tldr: current approaches to add reasoning on top of language models are mostly tricks to squeeze a bit more juice out of the fruit, but the falloff is pretty steep and quick.
alansammarone 7 June 2025
I have a somewhat similar point of view to the one voiced by other people, but I like to think about it slightly differently, so I'll chime in - here's my take (although, admittedly, I'm operating with a quite small reasoning budget (5 minutes tops)):

Time and again, for centuries - with the pace picking up dramatically in recent decades - we thought we were special, and we were wrong. The Sun does not revolve around the Earth, which is a pretty typical planet with the same chemical composition as any other planet. All of a sudden we're not the only ones who can calculate, then solve symbolic equations, then play chess, then compose music, then talk, then reason (up to a point, for some definition of "reason"). You get my point.

And when we were not only matched, but dramatically surpassed in these tasks (and not a day earlier), we concluded that they weren't _really_ what made us special.

At this point, it seems to me reasonable to assume we're _not_ special, and the onus should be on anybody claiming that we are to at least attempt to mention in passing what the secret sauce is that we have (even if we can't quite say what it is without handwaving or using concepts that by definition cannot be defined - "qualia is the indescribable feeling of red, its redness(?)").

Oh, and sorry, I could never quite grasp what "sentient" is supposed to mean - would we be able to tell we're not sentient if we weren't?

giardini 11 June 2025
Now that y'all have had time to digest this, could you please post an ELI5 for the great unwashed masses?
d4rkn0d3z 7 June 2025
I wrote my first MLP 25 years ago, after repeating some early experiments in machine learning from 20 years before that. One of the experiments I repeated was in text-to-speech. It was amazing to set up training runs and return after several hours to listen to my supercomputer babble like a toddler. I literally recall listening and being unable to distinguish the output of my NN from that of a real toddler; I happened to be teaching my niece to read around that same time. And when the NN had gained a large vocabulary, such that it could fairly proficiently read aloud, I was convinced that I had found my PhD project and a path to AGI.

Further examination and discussion with more experienced researchers gave me pause. They said that one must have a solution, or a significant new approach toward solving the hard problems associated with a research project for it to be viable, otherwise time (and money) is wasted finding new ways to solve the easy problems.

This is a more general principle that can be applied to most areas of endeavour. When you set about research and development that involves a mix of easy, medium, and hard problems, you must solve the hard problems first otherwise you blow your budget finding new ways to solve the easy problems, which nobody cares about in science.

But "AI" has left the realm of science behind and entered the realm of capitalism where several years of meaningless intellectual gyration without ever solving a hard problem may be quite profitable.

8bitsrule 7 June 2025
Fusion has been 25 years away for all of my life.
jawiggins 8 June 2025
Figure 5 is really quite remarkable. It seems to show that normal LLMs are better at tasks where the correct answer is likely to be the next token. For tasks that require a small number of intermediate steps, current reasoning models do much better, but break down as the number of intermediate steps grow.

This seems to indicate that the next generation of models should focus on recursively solving small parts of the problem before function-calling another model to solve another small part of the problem and working its answer into the reasoning loop.
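
A rough sketch of what that could look like (my speculation, not the paper's proposal; call_model() is a hypothetical stand-in for whatever LLM API you use, stubbed out here so the sketch runs on its own):

  def call_model(prompt: str) -> str:
      # Hypothetical LLM call, stubbed out to keep the sketch self-contained.
      return "STUB"

  def solve(task: str, depth: int = 0, max_depth: int = 3) -> str:
      # Ask whether the task is small enough to answer in one shot.
      if depth >= max_depth or call_model(f"Is this atomic? {task}") == "yes":
          return call_model(f"Solve directly: {task}")
      # Otherwise split into subtasks, solve each recursively, and fold the
      # partial answers back into the parent's reasoning loop.
      subtasks = call_model(f"List subtasks, one per line: {task}").splitlines()
      partials = [solve(s, depth + 1, max_depth) for s in subtasks if s.strip()]
      return call_model(f"Combine these partial answers for '{task}': {partials}")

  print(solve("Plan a 7-disk Tower of Hanoi solution"))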

Many seem to be citing this paper as an indication that LLMs are over - I think this indicates a clear path towards the next step function change in their abilities.

behnamoh 6 June 2025
Okay Apple, you got my attention. But I'm a strong proponent of the "something is better than nothing" philosophy—even if OpenAI/Google/etc. are building reasoning models with the limitations that you describe, they still represent huge progress compared to what we had not long ago. Meanwhile you're not even trying.

It's so easy to criticize the works of others and not deliver anything. Apple—be Sam in Game of Thrones: "I'm tired of reading about the achievements of better men".