Tracing the thoughts of a large language model

(anthropic.com)

Comments

marcelsalathe 27 March 2025
I’ve only skimmed the paper - a long and dense read - but it’s already clear it’ll become a classic. What’s fascinating is that engineering is transforming into a science, trying to understand precisely how its own creations work

This shift is more profound than many realize. Engineering traditionally applied our understanding of the physical world, mathematics, and logic to build predictable things. But now, especially in fields like AI, we’ve built systems so complex we no longer fully understand them. We must now use scientific methods - originally designed to understand nature - to comprehend our own engineered creations. Mindblowing.

cadamsdotcom 27 March 2025
So many highlights from reading this. One that stood out for me is their discovery that refusal works by inhibition:

> It turns out that, in Claude, refusal to answer is the default behavior: we find a circuit that is "on" by default and that causes the model to state that it has insufficient information to answer any given question. However, when the model is asked about something it knows well—say, the basketball player Michael Jordan—a competing feature representing "known entities" activates and inhibits this default circuit

Many cellular processes work similarly ie. there will be a process that runs as fast as it can and one or more companion “inhibitors” doing a kind of “rate limiting”.

Given both phenomena are emergent it makes you wonder if do-but-inhibit is a favored technique of the universe we live in, or just coincidence :)

polygot 27 March 2025
There needs to be some more research on what path the model takes to reach its goal, perhaps there is a lot of overlap between this and the article. The most efficient way isn't always the best way.

For example, I asked Claude-3.7 to make my tests pass in my C# codebase. It did, however, it wrote code to detect if a test runner was running, then return true. The tests now passed, so, it achieved the goal, and the code diff was very small (10-20 lines.) The actual solution was to modify about 200-300 lines of code to add a feature (the tests were running a feature that did not yet exist.)

smath 27 March 2025
Reminds me of the term 'system identification' from old school control systems theory, which meant poking around a system and measuring how it behaves, - like sending an input impulse and measuring its response, does it have memory, etc.

https://en.wikipedia.org/wiki/System_identification

aithrowawaycomm 27 March 2025
I struggled reading the papers - Anthropic’s white papers reminds me of Stephen Wolfram, where it’s a huge pile of suggestive empirical evidence, but the claims are extremely vague - no definitions, just vibes - the empirical evidence seems selectively curated, and there’s not much effort spent building a coherent general theory.

Worse is the impression that they are begging the question. The rhyming example was especially unconvincing since they didn’t rule out the possibility that Claude activated “rabbit” simply because it wrote a line that said “carrot”; later Anthropic claimed Claude was able to “plan” when the concept “rabbit” was replaced by “green,” but the poem fails to rhyme because Claude arbitrarily threw in the word “green”! What exactly was the plan? It looks like Claude just hastily autocompleted. And Anthropic made zero effort to reproduce this experiment, so how do we know it’s a general phenomenon?

I don’t think either of these papers would be published in a reputable journal. If these papers are honest, they are incomplete: they need more experiments and more rigorous methodology. Poking at a few ANN layers and making sweeping claims about the output is not honest science. But I don’t think Anthropic is being especially honest: these are pseudoacademic infomercials.

matthiaspr 28 March 2025
Interesting paper arguing for deeper internal structure ("biology") beyond pattern matching in LLMs. The examples of abstraction (language-agnostic features, math circuits reused unexpectedly) are compelling against the "just next-token prediction" camp.

It sparked a thought: how to test this abstract reasoning directly? Try a prompt with a totally novel rule:

“Let's define a new abstract relationship: 'To habogink' something means to perform the action typically associated with its primary function, but in reverse. Example: The habogink of 'driving a car' would be 'parking and exiting the car'. Now, considering a standard hammer, what does it mean 'to habogink a hammer'? Describe the action.”

A sensible answer (like 'using the claw to remove a nail') would suggest real conceptual manipulation, not just stats. It tests if the internal circuits enable generalizable reasoning off the training data path. Fun way to probe if the suggested abstraction is robust or brittle.

fpgaminer 27 March 2025
> This is powerful evidence that even though models are trained to output one word at a time

I find this oversimplification of LLMs to be frequently poisonous to discussions surrounding them. No user facing LLM today is trained on next token prediction.

jacooper 27 March 2025
So it turns out, it's not just simple next token generation, there is intelligence and self developed solution methods (Algorithms) in play, particularly in the math example.

Also the multi language finding negates, at least partially, the idea that LLMs, at least large ones, don't have an understanding of the world beyond the prompt.

This changed my outlook regarding LLMs, ngl.

modeless 27 March 2025
> In the poetry case study, we had set out to show that the model didn't plan ahead, and found instead that it did.

I'm surprised their hypothesis was that it doesn't plan. I don't see how it could produce good rhymes without planning.

indigoabstract 27 March 2025
While reading the article I enjoyed pretending that a powerful LLM just crash landed on our planet and researchers at Anthropic are now investigating this fascinating piece of alien technology and writing about their discoveries. It's a black box, nobody knows how its inhuman brain works, but with each step, we're finding out more and more.

It seems like quite a paradox to build something but to not know how it actually works and yet it works. This doesn't seem to happen very often in classical programming, does it?

TechDebtDevin 27 March 2025
>>Claude will plan what it will say many words ahead, and write to get to that destination. We show this in the realm of poetry, where it thinks of possible rhyming words in advance and writes the next line to get there. This is powerful evidence that even though models are trained to output one word at a time, they may think on much longer horizons to do so.

This always seemed obvious to me or that LLMs were completing the next most likely sentence or multiple words.

deadbabe 27 March 2025
We really need to work on popularizing better, non-anthropomorphic terms for LLMs, as they don’t really have “thoughts” the way people think. Such terms make people more susceptible to magical thinking.
sgt101 28 March 2025
>Claude will plan what it will say many words ahead, and write to get to that destination. We show this in the realm of poetry, where it thinks of possible rhyming words in advance and writes the next line to get there. This is powerful evidence that even though models are trained to output one word at a time, they may think on much longer horizons to do so.

Models aren't trained to do next word prediction though - they are trained to do missing word in this text prediction.

osigurdson 27 March 2025
>> Claude can speak dozens of languages. What language, if any, is it using "in its head"?

I would have thought that there would be some hints in standard embeddings. I.e., the same concept, represented in different languages translates to vectors that are close to each other. It seems reasonable that an LLM would create its own embedding models implicitly.

jaakl 28 March 2025
My main takeaway here is that the models cannot tell know how they really work, and asking it from them is just returning whatever training dataset would suggest: how a human would explain it. So it does not have self-consciousness, which is of course obvious and we get fooled just like the crowd running away from the arriving train in Lumiére's screening. LLM just fails the famous old test "cogito ergo sum". It has no cognition, ergo they are not agents in more than metaphorical sense. Ergo we are pretty safe from AI singularity.
zerop 27 March 2025
The explanation of "hallucination" is quite simplified, I am sure there is more there.

If there is one problem I have to pick to to trace in LLMs, I would pick hallucination. More tracing of "how much" or "why" model hallucinated can lead to correct this problem. Given the explanation in this post about hallucination, I think degree of hallucination can be given as part of response to the user?

I am facing this in RAG use case quite - How do I know model is giving right answer or Hallucinating from my RAG sources?

mvATM99 27 March 2025
What a great article, i always like how much Anthropic focuses on explainability, something vastly ignored by most. The multi-step reasoning section is especially good food for thought.
SkyBelow 27 March 2025
>Claude speaks dozens of languages fluently—from English and French to Chinese and Tagalog. How does this multilingual ability work? Is there a separate "French Claude" and "Chinese Claude" running in parallel, responding to requests in their own language? Or is there some cross-lingual core inside?

I have an interesting test case for this.

Take a popular enough Japanese game that has been released for long enough for social media discussions to be in the training data, but not so popular to have an English release yet. Then ask it a plot question, something major enough to be discussed, but enough of a spoiler that it won't show up in marketing material. Does asking in Japanese have it return information that is lacking when asked in English, or can it answer the question in English based on the information in learned in Japanese?

I tried this recently with a JRPG that was popular enough to have a fan translation but not popular enough to have a simultaneous English release. English did not know the plot point, but I didn't have the Japanese skill to confirm if the Japanese version knew the plot point, or if discussion was too limited for the AI to be aware of it. It did know of the JRPG and did know of the marketing material around it, so it wasn't simply a case of my target being too niche.

0x70run 27 March 2025
I would pay to watch James Mickens comment on this stuff.
diedyesterday 28 March 2025
Regarding the conclusion about language-invariant reasoning (conceptual universality vs. multilingual processing) it helps understanding and becomes somewhat obvious if we regard each language as just a basis of some semantic/logical/thought space in the mind (analogous to the situation in linear algebra and duality of tensors and bases).

The thoughts/ideas/concepts/scenarios are invariant states/vector/points in the (very high dimensional) space of meanings in the mind and each language is just a basis to reference/define/express/manipulate those ideas/vectors. A coordinatization of that semantic space.

Personally, I'm a multilingual person with native-level command of several languages. Many times it happens, I remember having a specific thought, but don't remember in what language it was. So I can personally sympathize with this finding of the Anthropic researchers.

JPLeRouzic 27 March 2025
This is extremely interesting: The authors look at features (like making poetry, or calculating) of LLM production, make hypotheses about internal strategies to achieve the result, and experiment with these hypotheses.

I wonder if there is somewhere an explanation linking the logical operations made on a on dataset, are resulting in those behaviors?

YeGoblynQueenne 28 March 2025
>> Language models like Claude aren't programmed directly by humans—instead, they‘re trained on large amounts of data.

Gee, I wonder where this data comes from.

Let's think about this step by step.

So, what do we know? Language models like Claud are not programmed directly.

Wait, does that mean they are programmed indirectly?

If so, by whom?

Aha, I got it. They are not programmed, directly or indirectly. They are trained on large amounts of data.

But that is the question, right? Where does all that data come from?

Hm, let me think about it.

Oh hang on I got it!

Language models are trained on data.

But they are language models so the data is language.

Aha! And who generates language?

Humans! Humans generate language!

I got it! Language models are trained on language data generated by humans!

Wait, does that mean that language models like Claud are indirectly programmed by humans?

That's it! Language models like Claude aren't programmed directly by humans because they are indirectly programmed by humans when they are trained on large amounts of language data generated by humans!

ofrzeta 28 March 2025
Back to the "language of thought" question, this time with LLMs :) https://en.wikipedia.org/wiki/Language_of_thought_hypothesis
annoyingnoob 27 March 2025
Do LLMs "think"? I have trouble with the title, claiming that LLMs have thoughts.
teleforce 29 March 2025
Another review on the paper from MIT Technology Review [1].

[1] Anthropic can now track the bizarre inner workings of a large language model:

https://www.technologyreview.com/2025/03/27/1113916/anthropi...

davidmurphy 27 March 2025
On a somewhat related note, check out the video of Tuesday's Computer History Museum x IEEE Spectrum event, "The Great Chatbot Debate: Do LLMs Really Understand?"

Speakers: Sébastien Bubeck (OpenAI) and Emily M. Bender (University of Washington). Moderator: Eliza Strickland (IEEE Spectrum).

Video: https://youtu.be/YtIQVaSS5Pg Info: https://computerhistory.org/events/great-chatbot-debate/

jaehong747 28 March 2025
I’m skeptical of the claim that Claude “plans” its rhymes. The original example—“He saw a carrot and had to grab it, / His hunger was like a starving rabbit”—is explained as if Claude deliberately chooses “rabbit” in advance. However, this might just reflect learned statistical associations. “Carrot” strongly correlates with “rabbit” (people often pair them), and “grab it” naturally rhymes with “rabbit,” so the model’s activations could simply be surfacing common patterns.

The research also modifies internal states—removing “rabbit” or injecting “green”—and sees Claude shift to words like “habit” or end lines with “green.” That’s more about rerouting probabilistic paths than genuine “adaptation.” The authors argue it shows “planning,” but a language model can maintain multiple candidate words at once without engaging in human-like strategy.

Finally, “planning ahead” implies a top-down goal and a mechanism for sustaining it, which is a strong assumption. Transformative evidence would require more than observing feature activations. We should be cautious before anthropomorphizing these neural nets.

a3w 27 March 2025
Article and papers looks good. Video seems misleading, since I can use optimization pressure and local minima to explain the model behaviour. No "thinking" required, which the video claims is proven.
d--b 27 March 2025
> This is powerful evidence that even though models are trained to output one word at a time, they may think on much longer horizons to do so.

Suggesting that an awful lot of calculations are unnecessary in LLMs!

0xbadcafebee 27 March 2025
AI "thinks" like a piece of rope in a dryer "thinks" in order to come to an advanced knot: a whole lot of random jumbling that eventually leads to a complex outcome.
jxjnskkzxxhx 2 April 2025
CaNt UnDeRsTaNd WhY pEoPlE aRe BuLlIsH.

ItS jUsT a StOcHaStIc PaRtOt.

jasonjmcghee 28 March 2025
I'm completely hooked. This is such a good paper.

It hallucinating how it thinks through things is particularly interesting - not surprising, but cool to confirm.

I would LOVE to see Anthropic feed the replacement features output to the model itself and fine tune the model on how it thinks through / reasons internally so it can accurately describe how it arrived at its solutions - and see how it impacts its behavior / reasoning.

trhway 28 March 2025
>We find that the shared circuitry increases with model scale, with Claude 3.5 Haiku sharing more than twice the proportion of its features between languages as compared to a smaller model.

While it was already generally noticeable, still one more time confirmed that larger model generalizes better instead of using its bigger numbers of parameters just to “memorize by rote” (overfitting).

westurner 29 March 2025
XAI: Explainable artificial intelligence: https://en.wikipedia.org/wiki/Explainable_artificial_intelli...
kazinator 27 March 2025
> Claude writes text one word at a time. Is it only focusing on predicting the next word or does it ever plan ahead?

When a LLM outputs a word, it commits to that word, without knowing what the next word is going to be. Commits meaning once it settles on that token, it will not backtrack.

That is kind of weird. Why would you do that, and how would you be sure?

People can sort of do that too. Sometimes?

Say you're asked to describe a 2D scene in which a blue triangle partially occludes a red circle.

Without thinking about the relationship of the objects at all, you know that your first word is going to be "The" so you can output that token into your answer. And then that the sentence will need a subject which is going to be "blue", "triangle". You can commit to the tokens "The blue triangle" just from knowing that you are talking about a 2D scene with a blue triangle in it, without considering how it relates to anything else, like the red circle. You can perhaps commit to the next token "is", if you have a way to express any possible relationship using the word "to be", such as "the blue circle is partially covering the red circle".

I don't think this analogy necessarily fits what LLMs are doing.

Hansenq 27 March 2025
I wonder how much of these conclusions are Claude-specific (given that Anthropic only used Claude as a test subject) or if they extrapolate to other transformer-based models as well. Would be great to see the research tested on Llama and the Deepseek models, if possible!
hbarka 28 March 2025
Dario Amodei was in an interview where he said that OpenAI beat them (Anthropic) by mere days to be the first to release. That first move ceded the recognition to ChatGPT but according to Dario it could have been them just the same.
darkhorse222 28 March 2025
Once we are aware of these neural pathways I see no reason there shouldn't be a watcher and influencer of the pathways. A bit like a dystopian mind watcher. Shape the brain.
HocusLocus 27 March 2025
[Tracing the thoughts of a large language model]

"What have I gotten myself into??"

EncomLab 27 March 2025
This is very interesting - but like all of these discussions it sidesteps the issues of abstractions, compilation, and execution. It's fine to say things like "aren't programmed directly by humans", but the abstracted code is not the program that is running - the compiled code is - and that is code is executing within the tightly bounded constraints of the ISA it is being executed in.

Really this is all so much slight of hand - as an esolang fanatic this all feels very familiar. Most people can't look a program written in Whitespace and figure it out either, but once compiled it is just like every other program as far as the processor is concerned. LLM's are no different.

twoodfin 28 March 2025
I say this at least 82.764% in jest:

Don’t these LLM’s have The Bitter Lesson in their training sets? What are they doing building specialized structures to handle specific needs?

teleforce 29 March 2025
Oh the irony of not able to download the entire paper in one compact PDF format referred in the article, while apparently all the reference citations have PDF of the cited article to be downloaded and accessible from the provided online links [1].

Come on Anthropic, you can do much better than this unconventional and bizarre approach to publication.

[1] On the Biology of a Large Language Model:

https://transformer-circuits.pub/2025/attribution-graphs/bio...

LoganDark 27 March 2025
LLMs don't think, and LLMs don't have strategies. Maybe it could be argued that LLMs have "derived meaning", but all LLMs do is predict the next token. Even RL just tweaks the next-token prediction process, but the math that drives an LLM makes it impossible for there to be anything that could reasonably be called thought.
greesil 27 March 2025
What is a "thought"?
kittikitti 27 March 2025
What's the point of this when Claude isn't open sourced and we just have to take Anthropic's word for it?
alach11 27 March 2025
Fascinating papers. Could deliberately suppressing memorization during pretraining help force models to develop stronger first-principles reasoning?
navaed01 28 March 2025
When and how to do stop saying LLMs are predicting the nect set of tokens and start saying they are thinking? Is this the point?
rambambram 27 March 2025
When I want to trace the 'thoughts' of my programs, I just read the code and comments I wrote.

Stop LLM anthropomorphizing, please. #SLAP