I spent a good part of my career (nearly a decade) at Google working on getting Clang to build the Linux kernel. https://clangbuiltlinux.github.io/
This LLM did it in (checks notes):
> Over nearly 2,000 Claude Code sessions and $20,000 in API costs
It may build, but does it boot? (Booting was also a significant and distinct next milestone for us. Also, will it blend?) Looks like yes!
> The 100,000-line compiler can build a bootable Linux 6.9 on x86, ARM, and RISC-V.
The next milestone is:
Is the generated code correct? The jury is still out on that one even for production compilers. And then there's the performance of the generated code.
> The generated code is not very efficient. Even with all optimizations enabled, it outputs less efficient code than GCC with all optimizations disabled.
This is a much more reasonable take than the cursor-browser thing. A few things that make it pretty impressive:
> This was a clean-room implementation (Claude did not have internet access at any point during its development); it depends only on the Rust standard library. The 100,000-line compiler can build Linux 6.9 on x86, ARM, and RISC-V. It can also compile QEMU, FFmpeg, SQlite, postgres, redis
> I started by drafting what I wanted: a from-scratch optimizing compiler with no dependencies, GCC-compatible, able to compile the Linux kernel, and designed to support multiple backends. While I specified some aspects of the design (e.g., that it should have an SSA IR to enable multiple optimization passes) I did not go into any detail on how to do so.
> Previous Opus 4 models were barely capable of producing a functional compiler. Opus 4.5 was the first to cross a threshold that allowed it to produce a functional compiler which could pass large test suites, but it was still incapable of compiling any real large projects.
And the very open points about limitations (and hacks, as cc loves hacks):
> It lacks the 16-bit x86 compiler that is necessary to boot [...] Opus was unable to implement a 16-bit x86 code generator needed to boot into 16-bit real mode. While the compiler can output correct 16-bit x86 via the 66/67 opcode prefixes, the resulting compiled output is over 60kb, far exceeding the 32k code limit enforced by Linux. Instead, Claude simply cheats here and calls out to GCC for this phase
> It does not have its own assembler and linker;
> Even with all optimizations enabled, it outputs less efficient code than GCC with all optimizations disabled.
Ending with a very down to earth take:
> The resulting compiler has nearly reached the limits of Opus’s abilities. I tried (hard!) to fix several of the above limitations but wasn’t fully successful. New features and bugfixes frequently broke existing functionality.
All in all, I'd say it's a cool little experiment, impressive even with the limitations, and a good test case. As the author says, "The resulting compiler has nearly reached the limits of Opus’s abilities". Yeah, that's fair, but still highly impressive IMO.
It used the best tests it could find for existing compilers. This is effectively steering Claude to a well-defined solution.
Hard to find fully specified problems like this in the wild.
I think this is more a testament to small, well-written tests than it is agent teams. I imagine you could do the same thing with any frontier model and a single agent in a linear flow.
I don’t know why people use parallel agents and increase accidental complexity. Isn’t one agent fast enough? Why trade accuracy to save a week, give or take, on writing a compiler?
> Write extremely high-quality tests
> Claude will work autonomously to solve whatever problem I give it. So it’s important that the task verifier is nearly perfect, otherwise Claude will solve the wrong problem. Improving the testing harness required finding high-quality compiler test suites, writing verifiers and build scripts for open-source software packages, and watching for mistakes Claude was making, then designing new tests as I identified those failure modes.
> For example, near the end of the project, Claude started to frequently break existing functionality each time it implemented a new feature. To address this, I built a continuous integration pipeline and implemented stricter enforcement that allowed Claude to better test its work so that new commits can’t break existing code.
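To make the "task verifier" idea concrete, here is a minimal sketch of a differential check against GCC as the known-good oracle, assuming a hypothetical compiler-under-test binary named ccc (the real harness described in the post spans whole test suites, build scripts, and CI, and would also compare exit codes and handle timeouts):

    /* difftest.c - differential test of one C file against GCC as oracle.
     * "ccc" is a hypothetical name for the compiler under test.
     */
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        if (argc != 2) { fprintf(stderr, "usage: %s file.c\n", argv[0]); return 2; }
        char cmd[1024];

        /* Build the same translation unit with the oracle and with ccc. */
        snprintf(cmd, sizeof cmd, "gcc -O0 -o ref.bin %s", argv[1]);
        if (system(cmd) != 0) { fprintf(stderr, "oracle build failed\n"); return 2; }
        snprintf(cmd, sizeof cmd, "ccc -o cut.bin %s", argv[1]);
        if (system(cmd) != 0) { fprintf(stderr, "FAIL: ccc could not build %s\n", argv[1]); return 1; }

        /* Run both binaries and compare their stdout; any difference is a failure. */
        if (system("./ref.bin > ref.out 2>&1; ./cut.bin > cut.out 2>&1; diff -q ref.out cut.out") != 0) {
            fprintf(stderr, "FAIL: output differs from GCC for %s\n", argv[1]);
            return 1;
        }
        printf("PASS: %s\n", argv[1]);
        return 0;
    }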
If I, a human, read the source code of $THING and then later implement my own version, that's not a "clean-room" re-implementation. The whole point of "clean-room" is that no single person has access to both the original code and the new code. (That way, you can legally prove that no copyright infringement took place.)
But when an AI does it, now it counts? Opus is trained on the source code of Clang, GCC, TCC, etc. So this is not "clean-room".
It's weird to see the expectation that the result should be perfect.
All said and done, that it's even possible is remarkable. Maybe these all go into training the next Opus or Sonnet and we start getting models that can create efficient compilers from scratch. That would be something!
My second reaction: still incredible, but noting that a C compiler is one of the most rigorously specified pieces of software out there. The spec is precise, the expected behavior is well-defined, and test cases are unambiguous.
I'm curious how well this translates to the kind of work most of us do day-to-day where requirements are fuzzy, many edge cases are discovered on the go, and what we want to build is a moving target.
People focused on the flaws are missing the picture. Opus wasn't even trained to be "a member of a team of engineers," it was adapted to the task by one person with a shell script loop. Specific training for this mode of operation is inevitable. And model "IQ" is increasing with every generation. If human IQ is increasing at all, it's only because the engineer pool is shrinking more at one end than the other.
This is a five-alarm fire if you're a SWE and not retiring in the next couple years.
>The fix was to use GCC as an online known-good compiler oracle to compare against.
>This was a clean-room implementation (Claude did not have internet access at any point during its development); it depends only on the Rust standard library.
How does one reconcile these two statements? Sure, one can mirror all of gnu.org locally, and a model that has already scraped the whole internet has somehow already integrated it into its weights, hasn't it?
The worldwide median household income (as of 2013 data from Gallup) was approximately $9,733 per year (in PPP, current international dollars). This means that $20,000 is more than double the global median annual income.
A median Luxembourg citizen earns $20,000 in about 5 to 6 months of work; a median Burundian would need about 42.5 months, that is 3.5 years.
https://worldpopulationreview.com/country-rankings/median-in...
> This was a clean-room implementation (Claude did not have internet access at any point during its development); it depends only on the Rust standard library. The 100,000-line compiler can build Linux 6.9 on x86, ARM, and RISC-V. It can also compile QEMU, FFmpeg, SQlite, postgres, redis, and has a 99% pass rate on most compiler test suites including the GCC torture test suite. It also passes the developer's ultimate litmus test: it can compile and run Doom.
This is incredible!
But it also speaks to the limitations of these systems: while these agentic systems can do amazing things when automatically-evaluable, robust test suites exist... you hit diminishing returns when you, as a human orchestrator of agentic systems, are making business decisions as fast as the AI can bring them to your attention. And that assumes the AI isn't just making business assumptions with the same lack of context, compounded with motivation to seem self-reliant, that a non-goal-aligned human contractor would have.
> when agents started to compile the Linux kernel, they got stuck. [...] Every agent would hit the same bug, fix that bug, and then overwrite each other's changes.
> [...] The fix was to use GCC as an online known-good compiler oracle to compare against. I wrote a new test harness that randomly compiled most of the kernel using GCC, and only the remaining files with Claude's C Compiler. If the kernel worked, then the problem wasn’t in Claude’s subset of the files. If it broke, then it could further refine by re-compiling some of these files with GCC. This let each agent work in parallel
This is a remarkably creative solution! Nicely done.
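The harness quoted above is essentially delta debugging over the set of source files: shrink the subset built with the compiler under test until the failure is localized. A toy sketch of that search, with the real "build the kernel with this mix and boot it" step stubbed out as a hypothetical build_and_boot() callback and a single miscompiled file assumed:

    /* oracle_bisect.c - toy localization of one miscompiled file.
     * Files in the suspect slice are "built" with the compiler under test;
     * everything else with GCC. build_and_boot() stands in for the real
     * mixed build + boot test, pretending file 5 is the miscompiled one.
     */
    #include <stdio.h>

    #define NFILES 8

    static int build_and_boot(const int *suspect, int n) {
        for (int i = 0; i < n; i++)
            if (suspect[i] == 5) return 0;   /* build "breaks" */
        return 1;                            /* build "boots"  */
    }

    int main(void) {
        int files[NFILES];
        for (int i = 0; i < NFILES; i++) files[i] = i;

        /* Binary search under a single-bad-file assumption: keep whichever
         * half of the suspect set still reproduces the failure. */
        int lo = 0, hi = NFILES;
        while (hi - lo > 1) {
            int mid = lo + (hi - lo) / 2;
            if (!build_and_boot(&files[lo], mid - lo))
                hi = mid;    /* failure reproduces with only the first half */
            else
                lo = mid;    /* first half is fine, bad file is in the rest */
        }
        printf("miscompiled file is likely file %d\n", files[lo]);
        return 0;
    }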
This is like a working version of the Cursor blog post. The evidence - that it compiles the Linux kernel - is much more impressive than a browser that didn't even compile (until someone manually intervened).
We live in a wonderful time where I can spend hours and $20,000 to build a C compiler that is slow and inefficient and in any case requires an existing great compiler to even work, and then neither I nor the agent has any idea how to make it useful :D
This is very much a "vibe coding can build you the Great Pyramids but it can't build a cathedral" situation, as described earlier today: https://news.ycombinator.com/item?id=46898223
I know this is an impressive accomplishment and is meant to show us the future potential, but it achieves big results by throwing an insane amount of compute at the problem, brute forcing its way to functionality. $20,000 set on fire, at Claude's discounted Max pricing no less.
Linear results from exponential compute are not nothing, but this certainly feels like a dead-end approach. The frontier should be more complexity for less compute, not more complexity from an insane amount more compute.
It's cool that you can look at the git history to see what it did. Unfortunately, I do not see any of the human written prompts (?).
First 10 commits ("git log --all --pretty=format:%s --reverse | head"):

    Initial commit: empty repo structure
    Lock: initial compiler scaffold task
    Initial compiler scaffold: full pipeline for x86-64, AArch64, RISC-V
    Lock: implement array subscript and lvalue assignments
    Implement array subscript, lvalue assignments, and short-circuit evaluation
    Add idea: type-aware codegen for correct sized operations
    Lock: type-aware codegen for correct sized operations
    Implement type-aware codegen for correct sized operations
    Lock: implement global variable support
    Implement global variable support across all three backends
> To stress test it, I tasked 16 agents with writing a Rust-based C compiler, from scratch, capable of compiling the Linux kernel. Over nearly 2,000 Claude Code sessions and $20,000 in API costs, the agent team produced a 100,000-line compiler that can build Linux 6.9 on x86, ARM, and RISC-V.
If you don't care about code quality, maintainability, readability, conformance to the specification, and performance of the compiler and of the compiled code, please, give me your $20,000, I'll give you your C compiler written from scratch :)
I try to see this like F1 racing.
Building a browser or a C compiler with agent swarms is disconnected from the reality of normal software projects. In normal projects the requirements are not fully understood upfront and you learn and adapt and change as you make progress.
But the innovations from professional racing result in better cars for everyone.
We'll probably get better dev tools and better coding agents thanks to those experiments.
A C compiler seems like one of the more straightforward things to have done. Reading this gives me the same vibe as when a magician does a frequently performed trick (sawing someone in half, etc).
I'd be more interested in letting it have a go at some of the other "less trodden" paths of computing. Some of the things that would "wow me more":
- Build a BEAM alternative, perhaps in an embedded space
- Build a Smalltalk VM, perhaps in an embedded space, or in WASM
These things are documented at some level, but still require a bit of original thinking to execute and pull off. That would wow me more.
1) obvious green field project
2) well defined spec which will definitely be in the training data
3) an end result which lands you 90% of the way to the finish
Now comes the hard part, the last 10%. Still not impressed here. Since fixing issues at the end was impossible without introducing bugs, I have doubts about the quality.
I'm glad they do call it out in the end. That's fair.
My question would be: what are the myriad other projects you tasked Opus 4.6 to build that could not get to a point where you could kinda-sorta make a post about them?
This kind of headline makes me think of p-hacking.
Next time can you build a Rust compiler in C? It doesn't even have to check things or have a borrow checker, as long as it reduces the compile times so it's like a fast debug iteration compiler.
Maybe I'm naive, but I find these re-engineering-a-complex-product posts underwhelming. C compilers exist, and realistically Claude's training corpus contains a ton of C compiler code. The task is already perfectly defined. There exists a benchmark of well-adopted codebases that can be used to prove whether this is a working solution. Half the difficulty in making something is proving it works and is complete.
IMO a simpler novel product that humans enjoy is 10x more impressive than rehashing a solved problem, regardless of difficulty.
The comments at [1] are a bit _too_ trollish for me, but they _do_ showcase that this compiler is far too lenient about what it accepts, to the point where I'd hesitate to call it ... a C compiler (this [2] comment in particular is pretty damning).
Still, an impressive achievement nonetheless, but there's a lot of nuance under the surface.
[1] https://github.com/anthropics/claudes-c-compiler/issues/1
[2] https://github.com/anthropics/claudes-c-compiler/issues/1#is...
> Claude will work autonomously to solve whatever problem I give it. So it’s important that the task verifier is nearly perfect, otherwise Claude will solve the wrong problem.
I think this is the fundamental thing here with AI. You can spin up infinite agents that can all do....stuff. But how do you keep them from doing the wrong stuff?
Is writing an airtight spec and test harness easier or less time consuming than just keeping a human in the loop and verifying and redirecting as the agents work?
Very cool, but I can't help but wonder how this translates to similarly complex projects where innate knowledge about the domain hasn't been embedded in the LLM via training data. There's a wealth of open source compiler code and related research papers that have been fed to the LLM. It seems like that would advantage the LLM significantly.
As cool as the result is, this article is quite tone-deaf to the fact that they asked a statistical model to "build" what was already in its training dataset... not to mention the troves of forum data discussing bugs and best practices.
The interesting thing here is what's this code worth (in money terms)? I would say it's worth only the cost of recreation, apparently $20,000, and not very much more. Perhaps you can add a bit for the time taken to prompt it. Anyone who can afford that can use the same prompt to generate another C compiler, and another one and another one.
GCC and Clang are worth much much more because they are battle-tested compilers that we understand and know work, even in a multitude of corner cases, over decades.
In future there's going to be lots and lots of basically worthless code, generated and regenerated over and over again. What will distinguish code that provides value? It's going to be code - however it was created, could be AI or human - that has actually been used and maintained in production for a long time, with a community or company behind it, bugs being triaged and fixed and so on.
Clicked on the first thing I happen to be interested in - SIMD stuff - and ended up at https://github.com/anthropics/claudes-c-compiler/blob/6f1b99..., which is a fast path incompatible with the _mm_free implementation; pretty trivial bug, not even actually SIMD or anything specialized at all.
A whole lot of UB in the actual SIMD impls (who'd have expected), but that can actually be fine here if the compiler is made to not take advantage of the UB. And then there's the super-weird mix of manual loops vs inline assembly vs builtins.
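For anyone wondering what "UB in the SIMD impls" usually looks like in practice: the classic pattern is type punning through an incompatible pointer cast, which breaks C's aliasing rules; copying the bytes with memcpy (or using proper load intrinsics) is the well-defined equivalent and compiles to the same code. A generic illustration, not code taken from the repo:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* UB flavour: reinterpreting a float through an incompatible pointer
     * type violates the aliasing/effective-type rules. It often "works"
     * anyway, which is why it survives in hand-rolled SIMD code. */
    static uint32_t bits_ub(float f) {
        return *(uint32_t *)&f;
    }

    /* Well-defined flavour: copy the object representation. Compilers
     * recognise this idiom and emit a single register move. */
    static uint32_t bits_ok(float f) {
        uint32_t u;
        memcpy(&u, &f, sizeof u);
        return u;
    }

    int main(void) {
        printf("%08x %08x\n", (unsigned)bits_ub(1.0f), (unsigned)bits_ok(1.0f));
        return 0;
    }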
However it was achieved, building such a complex project as a C compiler on a $20k budget in full autonomy is quite impressive.
Imho some commenters focus so much on the cons (many of them, to be fair, acknowledged by the blog post itself) that they forget to be genuinely impressed by the steps forward.
How about we get the LLMs to collaborate and design a programming language ideal for LLM coding? It would be terse (fewer tokens), easy for pattern searches, etc., and very fast to build and iterate on.
Cool article, interesting to read about their challenges. I've tasked Claude with building an Ada83 compiler targeting LLVM IR - which has gotten pretty far.
I am not using teams though and there is quite a bit of knowledge needed to direct it (even with the test suite).
But by some definition my "Ctrl", "C", and "V" keys can build a C compiler...
Obviously being facetious but my point being: I find it impossible to judge how impressed I should be by these model achievements since they don't show how they perform on a range of out-of-distribution tasks.
So I do think one can get value from coding agents, but that value is out of proportion compared to the investments made by the AI labs, so now they're pushing this kind of stuff which I find to be a borderline scam.
Let me explain why:
> the resulting compiled output is over 60kb, far exceeding the 32k code limit enforced by Linux
Seems like a failure to me.
> I tried (hard!) to fix several of the above limitations but wasn’t fully successful. New features and bugfixes frequently broke existing functionality.
This has code smell written all over it.
----
Conclusion: this cost 20k to build, not taking into account the money spent on training the model. How much would you pay for this software? Zero.
The reality is that LLMs are up there with SQL and RoR (or above) in terms of changing how people write software and interact with data. That's a big deal, but not enough to support trillion-dollar valuations.
So you get things like this project, which are just about driving a certain narrative.
I'm not particularly impressed that it can turn C into an SSA IR or assembly etc. The optimizations, however sophisticated, are where anything impressive would be. Then again, we have lots of examples in the training set, I would expect; C compilers are probably the most popular of all compilers. What would be more impressive is for it to have made a compiler for a well-defined language that isn't very close to a popular one.
What I am impressed by is that the task it completed had many steps and the agent didn't get lost or caught in a loop in the many sessions and time it spent doing it.
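For readers who haven't worked with SSA: the point of the SSA IR mentioned in the post is that every value is assigned exactly once, with phi nodes merging values at control-flow joins, which makes optimization passes much simpler to write. A tiny illustration (the IR notation is invented for this sketch; it is not the project's actual IR):

    /* Source: */
    int abs_plus(int x, int y) {
        int r = x;
        if (r < 0)
            r = -r;
        return r + y;
    }

    /* One possible SSA form:
     *
     *   entry:
     *     r1 = x
     *     c1 = lt r1, 0
     *     br c1, then, join
     *   then:
     *     r2 = neg r1
     *     br join
     *   join:
     *     r3 = phi [r1, entry], [r2, then]
     *     t1 = add r3, y
     *     ret t1
     *
     * Each r_i is assigned once; the phi node at the join picks r1 or r2
     * depending on which predecessor block executed. Constant propagation,
     * dead-code elimination, etc. become simple rewrites over this form.
     */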
I think we’re getting to a place where for anything with extensive verification available we’ll be “fitting” code to a task against tests like we fit an ML model to a loss function.
Now this is fairly "easy" as there are a multitude of implementations and specs all over the Internet. How about trying to design a new language that is unquestionably better/safer/faster for low-level systems programming than C/Rust/Zig? ML is great at aping existing stuff, but how about pushing it to invent something valuable instead?
This is a very early research prototype with no other inter-agent communication methods or high-level goal management processes.
The lock file approach (current_tasks/parse_if_statement.txt) prevents two agents from claiming the same task, but it can't prevent convergent wasted work. When all 16 agents hit the same Linux kernel bug, the lock files didn't help — the problem wasn't task collision, it was that the agents couldn't see they were all solving the same downstream failure. The GCC oracle workaround was clever, but it was a human inventing a new harness mid-flight because the coordination primitive wasn't enough.
Similarly, "Claude frequently broke existing functionality implementing new features" isn't a model capability problem — it's an input stability problem. Agent N builds against an interface that agent M just changed. Without gating on whether your inputs have changed since you started, you get phantom regressions
I think the good thing about it is that if you are given a good specification, you are likely to get a good result. Writing a C compiler is not something new, but it will be great for all the porting projects.
Most of the effort when writing a compiler is handling incorrect code, and reporting sensible error messages. Compiling known good code is a great start though.
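To make that concrete: a conforming compiler is required to diagnose constraint violations like the ones below (historically many compilers only warned), and a good one points at the exact construct with a helpful message. A compiler that silently accepts this kind of thing will "pass" far more inputs than it should; this is a generic illustration, not taken from the project's issue tracker:

    /* bad.c - each statement below is a constraint violation that a
     * conforming C99/C11 compiler must diagnose. */
    int main(void) {
        int *p = 42;      /* integer used to initialise a pointer, no cast */
        frobnicate(p);    /* call to an undeclared function (invalid since C99) */
        return 0;
    }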
What I find to be the most impressive part here is that it wrote the compiler without reference to the C specification and without architecture manuals at hand.
I'm annoyed at the cost statement, as that's the sleight of hand. "$20,000" at current pricing; add some orders of magnitude to the costs and you'll get the true price you'll have to pay when the VC money starts to wear off. Second, this ignores the dev time that he/others put in over multiple iterations of this project (Opus 4, Opus 4.5), all the other work to create the scaffolding for it, and all the millions/tens of millions of dollars of hand-written test suites (Linux kernel, GCC, Doom, SQLite, etc.) he got to use to guide the process. So add some more cost on top of that orders-of-magnitude increase, and the dev time is probably a couple of months or years more than "2 weeks".
And this is just working off the puff piece's statements, and not even diving into the code to see its limits/origins, etc. I also don't see the scaffold in the repo, which is where the effort is.
But still it's not surprising, from my own experience, given a rigorously definable problem, enough effort, grunt work, and massaging, you can get stuff out of the current models.
They should add this to the benchmark suite, and create a custom eval for how good the resulting compiler is, as well as how maintainable the source code is.
Thinking about this: while it's a cool achievement, how useful is it really? It relies on the fact that there is a large, comprehensive set of tests and a large number of available projects that can function as tests.
That situation is extremely uncommon for most development.
I really love how they waste energy on stuff like this. Even better, all that nonsense talk we constantly kept hearing about an energy crisis just a few years ago...
How much of this result is effectively plagiarized open source compiler code? I don't understand how this is compelling at all: obviously it can regurgitate things that are nearly identical in capability to already existing code it was explicitly trained on...
It's very telling how all these examples are all "look, we made it recreate a shittier version of a thing that already exists in the training set".
It means that if you already have, or are willing to build, a very robust test suite, and the task is a complicated but already-solved problem, you can get a sub-par implementation for a semi-reasonable amount of money.
> I tried (hard!) to fix several of the above limitations but wasn’t fully successful. New features and bugfixes frequently broke existing functionality.
This has been my experience of vibe coding too. Good for getting started, but you quickly reach the point where fixing one thing breaks another and you have to finish the project yourself.
I'm sure this is impressive, but it's probably not the best test case given how many C compilers there are out there and how they presumably have been featured in the training data.
This is almost like asking me to invent a pathfinding algorithm when I've been taught Dijkstra's and A*.
I will say that one thing that's extremely interesting is that everyone laughed at and made fun of Steve Yegge when he released Gas Town, which centered exactly around this idea — having more than a dozen agents working on a project simultaneously, with some generalized agents focusing on implementing features while others are more specialized and tasked with second-order tasks, where you just independently run them in a loop from an orchestrator until they've finished the project, with all of them working on worktrees and resolving merge conflicts and so on as a coordination mechanism — but it's starting to look like he was right. He really was aiming for where the puck was headed. First we got Cursor with the fast render browser, then we got Kimi K2.5 releasing with — from everything I can tell — actually very innovative and new RL techniques specifically for orchestrating agent swarms. And now we have this: Anthropic themselves doing a Gas Town-style agent-swarm model of development. It's beginning to look like he absolutely did know where the puck was headed before it got there.
Now, whether we should actually be building software in this fashion or even headed in this direction at all is a completely separate question. And I would tend strongly towards no. Not until at least we have very strong, yet easy to use concise and low effort formal verification, deterministic simulation testing, property-based testing, integration testing, etc; and even then, we'll end up pair programming those formal specifications and batteries of tests with AI agents. Not writing them ourselves, since that's inefficient, nor turning them over to agent swarms, since they are very important. And if we turn them over to swarms, we'd end up with an infinite regress problem. And ultimately, that's just programming at a higher level at that point. So I would argue we should never predominantly develop in this way.
But still, there is prescience in Gastown apparently, and that's interesting.
Interesting that they are still going with a testing strategy despite the wasted time. I think in the long run model checking and proofs are more scalable.
I guess it makes sense, as agents can generate tests. Since you are taking this route, I'd like to see agents that act as users, and that can only access docs, textbooks, user forums, and builds.
Brute-forcing a problem with a perfect test oracle and a really good heuristic (how many C compilers are in the training data?) is not enough to justify the hype imo.
Yes this is cool. I actually have worked on a similar project with a slightly worse test oracle and would gladly never have to do that sort of work again. Just tedious unfulfilling work. Though we caught issues with both the specifications/test oracle when doing the work. Also many of the team members learned and are now SMEs for related systems.
Is this evidence that knowledge work is dead or AGI is coming? Absolutely not. I think you’d be pretty ignorant with respect to the field to suggest such a thing.
> This was a clean-room implementation (Claude did not have internet access at any point during its development);
This is absolutely false and I wish the people doing these demonstrations were more honest.
It had access to GCC! Not only that, using GCC as an oracle was critical and had to be built in by hand.
Like the web browser project this shows how far you can get when you have a reference implementation, good benchmarks, and clear metrics. But that's not the real world for 99% of people, this is the easiest scenario for any ML setting.
There's a terrible bug where, once it compacts, it sometimes pulls in .o or other binary files and immediately fills your entire context. Then it compacts again... 10 minutes and your token budget is gone for the 5-hour period. Edit: hooks that prevent it from reading binary files can't prevent this.
Cool project, but they really could have skipped the mention of "clean room". Something trained on every copyrighted thing known to mankind is the opposite of a clean room.
> The generated code is not very efficient. Even with all optimizations enabled, it outputs less efficient code than GCC with all optimizations disabled.
Worse than "-O0" takes skill...
So then, it produced something much worse than tcc (which is better than gcc -O0), an equivalent of which one man can produce in under two weeks. So even all those tokens and dollars did not equal two weeks of one man's work.
Except the one man might explain such arbitrary and shitty code as this:
https://github.com/anthropics/claudes-c-compiler/blob/main/s...
why x9? who knows?!
Oh god, the more I look at this code the happier I get. I can already feel the contracts coming to fix LLM slop like this, when any company that takes this seriously needs it maintained and cannot...
> So, while this experiment excites me, it also leaves me feeling uneasy. Building this compiler has been some of the most fun I’ve had recently, but I did not expect this to be anywhere near possible so early in 2026
What? Didn’t cursed lang do something similar like 6 or 7 months ago? These bombastic marketing tactics are getting tired.
This chatbot has several C compilers in its training data. How is this possibly a useful benchmark for anything? LLMs routinely output code verbatim or modulo trivial changes as their own (very useful for license-laundering too).
> We tasked Opus 4.6 using agent teams to build a C Compiler
So, essentially to build something for which many, many examples already exist on the web, and which is likely baked into its training set somehow ... mmmyeah.
Can it create employment? How is this making life better?
I understand the achievement, but come on, wouldn't it be something to show if you had created employment for 10,000 people using your 20,000 USD!
Microsoft, OpenAI, Anthropic, XAI, all solving the wrong problems, your problems not the collective ones.
The title should have said "Anthropic stole GCC and other open-source compiler code to create a subpar, non-functional compiler", without attribution or compensation. Open source was never meant for thieving megacorps like them.
You could hire a reasonably skilled dev in India for a week for $1k - or you could pay $20k in LLM tokens, spend 2 hours writing essays to explain what you want, and then get a buggy mess.
Still a really cool project!
It would have been nice if they had also published:
- All prompts used
- The structure of the agent team (which agents / which roles)
- Any other material that went into the process
This would be a good source for learning, even though I'm not ready to spend 20k$ just for replicating the experiment.
It all still comes back to context management.
Very cool demonstration of the tech though.
If Claude had NOT been trained on compiler code, it would NOT have been able to build a compiler.
Definitely signals the end of software IP or at least in its present form.
- trained on all the GCC/clang source
- pulled down a kernel branch, presumably with extensive tests in source
- used GCC as an oracle
I certainly wouldn't be able to do this.
I flip flop man.
I need to reunderwrite what my vision of the future looks like.
Well there goes my weekend project plans
It's funny because by (most) definitions, it is not an artifact:
> a usually simple object (such as a tool or ornament) showing human workmanship or modification as distinguished from a natural object
Are Anthropic's claims reproducible?
I had to killall -9 claude 3 times yesterday
This is not entirely ridiculous.
Please fix.. :)
Look at this: https://github.com/7mind/jopa
No, I did not read the article...
I guess if it only created 1,000 lines it would be easy to see where those lines came from.