> Qwen3-Coder is available in multiple sizes, but we’re excited to introduce its most powerful variant first
I'm most excited for the smaller sizes because I'm interested in locally-runnable models that can sometimes write passable code, and I think we're getting close. But since, for the foreseeable future, I'll probably still sometimes want to "call in" a bigger model that I can't realistically or affordably host on my own machine, I love having high-quality open-weight models as an option for that. I also like the idea of "paying in" for the smaller open-weight models I play around with by renting access to their larger counterparts.
Congrats to the Qwen team on this release! I'm excited to try it out.
At my work, here is a typical breakdown of how a software engineer's time is spent across work areas. Which of these areas can be sped up by agentic coding?
5%: Making code changes
10%: Running build pipelines
20%: Learning about changed processes and people via Zoom calls, Teams chat, and email
15%: Raising incident tickets for issues outside of my control
20%: Submitting forms, attending reviews, and chasing approvals
20%: Reaching out to people for dependencies and following up
10%: Finding and reading some obscure, conflicting internal wiki page that is likely outdated
I've been using it all day, it rips. I had to bump the tool-calling limit in Cline up to 100 and it just went through the app with no issues: got the mobile app built, fixed the linter errors... I wasn't even hosting it with the tool-call template enabled on the vLLM nightly; with stock vLLM it understood the tool-call instructions just fine.
This suggests adding a `QWEN.md` in the repo for agent instructions.
Where are we with `AGENTS.md`? In a team repo it's getting ridiculous to have a duplicate markdown file for every agent out there.
I tried using the "fp8" model through Hyperbolic, but I question whether it was even that model. It was basically useless through Hyperbolic.
I downloaded the 4-bit quant to my Mac Studio (512GB). 7-8 minutes until first tokens with a big Cline prompt for it to chew on. Performance is exceptional. It nailed all the tool calls, loaded my memory bank, and reasoned about a Go codebase well enough to write a blog post on the topic: https://convergence.ninja/post/blogs/000016-ForeverFantasyFr...
Writing blog posts is one of the tests I use for these models. It is a very involved process, including a Q&A phase, a drafting phase, approval, and deployment. The filenames follow a certain pattern. The file has to be uploaded to S3 in a certain location to trigger the deployment. It's a complex custom task that I automated.
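For illustration, here is a minimal sketch of what the final "upload to S3 to trigger deployment" step of such a pipeline might look like. The bucket name, key prefix, and filename pattern below are my own assumptions, not details from the comment.

```python
# Hypothetical sketch of an automated publish step: the post follows a
# numbered-filename convention and is uploaded to S3, where the upload
# itself triggers deployment. Bucket, prefix, and pattern are assumptions.
import re
import boto3

BUCKET = "example-blog-bucket"                            # assumed name
PREFIX = "post/blogs/"                                    # assumed key prefix
FILENAME_RE = re.compile(r"^\d{6}-[A-Za-z0-9-]+\.md$")    # assumed pattern

def publish_post(local_path: str, filename: str) -> None:
    if not FILENAME_RE.match(filename):
        raise ValueError(f"filename does not match the expected pattern: {filename}")
    s3 = boto3.client("s3")
    # Uploading to the watched location is what kicks off the deployment.
    s3.upload_file(local_path, BUCKET, PREFIX + filename)

# publish_post("drafts/000016-ForeverFantasy.md", "000016-ForeverFantasy.md")
```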
Even the 4-bit model was capable of this, but it was incapable of actually working on my code, preferring to hallucinate methods that would be convenient rather than admitting it didn't know what it was doing. That is the 4-bit "lobotomized" model, though. I'm excited to see how it performs at full power.
How does one keep up with all this change? I wish we could fast-forward like 2-3 years to see if an actual winner has landed by then. I feel like at that point there will be THE tool, with no one thinking twice about using anything else.
What sort of hardware will run Qwen3-Coder-480B-A35B-Instruct?
With performance apparently comparable to Sonnet, some heavy Claude Code users could be interested in running it locally. They have instructions for configuring it for use with Claude Code. Huge usage bills are regularly shared on X, so maybe it could even be economical (say, for a team of 6 or so sharing a local instance).
Glad to see everyone centering on using OpenHands [1] as the scaffold! Nothing more frustrating than seeing "private scaffold" on a public benchmark report.
Does anybody know of an inference provider that offers input token caching? It should be almost required for agentic use: first for speed, but also because almost all conversations start where the previous one ended, so cost can end up quite a bit higher without caching.
I would have expected good providers like Together, Fireworks, etc. to support it, but I can't find it, except if I run vLLM myself on self-hosted instances.
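If you do end up self-hosting, here is a minimal sketch of what vLLM with automatic prefix caching turned on looks like, so repeated agentic turns that share a long prefix reuse cached KV blocks instead of recomputing them. The model id and the `enable_prefix_caching` argument reflect my reading of the vLLM docs; verify against your installed version (and obviously the 480B model needs serious hardware, this is just to show the setting).

```python
# Rough sketch (not from the comment): self-hosted vLLM with automatic
# prefix caching enabled for agent-style conversations that keep growing
# from the same prefix.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-Coder-480B-A35B-Instruct",  # assumed HF repo id
    enable_prefix_caching=True,                   # per vLLM docs as I understand them
)

params = SamplingParams(temperature=0.2, max_tokens=512)

# Each turn resends the shared prefix; with prefix caching the engine can
# skip recomputing it rather than paying for it again on every call.
shared_prefix = "You are a coding agent.\n\n<conversation so far>\n\n"
for user_turn in ["Add a unit test", "Now fix the lint errors"]:
    out = llm.generate([shared_prefix + user_turn], params)
    shared_prefix += user_turn + "\n" + out[0].outputs[0].text + "\n"
```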
Wow, these companies in the LLM field are so quick to catch up. From everyone offering their own chat model, to OpenAI-compatible schemas, to letting extensions and IDEs do the work, to agentic tasks, and now most of them offering their own CLI.
Thank god I already made an Alibaba Cloud account last year, because this interface sucks big time. At least you get 1 million tokens free (once?). A bit confusing that they forked the Gemini CLI but you still have to set environment variables for OpenAI?
So far none of these models can write even a slightly complicated function well for me. I tried Mistral, ChatGPT, Qwen Coder 2, Claude, ... and they apparently all fail when the solution requires making use of continuations and the like. Probably because they don't have enough examples in their training data or something.
Example: Partition a linked list in linear time. None of these models seems to get that `reverse`, or converting the whole list to a vector, are themselves linear operations and are therefore off-limits. When you tell them not to use those, they still do, and blatantly claim that they are not using them. À la:
"You are right, ... . The following code avoids using `reverse`, ... :
[code that still uses reverse]"
And in languages like Python they will cheat, because Python's list is more like an array, where random access is O(1).
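For what it's worth, here is the kind of answer the prompt seems to be after, as I read it: a single-pass, O(n) stable partition of a singly linked list around a pivot, without calling `reverse` or copying the list into an array. This is my own sketch, not output from any of the models.

```python
# One-pass, O(n) partition of a singly linked list around a pivot value:
# nodes < pivot keep their relative order and come first, the rest follow.
# No reverse, no conversion to a Python list/array.
class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def partition(head, pivot):
    less_head = less_tail = None   # sublist of nodes < pivot
    rest_head = rest_tail = None   # sublist of nodes >= pivot
    node = head
    while node is not None:
        nxt = node.next
        node.next = None
        if node.value < pivot:
            if less_tail is None:
                less_head = less_tail = node
            else:
                less_tail.next = node
                less_tail = node
        else:
            if rest_tail is None:
                rest_head = rest_tail = node
            else:
                rest_tail.next = node
                rest_tail = node
        node = nxt
    if less_tail is None:
        return rest_head
    less_tail.next = rest_head
    return less_head
```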
This means they only work well when you are doing something quite mainstream, where the amount of training data provides a sufficiently strong signal in the noise. But even there they often struggle. For example, I found them somewhat useful for doing Django things, but just as often they gave bullshit code, or it took a lot of back and forth to get something useful out of them.
I think it is embarrassing that, with so much training data, they are still unable to do much more than go by frequency in the training data when suggesting "solutions". They "learn" differently than a human being. When a human sees a new concept, they can often apply it even if it doesn't come up that often, as long as they remember it. But these LLMs seem to deem everything that isn't mainstream irrelevant.
I checked this website along with API pricing on OpenRouter, and this one beats Gemini 2.5 Pro (…Preview-0506 in their chart, but by a good margin, so probably the non-preview too) at half Google's API price. Nice. Admittedly their own posted benchmark, but still. If it even just competes with it, it's a win.
Edit:
I ran my fun test on it and it unfortunately failed.
> ”How can I detect whether a user is running in a RemoteApp context using C# and .NET? That is, not a full RDP desktop session, but a published RemoteApp as if the app is running locally. The reason I’m asking is that we have an unfortunate bug in a third party library that only shows up in this scenario, and needs a specific workaround when it happens.”
It started by trying to read hallucinated environment variables that just aren’t there. Gemini 2.5 Pro had the same issue and IIRC also Claude.
The only one I have seen give the correct answer, which is basically ”You can’t. There’s no official method to do this, and this is intentional on Microsoft’s part”, along with a heuristic of instead determining the root launching process, which so far (but not guaranteed to be) is RDPINIT.EXE rather than EXPLORER.EXE as in typical desktop or RDP scenarios, has been OpenAI o3. o3 also provided additional details about the underlying protocol at play here, which I could confirm with external sources to be correct.
I like my query because it forces the LLM to actually reply that you just can’t do this; there’s no ”sign” of it other than going by a completely different side effect. They are usually too eager to come up with a positive reply and hallucinate in the process. Often there _are_ such env vars to read in cases like these, but not here.
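For reference, here is a rough sketch of that parent-process heuristic (mine, not o3's output, and in Python with psutil rather than the C#/.NET of the original question): walk up the process chain and see whether it was launched by RDPINIT.EXE rather than EXPLORER.EXE. As noted above, this is not official or guaranteed.

```python
# Rough illustration of the heuristic described above (not an official API):
# walk the parent-process chain and check whether it was started by
# rdpinit.exe (RemoteApp) instead of explorer.exe (normal desktop / full RDP).
import psutil

def looks_like_remoteapp() -> bool:
    proc = psutil.Process()
    while proc is not None:
        try:
            name = proc.name().lower()
        except psutil.Error:
            break
        if name == "rdpinit.exe":
            return True
        if name == "explorer.exe":
            return False
        proc = proc.parent()
    return False
```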
Also, docs on running it on a 24GB GPU + 128 to 256GB of RAM here: https://docs.unsloth.ai/basics/qwen3-coder
https://github.com/QwenLM/qwen-code https://github.com/QwenLM/qwen-code/blob/main/LICENSE
I hope these OSS CC clones converge at some point.
[1] https://github.com/All-Hands-AI/OpenHands
How casually we enter the sci-fi era.
Open, small, roughly Sonnet 4-ish if the benchmarks are to be believed, and tool use?
Not quite as good as Claude, but the best Qwen model so far and 2x as fast as qwen3-235b-a22b-07-25.
Specific results for qwen3-coder here: https://llm-benchmark.tinybird.live/models/qwen3-coder
Alibaba Plus: input $1 to $6, output $5 to $60
Alibaba OpenSource: input $1.50 to $4.50, output $7.50 to $22.50
So it doesn't look that cheap compared to Kimi K2 or their non-coder version (Qwen3 235B A22B 2507).
What's more confusing is the "up to" pricing that can supposedly reach $60 for output: with agents it's not that easy to control context.
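To make the "up to" concern concrete, here is a toy cost calculator with hypothetical tier boundaries (the comment doesn't list the actual tiers, and I'm assuming the quoted $5-$60 output range is per million tokens). The point is just that once an agent's context drifts into a higher tier, the effective per-token price jumps.

```python
# Toy illustration with HYPOTHETICAL tier boundaries and the quoted
# $5-$60 output range, assumed to be per million tokens; real tiers may differ.
TIERS = [  # (max prompt tokens for this tier, output $ per 1M tokens)
    (32_000, 5.00),
    (128_000, 15.00),
    (256_000, 30.00),
    (1_000_000, 60.00),
]

def output_cost(prompt_tokens: int, output_tokens: int) -> float:
    for limit, per_million in TIERS:
        if prompt_tokens <= limit:
            return output_tokens / 1_000_000 * per_million
    raise ValueError("prompt exceeds the largest tier")

# The same 50K output tokens cost 12x more once the agent's context is huge:
print(output_cost(20_000, 50_000))   # 0.25  ($5/M tier)
print(output_cost(300_000, 50_000))  # 3.00  ($60/M tier)
```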
I wonder if there's a Python expert that can be isolated.