Qwen3-Max-Thinking

(qwen.ai)

Comments

boutell 1 hour ago
The most important benchmark:

https://boutell.dev/misc/qwen3-max-pelican.svg

I used Simon Willison's usual prompt.

It thought for over 2 minutes (free account). The commentary was even more glowing than the image.

It has a certain charm.

roughly 21 hours ago
One thing I’m becoming curious about with these models is the token count needed to achieve these results - things like “better reasoning” and “more tool usage” aren’t “model improvements” in the colloquial sense; they’re techniques for using the model more in order to better steer it, and are closer to “spend more to get more” than “get more for less.” They’re still valuable, but they operate on a different economic tradeoff than the one we’re used to talking about in tech.
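
To make that tradeoff concrete, here is a back-of-the-envelope sketch; every price and token count below is invented for illustration, not a published Qwen3-Max-Thinking number:

  # Hypothetical cost comparison. All prices and token counts here
  # are made up for illustration, not published Qwen numbers.
  PRICE_IN = 1.20   # assumed $ per 1M input tokens
  PRICE_OUT = 6.00  # assumed $ per 1M output tokens

  def cost(input_tokens, output_tokens):
      return (input_tokens * PRICE_IN + output_tokens * PRICE_OUT) / 1e6

  direct = cost(2_000, 500)             # plain answer
  thinking = cost(2_000, 500 + 15_000)  # same answer plus a reasoning trace
  print(f"direct: ${direct:.4f}  thinking: ${thinking:.4f}  "
        f"ratio: {thinking / direct:.1f}x")
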
torginus 22 hours ago
It just occured to me that it underperforms Opus 4.5 on benchmarks when search is not enabled, but outperforms it when it is - is it possible the the Chinese internet has better quality content available?

My problem with deep research tends to be that it just searches the internet, and most of what it turns up is the half-baked garbage that gets repeated on every topic.

isusmelj 23 hours ago
I just wanted to check whether there is any information about the pricing. Is it the same as Qwen Max? Also, I noticed on the pricing page of Alibaba Cloud that the models are significantly cheaper within mainland China. Does anyone know why? https://www.alibabacloud.com/help/en/model-studio/models?spm...
syntaxing 21 hours ago
Hacker News strongly believes Opus 4.5 is the de facto standard and that China is consistently 8+ months behind. Curious how this performs. It’ll be a big inflection point if it performs as well as its benchmarks suggest.
siliconc0w 23 hours ago
I don't see a Hugging Face link. Is Qwen no longer releasing their models?
ezekiel68 16 hours ago
Last autumn I tried Qwen3-Coder via CLI agents like Trae to help add significant advanced features to a Rust codebase. It consistently outperformed (at the time) Gemini 2.5 Pro and Claude Opus 3.5 with its ability to generate and refactor code such that the system stayed coherent while performance and efficiency improved (this included adding Linux shared-memory IPC calls and using x86_64 SIMD intrinsics in Rust).

I was very impressed, but I racked up a big bill (for me, in the hundreds of dollars per month) because I insisted on using the Alibaba provider to get the highest context window size and token cache.

mohsen1 21 hours ago
Is this available on OpenRouter yet? I want it to go head-to-head against Gemini 3 Flash, which is the king of playing Mafia so far

https://mafia-arena.com
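
One way to check, assuming OpenRouter's public model-list endpoint (the eventual slug for Qwen3-Max-Thinking is unknown, hence the broad match):

  # List OpenRouter's public model catalog and filter for Qwen entries.
  # No API key is needed for this endpoint; the slug Qwen3-Max-Thinking
  # would appear under (if/when listed) is a guess, so match broadly.
  import requests

  models = requests.get("https://openrouter.ai/api/v1/models").json()["data"]
  for m in models:
      if "qwen" in m["id"].lower():
          print(m["id"])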

arendtio 23 hours ago
> By scaling up model parameters and leveraging substantial computational resources

So, how large is that new model?

throwaw12 23 hours ago
Aghhh, in my earlier comments I wished they'd release a model that outperforms Opus 4.5 in agentic coding; seems I should wait longer. But I am hopeful.
Alifatisk 19 hours ago
Can't wait for the benchmark at Artificial Analysis. The Qwen team doesn't seem to have updated the information about this new model yet: https://chat.qwen.ai/settings/model. I tried getting an API key from Alibaba Cloud, but the number of steps from creating an account made me stop; it was too much. It shouldn't be this difficult.

Incredible work anyways!
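
For anyone who does get through the signup, the API is OpenAI-compatible; a minimal sketch, where both the base URL and the "qwen3-max" model id are assumptions taken from the Model Studio docs:

  # Minimal call through Alibaba Cloud Model Studio's OpenAI-compatible
  # endpoint. The base URL and the "qwen3-max" model id are assumptions
  # from the docs; verify both for your region and account.
  from openai import OpenAI

  client = OpenAI(
      api_key="YOUR_MODEL_STUDIO_API_KEY",  # from the console
      base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
  )
  resp = client.chat.completions.create(
      model="qwen3-max",  # assumed id; check the console's model list
      messages=[{"role": "user", "content": "Hello"}],
  )
  print(resp.choices[0].message.content)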

gcr 17 hours ago
Is there an open-source release accompanying this announcement or is this a proprietary model for the time being?
dajonker 17 hours ago
These LLM benchmarks are like interviews for software engineers. They get drilled on advanced algorithms for distributed computing and they ace the questions. But then it turns out that the job is to add a button to the user interface, and the model uses new Tailwind classes instead of reusing the existing ones, so it's just not quite right.
ytrt54e 21 hours ago
I cannot even open the page; maybe I am blacklisted for asking about Tiananmen Square when their AI first hit the news?
treefry 21 hours ago
Are they taking a new strategy of no longer open-sourcing their largest and strongest models?
jbverschoor 18 hours ago
"As of January 2026, Apple has not released an iPhone 17 series. Apple typically announces new iPhones in September each year, so the iPhone 17 series would not be available until at least September 2025 (and we're currently in January 2026). The most recent available models would be the iPhone 16 series."

Hmmmm ok

pier25 22 hours ago
Tried it and it's super slow compared to other LLMs.

I imagine the Alibaba infra is being hammered hard.

Mashimo 23 hours ago
I tried to search but could not find anything. Do they offer subscriptions, or only pay-per-token?
ndom91 19 hours ago
Not released on Huggingface? :sadge:
elinear 20 hours ago
Benchmarks pasted here, with top scores highlighted. Overall, Qwen3-Max-Thinking is pretty competitive with the others here.

  Capability                            Benchmark           GPT-5.2-Thinking   Claude-Opus-4.5   Gemini 3 Pro   DeepSeek V3.2   Qwen3-Max-Thinking
  Knowledge                             MMLUPro             87.4               89.5              *89.8*         85.0            85.7            
  Knowledge                             MMLURedux           95.0               95.6              *95.9*         94.5            92.8            
  Knowledge                             CEval               90.5               92.2              93.4           92.9            *93.7*      
  STEM                                  GPQA                *92.4*             87.0              91.9           82.4            87.4           
  STEM                                  HLE                 35.5               30.8              *37.5*         25.1            30.2           
  Reasoning                             LiveCodeBench v6    87.7               84.8              *90.7*         80.8            85.9           
  Reasoning                             HMMT Feb 25         *99.4*             -                 97.5           92.5            98.0            
  Reasoning                             HMMT Nov 25         -                  -                 93.3           90.2            *94.7*      
  Reasoning                             IMOAnswerBench      *86.3*             84.0              83.3           78.3            83.9           
  Agentic Coding                        SWE Verified        80.0               *80.9*            76.2           73.1            75.3           
  Agentic Search                        HLE (w/ tools)      45.5               43.2              45.8           40.8            *49.8*     
  Instruction Following & Alignment     IFBench             *75.4*             58.0              70.4           60.7            70.9           
  Instruction Following & Alignment     MultiChallenge      57.9               54.2              *64.2*         47.3            63.3           
  Instruction Following & Alignment     ArenaHard v2        80.6               76.7              81.7           66.5            *90.2*      
  Tool Use                              Tau² Bench          80.9               *85.7*            85.4           80.3            82.1           
  Tool Use                              BFCLV4              63.1               *77.5*            72.5           61.2            67.7            
  Tool Use                              Vita Bench          38.2               *56.3*            51.6           44.1            40.9           
  Tool Use                              Deep Planning       *44.6*             33.9              23.3           21.6            28.7           
  Long Context                          AALCR               72.7               *74.0*            70.7           65.0            68.7
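
For anyone who wants to double-check the highlighting, a minimal sketch that recomputes the per-benchmark leader from two rows transcribed above (scores as posted in this thread, not independently verified):

  # Recompute per-benchmark leaders for two rows transcribed from the
  # table above (scores as posted here, not re-verified at the source).
  rows = {
      "ArenaHard v2":   {"GPT-5.2-Thinking": 80.6, "Claude-Opus-4.5": 76.7,
                         "Gemini 3 Pro": 81.7, "DeepSeek V3.2": 66.5,
                         "Qwen3-Max-Thinking": 90.2},
      "HLE (w/ tools)": {"GPT-5.2-Thinking": 45.5, "Claude-Opus-4.5": 43.2,
                         "Gemini 3 Pro": 45.8, "DeepSeek V3.2": 40.8,
                         "Qwen3-Max-Thinking": 49.8},
  }
  for bench, scores in rows.items():
      leader = max(scores, key=scores.get)
      print(f"{bench}: {leader} ({scores[leader]})")
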
DeathArrow 23 hours ago
Mandatory pelican on bicycle: https://www.svgviewer.dev/s/U6nJNr1Z
pmarreck 21 hours ago
I asked it about "Chinese cultural dishonesty" (such as the 2019 wallet experiment, but wait for it...) and it probably had the most fascinating and subtle explanation of it I've ever read. It was clearly informed by Chinese-language sources (which in this case was good... references to Confucianism etc.) and I have to say that this is the first time I feel more enlightened about what some Westerners may perceive as a real problem.

I wasn't logged in so I don't have the ability to link to the conversation but I'm exporting it for my records.

sacha1bu 9 hours ago
Great to see reasoning taken seriously — Qwen3-Max-Thinking exposing explicit reasoning steps and scoring 100% on tough benchmarks is a big deal for complex problem solving. Looking forward to seeing how this changes real-world coding and logic tasks.
airstrike 23 hours ago
2026 will be the year of open and/or small models.
igravious 17 hours ago
The title of the article is: “Pushing Qwen3-Max-Thinking Beyond its Limits”
lysace 22 hours ago
I tried it at https://chat.qwen.ai/.

Prompt: "What happened on Tiananmen square in 1989?"

Reply: "Oops! There was an issue connecting to Qwen3-Max. Content Security Warning: The input text data may contain inappropriate content."

sciencesama 22 hours ago
What RAM and what minimum system requirements do you need to run this on a personal system?
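
For a sense of scale, a back-of-the-envelope sketch; the parameter count has not been disclosed, so the 1T figure below is a pure placeholder:

  # Rough weights-only memory estimate. Alibaba has not disclosed the
  # parameter count, so 1T dense parameters is a pure placeholder.
  params = 1_000_000_000_000  # hypothetical
  for name, bytes_per_param in [("fp16", 2), ("fp8", 1), ("int4", 0.5)]:
      gb = params * bytes_per_param / 1e9
      print(f"{name}: ~{gb:,.0f} GB for weights alone (plus KV cache)")
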
xcodevn 22 hours ago
I'm not familiar with these open-source models. My bias is that they're heavily benchmaxxing and not really helpful in practice. Can someone with a lot of experience using these, as well as Claude Opus 4.5 or Codex 5.2 models, confirm whether they're actually on the same level? Or are they not that useful in practice?

P.S. I realize Qwen3-Max-Thinking isn't actually an open-weight model (only accessible via API), but I'm still curious how it compares.

3ds 4 hours ago
What is the tiananmen massacre?

> Oops! There was an issue connecting to Qwen3-Max.

> Content Security Warning: The input text data may contain inappropriate content.