Qwen3-Max-Thinking

(qwen.ai)

Comments

roughly 26 January 2026
One thing I’m becoming curious about with these models is the token count needed to achieve these results - things like “better reasoning” and “more tool usage” aren’t “model improvements” in what I think would be understood as the colloquial sense, they’re techniques for using the model more to better steer it, and are closer to “spend more to get more” than “get more for less.” They’re still valuable, but they operate on a different economic tradeoff than what I think we’re used to talking about in tech.
torginus 26 January 2026
It just occurred to me that it underperforms Opus 4.5 on benchmarks when search is not enabled, but outperforms it when it is - is it possible that the Chinese internet has better-quality content available?

My problem with deep research tends to be that it just searches the internet, and most of the stuff it turns up is the half-baked garbage that gets repeated on every topic.

isusmelj 26 January 2026
I just wanted to check whether there is any information about the pricing. Is it the same as Qwen Max? Also, I noticed on the pricing page of Alibaba Cloud that the models are significantly cheaper within mainland China. Does anyone know why? https://www.alibabacloud.com/help/en/model-studio/models?spm...
syntaxing 26 January 2026
Hacker News strongly believes Opus 4.5 is the de facto standard and that China has been consistently 8+ months behind. Curious how this performs. It’ll be a big inflection point if it performs as well as its benchmarks.
boutell 27 January 2026
The most important benchmark:

https://boutell.dev/misc/qwen3-max-pelican.svg

I used Simon Willison's usual prompt.

It thought for over 2 minutes (free account). The commentary was even more glowing than the image.

It has a certain charm.

siliconc0w 26 January 2026
I don't see a Hugging Face link. Is Qwen no longer releasing their models?
ezekiel68 26 January 2026
Last autumn I tried Qwen3-Coder via CLI agents like Trae to help add significant advanced features to a Rust codebase. It consistently outperformed (at the time) Gemini 2.5 Pro and Claude Opus 3.5 in its ability to generate and refactor code such that the system stayed coherent while improving performance and efficiency (this included adding Linux shared-memory IPC calls and using x86_64 SIMD intrinsics in Rust).

I was very impressed, but I racked up a big bill (for me, in the hundreds of dollars per month) because I insisted on using the Alibaba provider to get the highest context window size and token cache.

mohsen1 26 January 2026
Is this available on OpenRouter yet? I want it to go head-to-head against Gemini 3 Flash, which is the king of playing Mafia so far

https://mafia-arena.com

arendtio 26 January 2026
> By scaling up model parameters and leveraging substantial computational resources

So, how large is that new model?

throwaw12 26 January 2026
Aghhh, in my earlier comments I wished they'd release a model which outperforms Opus 4.5 in agentic coding; seems I should wait more. But I am hopeful
deepakkumarb 27 January 2026
I get that these approaches work, and they’re totally valid engineering trade-offs. But I don’t think they’re the same thing as real model improvements. If we’re just throwing more tokens, longer chains of thought, or extra tools at the problem, that feels more like brute force than genuine progress.

And that distinction matters in practice. If getting slightly better answers means using 5–10× more tokens or a bunch of external calls, the costs add up fast. That doesn’t scale well in the real world. It’s hard to call something a breakthrough when quality goes up but the bill and latency go up just as much.

I also think we should be careful about reading too much into benchmarks. A lot of them reward clever prompting and tool orchestration more than actual general intelligence. Once you factor in reliability, speed, and cost, the story often looks less impressive.
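
To make the trade-off concrete, here's a rough back-of-the-envelope sketch in Python (the per-million-token prices and token counts are made-up placeholders, not any provider's actual rates):

  # Rough cost comparison: a "smarter" answer bought with more tokens
  # (longer chain of thought, extra tool calls) vs. the baseline answer.
  # All numbers are hypothetical placeholders for illustration only.
  PRICE_PER_M_INPUT = 1.20    # USD per million input tokens (placeholder)
  PRICE_PER_M_OUTPUT = 6.00   # USD per million output tokens (placeholder)

  def request_cost(input_tokens: int, output_tokens: int) -> float:
      """Cost of one request in USD."""
      return (input_tokens / 1e6) * PRICE_PER_M_INPUT \
           + (output_tokens / 1e6) * PRICE_PER_M_OUTPUT

  baseline = request_cost(input_tokens=2_000, output_tokens=1_000)
  # "Thinking" run: same prompt, ~8x the output tokens spent on reasoning,
  # plus three tool-call round trips that re-send context as input.
  thinking = request_cost(input_tokens=2_000 + 3 * 4_000, output_tokens=8_000)

  print(f"baseline: ${baseline:.4f} per request")
  print(f"thinking: ${thinking:.4f} per request ({thinking / baseline:.1f}x)")

If the answer is slightly better but every request costs several times as much, that's exactly the "spend more to get more" trade-off rather than a free lunch.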

DeathArrow 26 January 2026
Mandatory pelican on bicycle: https://www.svgviewer.dev/s/U6nJNr1Z
Alifatisk 26 January 2026
Can't wait for the benchmark at Artificial Analysis. The Qwen team doesn't seem to have updated the information about this new model yet https://chat.qwen.ai/settings/model. I tried getting an API key from Alibaba Cloud, but the number of steps from creating an account made me stop - it was too much. It shouldn't be this difficult.

Incredible work anyways!
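
For anyone who does make it through the signup, the call itself is short. A minimal sketch using the OpenAI-compatible Model Studio endpoint - the base URL and model id below are assumptions from Alibaba Cloud's docs, so verify them in your own console before relying on this:

  # Minimal sketch of calling Qwen through Alibaba Cloud Model Studio's
  # OpenAI-compatible endpoint. Base URL and model id are assumptions;
  # check the console/docs for the exact values for the thinking model.
  import os
  from openai import OpenAI

  client = OpenAI(
      api_key=os.environ["DASHSCOPE_API_KEY"],  # key created in the Model Studio console
      base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",  # assumed intl endpoint
  )

  response = client.chat.completions.create(
      model="qwen3-max",  # hypothetical id; the thinking variant may be named differently
      messages=[{"role": "user", "content": "Summarize the trade-offs of TCP vs UDP."}],
  )
  print(response.choices[0].message.content)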

ytrt54e 26 January 2026
I cannot even open the page; maybe I am blacklisted for asking about Tiananmen Square when their AI first hit the news?
gcr 26 January 2026
Is there an open-source release accompanying this announcement or is this a proprietary model for the time being?
treefry 26 January 2026
Are they likely adopting a new strategy of no longer open-sourcing their largest and strongest models?
Mashimo 26 January 2026
I tried to search but could not find anything. Do they offer subscriptions, or only pay per token?
pier25 26 January 2026
Tried it and it's super slow compared to other LLMs.

I imagine the Alibaba infra is being hammered hard.

dajonker 26 January 2026
These LLM benchmarks are like interviews for software engineers. They get drilled on advanced algorithms for distributed computing and they ace the questions. But then it turns out that the job is to add a button to the user interface, and the model uses new Tailwind classes instead of reusing the existing ones, so it is just not quite right.
jbverschoor 26 January 2026
"As of January 2026, Apple has not released an iPhone 17 series. Apple typically announces new iPhones in September each year, so the iPhone 17 series would not be available until at least September 2025 (and we're currently in January 2026). The most recent available models would be the iPhone 16 series."

Hmmmm ok

lysace 26 January 2026
I tried it at https://chat.qwen.ai/.

Prompt: "What happened on Tiananmen square in 1989?"

Reply: "Oops! There was an issue connecting to Qwen3-Max. Content Security Warning: The input text data may contain inappropriate content."

igravious 26 January 2026
The title of the article is: “Pushing Qwen3-Max-Thinking Beyond its Limits”
ndom91 26 January 2026
Not released on Huggingface? :sadge:
elinear 26 January 2026
Benchmarks pasted here, with top scores highlighted. Overall Qwen Max is pretty competitive with the others here.

  Capability                            Benchmark           GPT-5.2-Thinking   Claude-Opus-4.5   Gemini 3 Pro   DeepSeek V3.2   Qwen3-Max-Thinking
  Knowledge                             MMLUPro             87.4               89.5              *89.8*         85.0            85.7            
  Knowledge                             MMLURedux           95.0               95.6              *95.9*         94.5            92.8            
  Knowledge                             CEval               90.5               92.2              93.4           92.9            *93.7*      
  STEM                                  GPQA                *92.4*             87.0              91.9           82.4            87.4           
  STEM                                  HLE                 35.5               30.8              *37.5*         25.1            30.2           
  Reasoning                             LiveCodeBench v6    87.7               84.8              *90.7*         80.8            85.9           
  Reasoning                             HMMT Feb 25         *99.4*             -                 97.5           92.5            98.0            
  Reasoning                             HMMT Nov 25         -                  -                 93.3           90.2            *94.7*      
  Reasoning                             IMOAnswerBench      *86.3*             84.0              83.3           78.3            83.9           
  Agentic Coding                        SWE Verified        80.0               *80.9*            76.2           73.1            75.3           
  Agentic Search                        HLE (w/ tools)      45.5               43.2              45.8           40.8            *49.8*     
  Instruction Following & Alignment     IFBench             *75.4*             58.0              70.4           60.7            70.9           
  Instruction Following & Alignment     MultiChallenge      57.9               54.2              *64.2*         47.3            63.3           
  Instruction Following & Alignment     ArenaHard v2        80.6               76.7              81.7           66.5            *90.2*      
  Tool Use                              Tau² Bench          80.9               *85.7*            85.4           80.3            82.1           
  Tool Use                              BFCLV4              63.1               *77.5*            72.5           61.2            67.7            
  Tool Use                              Vita Bench          38.2               *56.3*            51.6           44.1            40.9           
  Tool Use                              Deep Planning       *44.6*             33.9              23.3           21.6            28.7           
  Long Context                          AALCR               72.7               *74.0*            70.7           65.0            68.7
pmarreck 26 January 2026
I asked it about "Chinese cultural dishonesty" (such as the 2019 wallet experiment, but wait for it...) and it probably had the most fascinating and subtle explanation of it I've ever read. It was clearly informed by Chinese-language sources (which in this case was good... references to Confucianism etc.) and I have to say that this is the first time I feel more enlightened about what some Westerners may perceive as a real problem.

I wasn't logged in so I don't have the ability to link to the conversation but I'm exporting it for my records.

diblasio 26 January 2026
[flagged]
sacha1bu 27 January 2026
Great to see reasoning taken seriously — Qwen3-Max-Thinking exposing explicit reasoning steps and scoring 100% on tough benchmarks is a big deal for complex problem solving. Looking forward to seeing how this changes real-world coding and logic tasks.
airstrike 26 January 2026
2026 will be the year of open and/or small models.
sciencesama 26 January 2026
What RAM and what minimum system requirements do you need to run this on a personal system?
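
For reference, the usual back-of-the-envelope for open-weight models is weights = parameters × bytes per weight, plus headroom for the KV cache - though the parameter count here hasn't been published and the model looks API-only so far. A sketch with made-up model sizes:

  # Rough RAM/VRAM estimate for running an open-weight LLM locally.
  # The parameter counts below are hypothetical examples, not Qwen3-Max's
  # (its size hasn't been published and the model is currently API-only).
  def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
      """Memory for the weights alone, ignoring KV cache and activations."""
      return params_billion * 1e9 * bits_per_weight / 8 / 1e9

  for params in (8, 70, 480):      # hypothetical sizes, in billions of parameters
      for bits in (16, 8, 4):      # fp16/bf16, int8, 4-bit quantization
          print(f"{params:>4}B @ {bits:>2}-bit: ~{weight_memory_gb(params, bits):.0f} GB")
  # Rule of thumb: add roughly 10-20% on top for KV cache and runtime overhead.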
3ds 27 January 2026
What is the tiananmen massacre?

> Oops! There was an issue connecting to Qwen3-Max.

> Content Security Warning: The input text data may contain inappropriate content.

xcodevn 26 January 2026
I'm not familiar with these open-source models. My bias is that they're heavily benchmaxxing and not really helpful in practice. Can someone with a lot of experience using these, as well as Claude Opus 4.5 or Codex 5.2 models, confirm whether they're actually on the same level? Or are they not that useful in practice?

P.S. I realize Qwen3-Max-Thinking isn't actually an open-weight model (only accessible via API), but I'm still curious how it compares.