VibeThinker: 3B param model that beats Opus 4.5 on reasoning with novel SFT+GRPO

(arxiv.org)

Comments

secretslol 16 hours ago
Am I right in thinking this is a tiny model which has been trained well to reason, and that's it? Makes me think of a smart person who doesn't know anything about a given topic, but with the right tools will go and research the heck out of it. I really like the sound of this... why have models train on learning anything when you can just train them how to learn and let them get on with it from something as small as a Pi Zero and an internet connection.
rbbydotdev 9 hours ago
Looks like we are seeing small but mighty model breakthroughs, outpacing the pure capital firepower of SOTA providers. I love rooting for the little guy, but is it too soon to call it? To play devil's advocate, could it just be that the benchmarks are not efficient enough to capture the success of real developer workflows?
troglodytetrain 31 minutes ago
Sounds like something that could be pretty useful as a 'validation' subagent. Provide it the details/context related to a larger LLM's run or turn in a harness and have it act as a gatekeeper. At this size and speed it looks like it could be economical to have it run every turn or even every tool call and inform the main agent about the result and success/failure.
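A minimal sketch of the gatekeeper idea described above, assuming the small model is reachable as a plain text-in, text-out callable. The names `Verdict`, `gate_tool_call` and the `check` parameter are hypothetical, and nothing here relies on the model supporting tool calling itself: it is only asked for a one-word verdict, which the harness parses leniently and treats as a failure when unclear.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    ok: bool
    reason: str


def gate_tool_call(check: Callable[[str], str], tool_name: str, tool_output: str) -> Verdict:
    """Ask a small checker model whether a tool call succeeded.

    `check` is any function that sends a prompt to the checker and returns
    its raw text reply (hypothetical; wire it to whatever serves the model).
    """
    prompt = (
        f"A larger agent just ran the tool '{tool_name}'.\n"
        f"Tool output:\n{tool_output}\n\n"
        "Reply with PASS or FAIL on the first line, then one sentence of justification."
    )
    reply = check(prompt).strip()
    first_line = reply.splitlines()[0] if reply else ""
    # Fail closed: anything that does not clearly start with PASS counts as a failure.
    ok = first_line.upper().startswith("PASS")
    return Verdict(ok=ok, reason=reply)
```

The main agent would call `gate_tool_call` after each tool call (or each turn) and feed `Verdict.reason` back into its own context when `ok` is false.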
deftio 16 hours ago
There is some base level of intelligence any model needs to be useful, even in narrow tasks.

Could you teach a 5 year old to drive a car? A 10 year old? A 12 year old? To drive a car requires being able to read, to have judgement about ice or rainy conditions, to anticipate a child running after a ball. By the time a human is in their mid teens they have acquired the base knowledge...

Small models need to have enough base knowledge to be able to be good enough -- even in a seemingly narrow regime. Where is that? Obviously they don't need all the obscure knowledge of a frontier model but there is some base level which is probably more than it would first seem.

mvitorino 55 minutes ago
Really enjoying seeing these really capable SLMs. Note that on HF they state: "This model was not trained on tool-calling or agent-based programming data. We therefore do not recommend using it for tasks that involve function calling, API orchestration, or autonomous coding agents." - https://huggingface.co/WeiboAI/VibeThinker-3B So we can't just hook it up to a coding harness like pi.dev or something.
gslepak 16 hours ago
Note that these are Python-only results, the model will not do as well with other languages.

I'm glad to see more domain-focused SLMs, we need more of them! A programming focused MoE should work well across many languages.

NotSuspicious 15 hours ago
The interesting thing about models this small is they should be able to be put on a single Taalas chip (the HC1 already runs a Llama 3.1 8B model). We're already at the point where half-decent reasoning could be run on an ASIC (and at mind-boggling speeds).
noperator 17 hours ago
Having some success while testing this model out as a replacement for GPT-5 nano in source code security review. Running on RTX 3090 (24 GB VRAM) via vLLM. It's not great on structured output (as noted in the model card) but I'm working around that in my harness.
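One way a harness might work around weak structured output is to let the model answer freely and then pull the first parseable JSON object out of the reply. This is only a sketch of that general approach, not the commenter's actual harness; the function name `extract_first_json_object` and the sample reply are made up for illustration.

```python
import json


def extract_first_json_object(text):
    """Return the first parseable JSON object embedded in text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever trails after it.
            obj, _end = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass  # not valid JSON at this brace; try the next one
        start = text.find("{", start + 1)
    return None


# Hypothetical model reply mixing reasoning text with a JSON finding.
reply = 'Thinking {not json} ... Final: {"severity": "high", "line": 42} Done.'
finding = extract_first_json_object(reply)
```

If nothing parses, the harness can re-prompt or record the turn as a failed review rather than crash.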
delis-thumbs-7e 1 hour ago
I gave this a run on llama.cpp locally. My GPU is Ge1080, so I needed a quantized version for even such a small model and…

This. Is. Amazing. I am flabbergasted.

I am not into the whole GenAI thing and I have very little need for anything agentic, but Python, C++ and Maths are exactly what I mostly used these for, so this might actually become my main workhorse. This is so cool.

I even used it for stuff it is not built for, asking complex questions on history (Battle of Tours 732) and literature (Joyce’s Portrait of Artist) and it was surprisingly good, even though it started to hallucinate names and details (such as claiming Joyce’s father was a priest). For 3B I expected it to mainly spout complete nonsense.

tracerbulletx 1 hour ago
Man just need something like this with tool calling.
yousif_123123 6 hours ago
I really hope that in a couple of years I can have a laptop that runs a reasonably good coding agent locally, that I can run fast and do most of my programming with, without running my laptop hot. I could keep open code and use other models when needed, but really for most of my work, I'm already breaking it down so that I can review code changes eventually, and I just need something reasonably decent and fast and unlimited. I think it's coming.
sorenjan 7 hours ago
How would you best utilize a model like this for coding? I take it it's not meant for vibe coding a full app, and the reasoning probably makes it unsuitable for autocomplete. Would you use it to implement specific functions? I looked at one of the coding benchmarks used, Live Code Bench, and it seems to be problem descriptions with sample input and output, and then a solution with a single function or class.

Seems like a really good model to use in an IDE when you still want control over the code structure then.

aero2146 17 hours ago
I tried generating the classic pelican svg, but it failed horribly just showing me a rectangle and a black circle...
nolist_policy 5 hours ago
Notable:

  VibeThinker-3B is developed through a staged post-training pipeline built upon Qwen2.5-Coder-3B base, a compact 3B foundation model.
Qwen2.5 is ancient by LLM standards.
virajk_31 8 hours ago
An SLM trained for a single use case often beats an LLM. That's both the advantage and the limitation.
achrono 5 hours ago
Beats Opus 4.5 on reasoning you say?

Prompt: If A goes to B who then goes to C, can A send something to C?

Response:

We need to interpret best. The phrase "If A goes to B who then goes to C, can A send something to C?" could be a puzzle about the concept of sending something (like passing a ball) and the relationships.

Scenario: A gives something to B, and B passes it on to C. Question: Can A also give the same thing to C? Answer: Only if A can obtain a second copy (e.g., the thing was duplicated). Otherwise, after handing it to B, A no longer holds it and cannot “send” it unless a copy exists.

[Lots of other unnecessary commentary and "scenarios" that make even less sense]

androiddrew 9 hours ago
I have been thinking about how to use this. Since it doesn’t support tool calling I have been considering a dual model deployment, where a small tool calling llm drives the majority of the user experience, and vibe thinker is tapped for reasoning by the other llm.

So who has suggestions on small models with excellent tool calling capabilities?

andai 3 hours ago
I tried actually talking to it. It reminded me of GPT-2.
SwellJoe 16 hours ago
It's terrible at hunting security bugs (I expected it to be, but I wanted to be sure). I added it to a benchmark I made with a corpus of some Mythos-discovered bugs, and it found zero. The smallest pretty successful models remain Qwen 3.6 and Gemma 4 (but I haven't tested the very small variants of those yet).

https://swelljoe.com/post/will-it-mythos/

iamgopal 7 hours ago
Two models: one optimised for system, reasoning, etc., and a second optimised for a specific language (Rust or Go?), both small enough to run on a local computer. Would that work?
jpcompartir 6 hours ago
The absolute worst name for a model I've seen
brainless 12 hours ago
I recently came across this model and I would love to try it with my coding agent soon.

I really like the idea of small models that can reason but do not have too much knowledge. Also, no emphasis on tool calls. I think the agent should do the heavy lifting and reach half way.

I use really small models, like Qwen 3.5 0.8B to 9B - no tool calling, no MCP, no skills, nothing. No multi-turn chat even. Models are given very specific tasks using a vast number of system prompts and all the response handling is done in the agent(s).

https://github.com/brainless/nocodo

makethembroke 2 hours ago
I don't get this beating Opus. It just hardcoded the tasks for the benchmark. It doesn't even respond normally.

A lot of randomness in it.

Please don't hype

cold_harbor 7 hours ago
GRPO skips the value network that makes PPO expensive: it scores candidates relative to each other within a group. That's what makes verifiable-reward training practical at 3B scale.
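To make the "relative to each other within a group" part concrete, here is a minimal sketch of the group-relative advantage as it is usually described for GRPO: each sampled answer's reward is normalized by the mean and standard deviation of its own group, so no learned value network is needed as a baseline. This is the generic formulation, not necessarily the exact recipe used for VibeThinker; the function name, the `eps` constant, and the sample rewards are illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every candidate gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt, scored by a verifier
# (1.0 = passes the checks, 0.0 = fails). Values are made up.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Correct answers end up with positive advantages and incorrect ones with negative advantages, and a group where every answer scores the same contributes no learning signal.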
uberex 9 hours ago
What is the idiot's guide to running this one locally now?
unfirehose 8 hours ago
This is a good model. I benchmarked its reasoned answers against Qwen 3.6 27B (no think) + bash and it held up.
diimdeep 5 hours ago
BF16 with no QAT quants == half-baked bread
scotty79 12 hours ago
If you could pair it somehow with a model that can code and describe code this could be a very powerful combo.
anonyfox 13 hours ago
Wake me up when it does OCaml fine.
4gotunameagain 7 hours ago
What are the implications of local SOTA inference, given the insane datacenter "investing" ?

It surely cannot be justified only for training at this scale, especially since models nowadays are improved more and more by fine-tuning than by re-training from scratch.

Will a viable local model crash the US economy ?

More importantly, are the LLM companies aware, and are they deliberately buying out all the RAM and GPUs in order to prolong the inevitable ? Probably not, but I wouldn't be surprised if that is the case.

maxignol 11 hours ago
3B params on par with Opus 4.5 sounds interesting. Will read the full article before making up my mind.
zkmon 14 hours ago
Does python coding depend on political facts of the world?

It might appear not, but actually, the process of reasoning is not an isolated act. The right and wrong way of doing things is codified in social evolution that absorbed all facets of life. Why should you optimize a piece of code for performance? Why is performance needed? What is a bug? What features and UI themes would be more intuitive for humans?

There is a butterfly effect. Everything affects everything to some extent.

kmchandy 5 hours ago
The paper makes a clear claim: "it provides an important and concrete proof: on well-constrained, verifiable reasoning tasks, first-tier performance is no longer the exclusive domain of ultra-large models" And that's exciting.