VibeThinker: 3B param model that beats Opus 4.5 on reasoning with novel SFT+GRPO

362 points by timhigins 18 hours ago | 188 comments

Comments

secretslol 16 hours ago

Am I right in thinking this is a tiny model which has been trained well to reason, and that's it? Makes me think of a smart person who doesn't know anything about a given topic, but with the right tools will go and research the heck out of it. I really like the sound of this... why have models train on learning anything when you can just train them how to learn and let them get on with it from something as small as a Pi Zero and an internet connection.

rbbydotdev 9 hours ago

Looks like we are seeing small but mighty model breakthroughs, outpacing the pure capital firepower of SOTA providers. I love rooting for the little guy, but is it too soon to call it? To play devil's advocate, could it just be that the benchmarks are not efficient enough to capture the success of real developer workflows?

troglodytetrain 31 minutes ago

Sounds like something that could be pretty useful as a 'validation' subagent. Provide it the details/context related to a larger LLM's run or turn in a harness and have it act as a gatekeeper. At this size and speed it looks like it could be economical to have it run every turn or even every tool call and inform the main agent about the result and success/failure.
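
A minimal sketch of that gatekeeper idea, assuming the small model is reachable as a plain text-in, text-out callable. `gate_tool_call`, `Verdict`, and `fake_checker` are hypothetical names for illustration, and the stub stands in for a real model call. The checker only returns free text, so it does not rely on the model doing any tool calling itself.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    ok: bool
    reason: str


def gate_tool_call(check: Callable[[str], str], tool: str, args: str, result: str) -> Verdict:
    """Ask a small checker model whether a single tool call succeeded."""
    prompt = (
        "Did this tool call succeed? Answer PASS or FAIL on the first line, "
        "then give one sentence of explanation.\n"
        f"tool: {tool}\nargs: {args}\nresult: {result}\n"
    )
    reply = check(prompt).strip()
    # First line carries the verdict, the rest is the explanation.
    head, _, rest = reply.partition("\n")
    ok = head.upper().startswith("PASS")
    return Verdict(ok=ok, reason=rest.strip() or head)


# Stand-in for the real model: flags any result that contains a traceback.
def fake_checker(prompt: str) -> str:
    if "Traceback" in prompt:
        return "FAIL\noutput contains a traceback"
    return "PASS\nno error markers found"


verdict = gate_tool_call(
    fake_checker, "run_tests", "pytest -q", "Traceback (most recent call last): ..."
)
```

The main agent would read `verdict.ok` and `verdict.reason` after every tool call and decide whether to retry.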

deftio 16 hours ago

There is some base level of intelligence any model needs to be useful, even in narrow tasks.

Could you teach a 5 year old to drive a car? A 10 year old? A 12 year old? To drive a car requires being able to read, to have judgement about ice or rainy conditions, to anticipate a child running after a ball. By the time a human is in their mid teens they have acquired the base knowledge...

Small models need to have enough base knowledge to be able to be good enough -- even in a seemingly narrow regime. Where is that? Obviously they don't need all the obscure knowledge of a frontier model but there is some base level which is probably more than it would first seem.

mvitorino 55 minutes ago

Really enjoying seeing these really capable SLMs. Note that on HF they state: "This model was not trained on tool-calling or agent-based programming data. We therefore do not recommend using it for tasks that involve function calling, API orchestration, or autonomous coding agents." - https://huggingface.co/WeiboAI/VibeThinker-3B So we can't just hook it up to a coding harness like pi.dev or something.

gslepak 16 hours ago

Note that these are Python-only results, the model will not do as well with other languages.

I'm glad to see more domain-focused SLMs, we need more of them! A programming focused MoE should work well across many languages.

NotSuspicious 15 hours ago

The interesting thing about models this small is they should be able to be put on a single Taalas chip (the HC1 already runs a Llama 3.1 8B model). We're already at the point where half-decent reasoning could be run on an ASIC (and at mind-boggling speeds).

noperator 17 hours ago

Having some success while testing this model out as a replacement for GPT-5 nano in source code security review. Running on RTX 3090 (24 GB VRAM) via vLLM. It's not great on structured output (as noted in the model card) but I'm working around that in my harness.
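
One possible shape for such a workaround (not necessarily what this harness does): let the model answer in free text and pull the first decodable JSON object out of the reply. `extract_json_object` is a hypothetical helper; it uses only `json.JSONDecoder.raw_decode` from the standard library.

```python
import json


def extract_json_object(text: str):
    """Return the first decodable JSON object found in text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores whatever text follows it.
            obj, _ = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass  # not valid JSON at this brace, try the next one
        start = text.find("{", start + 1)
    return None


reply = 'The check on {user_input} is missing.\n{"severity": "high", "line": 42}\nDone.'
finding = extract_json_object(reply)
```

If nothing parses, the function returns `None` and the harness can re-prompt or fall back to treating the reply as plain text.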

delis-thumbs-7e 1 hour ago

I gave this a run on llama.cpp locally. My GPU is a GTX 1080, so I needed a quantized version even for such a small model and…

This. Is. Amazing. I am flabbergasted.

I am not into the whole GenAI thing and I have very little need for anything agentic, but Python, C++ and Maths are exactly what I mostly used these for, so this might actually become my main workhorse. This is so cool.

I even used it for stuff it is not built for, asking complex questions on history (Battle of Tours 732) and literature (Joyce’s Portrait of Artist) and it was surprisingly good, even though it started to hallucinate names and details (such as claiming Joyce’s father was a priest). For 3B I expected it to mainly spout complete nonsense.

tracerbulletx 1 hour ago

Man just need something like this with tool calling.

yousif_123123 6 hours ago

I really hope that in a couple of years I can have a laptop that runs a reasonably good coding agent locally, that I can run fast and do most of my programming with, without running my laptop hot. I could keep open code and use other models when needed, but really for most of my work, I'm already breaking it down so that I can review code changes eventually, and I just need something reasonably decent and fast and unlimited. I think it's coming.

sorenjan 7 hours ago

How would you best utilize a model like this for coding? I take it it's not meant for vibe coding a full app, and the reasoning probably makes it unsuitable for autocomplete. Would you use it to implement specific functions? I looked at one of the coding benchmarks used, LiveCodeBench, and it seems to be problem descriptions with sample input and output, and then a solution with a single function or class.

Seems like a really good model to use in an IDE when you still want control over the code structure then.
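
To make that task shape concrete, here is a toy sketch of "one function plus sample inputs and outputs". It is not the benchmark's real harness; the task, `two_largest_sum`, and `passes_samples` are invented for illustration.

```python
def passes_samples(solution, samples):
    """True if solution(*args) equals expected for every (args, expected) pair."""
    return all(solution(*args) == expected for args, expected in samples)


# Hypothetical task: return the sum of the two largest values in a list.
# This is the kind of single function the model would be asked to write.
def two_largest_sum(values):
    top_two = sorted(values)[-2:]
    return sum(top_two)


# Sample cases as they might appear alongside a problem statement.
samples = [
    (([3, 1, 4, 1, 5],), 9),
    (([10, -2],), 8),
]
result = passes_samples(two_largest_sum, samples)
```

In an IDE workflow the developer would own the signature and the samples, and the model would only fill in the function body.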

aero2146 17 hours ago

I tried generating the classic pelican svg, but it failed horribly just showing me a rectangle and a black circle...

nolist_policy 5 hours ago

Notable:

  VibeThinker-3B is developed through a staged post-training pipeline built upon Qwen2.5-Coder-3B base, a compact 3B foundation model.

Qwen2.5 is ancient by LLM standards.

virajk_31 8 hours ago

An SLM trained for a single use case often beats an LLM. That's both the advantage and the limitation.

achrono 5 hours ago

Beats Opus 4.5 on reasoning you say?

Prompt: If A goes to B who then goes to C, can A send something to C?

Response:

We need to interpret best. The phrase "If A goes to B who then goes to C, can A send something to C?" could be a puzzle about the concept of sending something (like passing a ball) and the relationships.

Scenario: A gives something to B, and B passes it on to C. Question: Can A also give the same thing to C? Answer: Only if A can obtain a second copy (e.g., the thing was duplicated). Otherwise, after handing it to B, A no longer holds it and cannot “send” it unless a copy exists.

[Lots of other unnecessary commentary and "scenarios" that make even less sense]

androiddrew 9 hours ago

I have been thinking about how to use this. Since it doesn’t support tool calling I have been considering a dual model deployment, where a small tool-calling LLM drives the majority of the user experience, and VibeThinker is tapped for reasoning by the other LLM.

So who has suggestions on small models with excellent tool calling capabilities?

andai 3 hours ago

I tried actually talking to it. It reminded me of GPT-2.

SwellJoe 16 hours ago

It's terrible at hunting security bugs (I expected it to be, but I wanted to be sure). I added it to a benchmark I made with a corpus of some Mythos-discovered bugs, and it found zero. The smallest pretty successful models remain Qwen 3.6 and Gemma 4 (but I haven't tested the very small variants of those yet).

https://swelljoe.com/post/will-it-mythos/

iamgopal 7 hours ago

Two models: one optimised for system, reasoning etc., the second optimised for a specific language (Rust or Go?), both small enough to run on a local computer. Will it work?

jpcompartir 6 hours ago

The absolute worst name for a model I've seen

brainless 12 hours ago

I recently came across this model and I would love to try it with my coding agent soon.

I really like the idea of small models that can reason but do not have too much knowledge. Also, no emphasis on tool calls. I think the agent should do the heavy lifting and reach half way.

I use really small models, like Qwen 3.5 0.8B to 9B - no tool calling, no MCP, no skills, nothing. No multi-turn chat even. Models are given very specific tasks using a vast number of system prompts and all the response handling is done in the agent(s).

https://github.com/brainless/nocodo

makethembroke 2 hours ago

I don't get this beating Opus. It just hardcoded the tasks for the benchmark. It doesn't even respond normally.

A lot of randomness in it.

Please don't hype.

cold_harbor 7 hours ago

GRPO skips the value network that makes PPO expensive: it scores candidates relative to each other within a group. That's what makes verifiable-reward training practical at 3B scale.
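
A minimal sketch of the group-relative scoring step, assuming the commonly described form where each reward is normalized by the mean and standard deviation of its own group. The exact recipe used for VibeThinker may differ; `group_relative_advantages` and the `eps` guard are illustrative choices, and population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    No learned value network is involved: the group itself is the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt, scored by a verifier (1.0 = passed).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Answers that pass the verifier get a positive advantage and the failures get a negative one. If every sample in a group scores the same, all advantages are zero and that prompt contributes no gradient signal.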

uberex 9 hours ago

What is the idiot's guide to running this one locally now?

unfirehose 8 hours ago

This is a good model. I benchmarked its reasoned answers against Qwen 3.6 27B (no think) + bash and it held up.

diimdeep 5 hours ago

BF16 with no QAT quants == half-baked bread

scotty79 12 hours ago

If you could pair it somehow with a model that can code and describe code this could be a very powerful combo.

anonyfox 13 hours ago

Wake me up when it does OCaml fine.

4gotunameagain 7 hours ago

What are the implications of local SOTA inference, given the insane datacenter "investing"?

It surely cannot be justified only for training at this scale, especially since models nowadays are improved more and more by fine-tuning than by retraining from scratch.

Will a viable local model crash the US economy?

More importantly, are the LLM companies aware, and are they deliberately buying out all the RAM and GPUs in order to postpone the inevitable? Probably not, but I wouldn't be surprised if that is the case.

maxignol 11 hours ago

A 3B-param model on par with Opus 4.5 sounds interesting. Will read the full article before making up my mind.

zkmon 14 hours ago

Does Python coding depend on political facts of the world?

It might appear not, but actually, the process of reasoning is not an isolated act. The right and wrong way of doing things is codified in social evolution that absorbed all facets of life. Why should you optimize a piece of code for performance? Why is performance needed? What is a bug? What features and UI themes would be more intuitive for humans?

There is a butterfly effect. Everything affects everything to some extent.

kmchandy 5 hours ago

The paper makes a clear claim: "it provides an important and concrete proof: on well-constrained, verifiable reasoning tasks, first-tier performance is no longer the exclusive domain of ultra-large models." And that's exciting.