Devstral

(mistral.ai)

Comments

simonw 21 May 2025
The first number I look at these days is the file size via Ollama, which for this model is 14GB https://ollama.com/library/devstral/tags

I find that on my M2 Mac that number is a rough approximation of how much memory the model needs (usually plus about 10%) - which matters because I want to know how much RAM I will have left for running other applications.

Anything below 20GB tends not to interfere with the other stuff I'm running too much. This model looks promising!
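
A minimal sketch of that heuristic (the 14GB file size and the ~10% / 20GB figures are just the numbers from this comment, not measurements):

    # Rough "file size plus ~10%" estimate of resident RAM for a local model.
    def estimated_ram_gb(model_file_gb: float, overhead: float = 0.10) -> float:
        return model_file_gb * (1 + overhead)

    budget_gb = 20   # anything below this leaves room for other apps
    size_gb = 14.0   # Devstral's download size on ollama.com
    est = estimated_ram_gb(size_gb)
    print(f"~{est:.1f} GB resident -> {'fits' if est <= budget_gb else 'too big'}")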

oofbaroomf 21 May 2025
The SWE-Bench scores are very, very high for an open source model of this size. 46.8% is better than o3-mini (with Agentless-lite) and Claude 3.6 (with AutoCodeRover), but it is a little lower than Claude 3.6 with Anthropic's proprietary scaffold. And considering you can run this for almost free, this is an extraordinary model.
dismalaf 21 May 2025
It's nice that Mistral is back to releasing actual open source models. Europe needs a competitive AI company.

Also, Mistral has been killing it with their most recent models. I pay for Le Chat Pro, it's really good. Mistral Small is really good. Also building a startup with Mistral integration.

johnQdeveloper 21 May 2025
For people without a 24GB video card: I've got an 8GB one, and this model performs OK for simple tasks on ollama, but you'd probably want to pay for an API for anything time-sensitive that uses a large context window:

    total duration:       35.016288581s
    load duration:        21.790458ms
    prompt eval count:    1244 token(s)
    prompt eval duration: 1.042544115s
    prompt eval rate:     1193.23 tokens/s
    eval count:           213 token(s)
    eval duration:        33.94778571s
    eval rate:            6.27 tokens/s

    total duration:       4m44.951335984s
    load duration:        20.528603ms
    prompt eval count:    1502 token(s)
    prompt eval duration: 773.712908ms
    prompt eval rate:     1941.29 tokens/s
    eval count:           1644 token(s)
    eval duration:        4m44.137923862s
    eval rate:            5.79 tokens/s
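
(Those stats look like the output of ollama run --verbose; the reported rates are just token counts divided by durations. A quick sanity check of the eval rates above:)

    # eval rate = eval count / eval duration, using the numbers quoted above
    runs = [(213, 33.948), (1644, 284.138)]  # (tokens, seconds); 4m44.138s = 284.138s
    for tokens, seconds in runs:
        print(f"{tokens / seconds:.2f} tokens/s")
    # prints roughly 6.27 and 5.79, matching the reported eval rates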

Compared to an API call that finishes in about 20% of the time, it feels a bit slow without the recommended graphics card and whatnot, is all I'm saying.

In terms of benchmarks, it seems unusually well tuned for the model size, but I suspect that's just a case of gaming the measurement by testing against the benchmark as part of the model's development. That's not bad in and of itself; I suspect every LLM vendor marketing to IT folks does the same thing, so it's objective enough as a rough gauge of "is this usable?" without a heavy time investment in testing.

solomatov 21 May 2025
It's very nice that it has the Apache 2.0 license, i.e. a well-understood license, instead of some "open weight" license with a lot of conditions.
CSMastermind 21 May 2025
I don't believe the benchmarks they're presenting.

I haven't tried it out yet but every model I've tested from Mistral has been towards the bottom of my benchmarks in a similar place to Llama.

Would be very surprised if the real life performance is anything like they're claiming.

christophilus 22 May 2025
What hardware are y'all using when you run these things locally? I was thinking of pre-ordering the Framework desktop[0] for this purpose, but I wouldn't mind having a decent laptop that could run it (ideally Linux).

[0] https://frame.work/desktop

jwr 22 May 2025
My experience with LLMs seems to indicate that the benchmark numbers are more and more detached from reality, at least my reality.

I tested this model with several of my Clojure problems and it is significantly worse than qwen3:30b-a3b-q4_K_M.

I don't know what to make of this. I don't trust benchmarks much anymore.

ddtaylor 21 May 2025
Wow. I was just grabbing some models and happened to see this one while messing with tool support in LlamaIndex. I have an agentic coding thing I threw together and have been trying different models on it; I was looking to throw ReAct at it to bring in some models that don't have tool support, and this just pops into existence!

I'm not able to get my agentic system to use this model, though, as it just says "I don't have the tools to do this". I tried modifying various agent prompts to explicitly say "Use foo tool to do bar" without any luck yet. All of the ToolSpecs I use are annotated Pydantic objects, and every other model has figured out how to use these tools.
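
A minimal sketch of forcing explicit tool use through LlamaIndex's ReAct agent with an Ollama-served model; the module paths assume a recent llama-index with the Ollama integration installed, and the list_files tool and model tag are made up for illustration:

    # Sketch only: ReAct describes the tool schema in the prompt itself,
    # which can help when a model's native tool calling doesn't cooperate.
    import os

    from llama_index.core.agent import ReActAgent
    from llama_index.core.tools import FunctionTool
    from llama_index.llms.ollama import Ollama

    def list_files(path: str) -> str:
        """List the files in a directory (hypothetical example tool)."""
        return "\n".join(os.listdir(path))

    llm = Ollama(model="devstral", request_timeout=120.0)
    tools = [FunctionTool.from_defaults(fn=list_files)]

    agent = ReActAgent.from_tools(tools, llm=llm, verbose=True)
    print(agent.chat("List the files in the current directory using the list_files tool."))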

qwertox 21 May 2025
Maybe the EU should cover the cost of creating this agent/model, assuming it really delivers what it promises. It would allow Mistral to keep focusing on what they do and for us it would mean that the EU spent money wisely.
ics 21 May 2025
Maybe someone here can suggest tools or at least where to look; what are the state-of-the-art models to run locally on relatively low power machines like a MacBook Air? Is there anyone tracking what is feasible given a machine spec?

"Apple Intelligence" isn't it but it would be nice to know without churning through tests whether I should bother keeping around 2-3 models for specific tasks in ollama or if their performance is marginal there's a more stable all-rounder model.

twotwotwo 22 May 2025
Any company in this space outside of the top few should be contributing to the open-source tools (Aider, OpenHands, etc.); that is a better bet than making your own tools from scratch to compete with ones from much bigger teams. A couple folks making harnesses work better with your model might yield improvements faster than a lot of model-tuning work, and might also come out of the process with practical observations about what to work on in the next spin of the model.

Separately, deploying more autonomous agents that just look at an issue or such just seems premature now. We've only just gotten assisted flows kind-of working, and they still get lost--get stuck on not-hard-for-a-human debugging tasks, implement Potemkin 'fixes', forget their tools, make unrelated changes that sometimes break stuff, etc.--in ways that imply that flow isn't fully-baked yet.

Maybe the main appeal is asynchrony/potential parallelism? You could tackle that different ways, though. And SWEBench might be a good benchmark still (focus on where you want to be, even if you aren't there yet), but that doesn't mean it represents the most practical way to use these tools day-to-day currently.

screye 22 May 2025
What's the play for smaller base-model training companies like Mistral?

Mistral's positioning as the European alternative doesn't seem to be sticking. Acquisition seems tricky given how Inflection, Character.AI and Stability got carved up. The big acquisition bucks are going to product companies (Windsurf).

They could pivot up the stack, but then they'd be starting from scratch with a team that's ill-suited for product development.

The base model offerings from pretraining companies have been surprisingly myopic. DeepMind seems to be the only one going past the obvious "content gen / coding automation" verticals. There's a whole world out there. LLM product companies are fast acquiring pieces of the real money pie, and smaller pretraining companies are getting left out.

______

edit: my comment rose to the top. It's early in the morning. Injecting a splash of optimism.

LLMs are hard, and giants like Meta are struggling to make steady progress. Mistral's models are cheap, competent, open-source-ish and don't come with AGI-is-imminent baggage. Good enough for me.

To my own question: They have a list of target industries at the top. https://mistral.ai/solutions#industry

Good luck to them.

bravura 21 May 2025
And how do the results compare to hosted LLMs like Claude 3.7?
thih9 22 May 2025
> Devstral excels at using tools to explore codebases

As an AI and vibe coding newbie, how does that work? E.g. how would I use devstral and ollama and instruct it to use tools? Or would I need some other program as well?

YetAnotherNick 21 May 2025
The SWE-bench score is super impressive for a model of any size. However, providing only one benchmark result, and doing it in partnership with OpenHands, makes it seem like they focused too much on optimizing that number.
gyudin 21 May 2025
Super weird benchmarks
abrowne2 21 May 2025
Curious to check this out, since they say it can run on a 4090 / Mac with >32 GB of RAM.
jadbox 21 May 2025
But how does it compare to deepcoder?
AnhTho_FR 21 May 2025
Impressive performance!
anonym29 22 May 2025
I know it's not the recommended runner (OpenHands), but running this on Cline (ollama back-end), it seemed absolutely atrocious at file reading and tool calling.
TZubiri 21 May 2025
I feel this is part of a larger and very old business trend.

But do we need 20 companies copying each other and doing the same thing?

Like, is that really competition? I'd say competition is when you do something slightly different, but I guess it's subjective based on your interpretation of what is a commodity and what is proprietary.

To my view, everyone is outright copying and creating commodity markets:

OpenAI: The OG, the Coke of Modern AI

Claude: The first copycat, The Pepsi of Modern AI

Mistral: Euro OpenAI

DeepSeek: Chinese OpenAI

Grok/xAI: Republican OpenAI

Google/MSFT: OpenAI clone as a SaaS or Office package.

Meta's Llama: Open Source OpenAI

etc...

ManlyBread 21 May 2025
>Devstral is light enough to run on a single RTX 4090 or a Mac with 32GB RAM, making it an ideal choice for local deployment and on-device use

This is still too much; a single 4090 costs $3k.