I think gemma-3-27b-it-qat-4bit is my new favorite local model - or at least it's right up there with Mistral Small 3.1 24B.
I've been trying it on an M2 64GB via both Ollama and MLX. It's very, very good, and it only uses ~22GB (via Ollama) or ~15GB (MLX), leaving plenty of memory for running other apps.
Last night I had it write me a complete plugin for my LLM tool like this:
llm install llm-mlx
llm mlx download-model mlx-community/gemma-3-27b-it-qat-4bit
llm -m mlx-community/gemma-3-27b-it-qat-4bit \
-f https://raw.githubusercontent.com/simonw/llm-hacker-news/refs/heads/main/llm_hacker_news.py \
-f https://raw.githubusercontent.com/simonw/tools/refs/heads/main/github-issue-to-markdown.html \
-s 'Write a new fragments plugin in Python that registers
issue:org/repo/123 which fetches that issue
number from the specified github repo and uses the same
markdown logic as the HTML page to turn that into a
fragment'
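For context on what such a plugin ends up doing, here is a minimal, self-contained sketch of the issue-to-markdown step (the function name and markdown layout are my own assumptions, not the generated plugin; the field names follow GitHub's REST API for issues, and no network call is made):

```python
def issue_to_markdown(issue: dict) -> str:
    """Render an already-fetched GitHub issue payload (a dict using
    the REST API's field names) into a simple markdown fragment."""
    return "\n".join([
        f"# {issue['title']}",
        "",
        f"*{issue['user']['login']} opened issue #{issue['number']}*",
        "",
        issue.get("body") or "",
    ])

# Hand-written sample payload, standing in for the API response:
sample = {
    "title": "Add fragments support",
    "number": 123,
    "user": {"login": "example-user"},
    "body": "It would be nice to load issues as fragments.",
}
print(issue_to_markdown(sample))
```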
I have a few private “vibe check” questions and the 4-bit QAT 27B model got them all correct. I’m kind of shocked at the information density locked in just 13 GB of weights. If anyone at DeepMind is reading this — Gemma 3 27B is the single most impressive open source model I have ever used. Well done!
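That 13 GB figure is roughly what the arithmetic predicts: 27 billion parameters at 4 bits each is about 13.5 GB of raw weights (ignoring overhead for embeddings and metadata). A quick sanity check:

```python
def quantized_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate on-disk size of a model's weights in decimal GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(quantized_size_gb(27, 4))   # 4-bit QAT: 13.5
print(quantized_size_gb(27, 16))  # native BF16: 54.0
```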
The first graph compares the "Elo Score" of various models at "native" BF16 precision, and the second compares VRAM usage between native BF16 precision and their QAT models. But since this method is about quantizing while maintaining quality, isn't the obvious graph, a quality comparison between BF16 and QAT, missing? The text doesn't seem to discuss it either, yet it's basically the topic of the blog post.
Indeed!! I have swapped out Qwen2.5 for gemma3:27b-it-qat using Ollama for routine work on my 32GB Mac.
gemma3:27b-it-qat with open-codex, running locally, is just amazingly useful, not only for Python dev, but for Haskell and Common Lisp also.
I still like Gemini 2.5 Pro and o3 for brainstorming or working on difficult problems, but for routine work it (simply) makes me feel good to have everything open source/weights running on my own system.
When I bought my 32GB Mac a year ago, I didn't expect I'd be this happy running gemma3:27b-it-qat with open-codex locally.
Gemma 3 is way way better than Llama 4. I think Meta will start to lose its position in LLM mindshare. Another weakness of Llama 4 is its model size that is too large (even though it can run fast with MoE), which greatly limits the applicable users to a small percentage of enthusiasts who have enough GPU VRAM. Meanwhile, Gemma 3 is widely usable across all hardware sizes.
It seems pretty impressive - I'm running it on my CPU (16-core AMD 3950X) and it's very, very good at translation, and the image description is very impressive as well. I'm getting about 2.3 tokens/s (compared to under 1/s on the Calme-3.2 I was previously using).
It does tend to be a bit chatty unless you tell it not to be; it'll give you a 'breakdown' of pretty much everything unless you tell it not to - so for translation my prompt is 'Translate the input to English, only output the translation' to stop it giving a breakdown of the input language.
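For anyone scripting that workflow, the instruction belongs in the system message; a minimal Ollama /api/chat request body might look like this (the model tag and the sample input sentence are assumptions on my part):

```python
import json

# Build an Ollama /api/chat request that suppresses the model's
# habit of wrapping translations in a "breakdown".
payload = {
    "model": "gemma3:27b-it-qat",
    "messages": [
        {"role": "system",
         "content": "Translate the input to English, only output the translation."},
        {"role": "user", "content": "Guten Morgen, wie geht es dir?"},
    ],
    "stream": False,  # return one complete response instead of a token stream
}
body = json.dumps(payload)
print(body)
```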
It is funny that Microsoft has been peddling "AI PCs" and Apple has been peddling "made for Apple Intelligence" for a while now, when in fact usable local models are only barely starting to be a thing, even on extremely high-end consumer GPUs like the 3090.
I am running this on a 16GB AMD Radeon 7900 GRE in a 64GB machine with ROCm and llama.cpp on Windows 11. I can use Open WebUI or the native GUI for the interface. It is made available via an internal IP to all members of my home.
It runs at around 26 tokens/sec at FP16; FP8 is not supported by the Radeon 7900 GRE.
I just love it.
For coding, QwQ 32B is still king. But with a 16GB VRAM card it gives me ~3 tokens/sec, which is unusable.
I tried to make Gemma 3 write a PowerShell script with a terminal GUI interface and it ran into dead ends and finally gave up. QwQ 32B performed a lot better.
But for most general purposes it is great. My kid's been feeding it his school textbooks and asking it questions. It is better than anything else currently.
Somehow it is more "uptight" than Llama or the Chinese models like Qwen. Can't put my finger on it; the Chinese models seem nicer and more talkative.
Just for fun I created a new personal benchmark for vision-enabled LLMs: playing Minecraft. I used JSON structured output in LM Studio to create basic controls for the game. Unfortunately, no matter how hard I proompted, gemma-3-27b QAT is not really able to understand simple Minecraft scenarios. It would say things like "I'm now looking at a stone block. I need to break it" when it is looking out at the horizon in the desert.
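The structured-output side of this is straightforward: constrain the model to a small JSON schema of game actions, then validate what comes back. A stripped-down sketch (the action vocabulary here is invented for illustration, not the schema actually used):

```python
import json

# Hypothetical action vocabulary for a minimal Minecraft controller.
ALLOWED_ACTIONS = {"move_forward", "turn_left", "turn_right", "break_block", "jump"}

def parse_action(raw: str) -> dict:
    """Parse and sanity-check a model response constrained to a
    {"action": ..., "reason": ...} JSON object."""
    obj = json.loads(raw)
    if obj.get("action") not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {obj.get('action')!r}")
    return obj

# A well-formed structured response parses cleanly:
print(parse_action('{"action": "break_block", "reason": "stone block ahead"}'))
```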
What would be the best way to deploy this if you're maximizing for GPU utilization in a multi-user (API) scenario? Structured output support would be a big plus.
We're working with a GPU-poor organization with very strict data residency requirements, and these models might be exactly what we need.
I would normally say vLLM, but the blog post notably does not mention vLLM support.
I don't get the appeal. For LLMs to be useful at all you at least need to be in the dozen-exabit range per token, zettabit/s if you want something usable.
There is really no technological path towards supercomputers that fast on a human timescale, not even in 100 years.
The thing that makes LLMs useful is their ability to translate concepts from one domain to another. Overfitting on choice benchmarks, even a spread of them, will lower their usefulness in every general task by destroying information that is encoded in the weights.
Ask Gemma to write a five-paragraph essay on any niche topic and you will get plenty of statements that have an extremely small likelihood of applying to that topic, but a high likelihood of applying to related, more popular topics. ChatGPT less so, but still at least one per paragraph. I'm not talking about factual errors or common oversimplifications; I'm talking about completely unrelated statements. What you're asking about is largely outside its training data, of which a 27GB model gives you what, a few hundred gigs? Seems like a lot, but you have to remember that there is a lot of stuff you probably don't care about that many other people do. Stainless steel and Kubernetes are going to be well represented; your favorite media, probably not; anything relatively current, definitely not. Which sounds fine, until you realize that the people who care about stainless steel and Kubernetes likely care about some much more specific aspect that isn't going to be represented, and you are back to the same problem of low usability.
This is why I believe that scale is king and that both data and compute are the big walls. Google has YouTube data but they are only using it in Gemini.
Definitely my current fav. Also interesting that for many questions the response is very similar to the Gemini series. Must be sharing training datasets pretty directly.
I am currently using the Q4_K_M quantized version of gemma-3-27b-it locally. I previously assumed that a 27B model with image input support wouldn't be very high quality, but after actually using it, the generated responses feel better than those from my previously used DeepSeek-R1-Distill-Qwen-32B (Q4_K_M), and its recognition of images is also stronger than I expected. (I thought the model could only roughly understand the concepts in the image, but I didn't expect it to be able to recognize text within the image.)
Since this article publishes the optimized Q4 quantized version, it would be great if it included more comparisons between the new version and my currently used unoptimized Q4 version (such as benchmark scores).
(I deliberately wrote this reply in Chinese and had gemma-3-27b-it Q4_K_M translate it into English.)
Assuming this can match Claude's latest, and full-time usage (as in, you have a system that's constantly running code without any user input), you'd probably save $600 to $700 a month. A 4090 is only $2K, and you'll see an ROI within 90 days.
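Taking the midpoint of those figures ($650/month saved against a $2,000 card), the payback arithmetic works out to roughly three months:

```python
gpu_cost = 2000        # USD, RTX 4090 (the commenter's figure)
monthly_savings = 650  # USD, midpoint of the $600-700 estimate

days_to_roi = gpu_cost / (monthly_savings / 30)  # assume a 30-day month
print(round(days_to_roi))  # 92
```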
I can imagine this will serve to drive prices for hosted LLMs lower.
At this level, any company that produces even a nominal amount of code should be running LLMs on-prem (or on AWS if you're on the cloud).
Very excited to see these kinds of techniques, I think getting a 30B level reasoning model usable on consumer hardware is going to be a game changer, especially if it uses less power.
Apart from being lighter than the other models, is there anything else the Gemma model is specifically good at, or better than the other models at doing?
Been using the 27B QAT model for batch processing 50K+ internal documents. The 128K context is game-changing for our legal review pipeline. Though I wish the token generation was faster; at 20 tokens/s it's still too slow for interactive use compared to Claude Opus.
When I see 32B or 70B models performing similarly to 200+B models, I don’t know what to make of it. Either the latter contain more breadth of information and we have managed to distill the latent capabilities down to similar levels, or the larger models are just less efficient, or the tests are not very good.
Anyone packaged one of these in an iPhone app? I am sure it is doable, but I am curious what tokens/sec is possible these days. I would love to ship "private" AI apps if we can get reasonable tokens/sec.
This is my first time trying to locally host a model - gave both the 12B and 27B QAT models a shot.
I was both impressed and disappointed. Setup was piss easy, and the models are great conversationalists. I have a 12 gig card available and the 12B model ran very nice and swift.
However, they're seemingly terrible at actually assisting with stuff. Tried something very basic: asked for a PowerShell one-liner to get the native block size of my disks. It ended up hallucinating fields, then sent me off into the deep end: first elevating to admin, then using WMI, then bringing up IOCTL. Pretty unfortunate. Not sure I'll be able to put it to actual meaningful use as a result.
nice, loving the push with local models lately - always makes me wonder though: do you think privacy wins out over speed and convenience in the long run, or do people just stick with what's quickest?
These have been out for a while; if you follow the HF link you can see, for example, the 27b quant has been downloaded from HF 64,000 times over the last 10 days.
Is there something more to this, or is it just a follow-up blog post?
(Is it just that Ollama finally has partial support, with no image input yet? Or something else?)
Gemma 3 QAT Models: Bringing AI to Consumer GPUs
(developers.googleblog.com) | 600 points by emrah | 20 April 2025 | 273 comments
Comments
Some notes here: https://simonwillison.net/2025/Apr/19/gemma-3-qat-models/
It gave a solid response! https://gist.github.com/simonw/feccff6ce3254556b848c27333f52... - more notes here: https://simonwillison.net/2025/Apr/20/llm-fragments-github/
That said, the first graph is misleading about the number of H100s required to run DeepSeek R1 at FP16: the model is FP8.
Here is the JSON schema: https://pastebin.com/SiEJ6LEz System prompt: https://pastebin.com/R68QkfQu
Shouldn’t it fit a 5060 Ti 16GB, for instance?
Mapping from one JSON document with a lot of plain text into a new structure, and it fails every time.
Ask it to generate SVG, and it’s very simple and almost too dumb.
Nice that it doesn’t need that huge amount of RAM, and perform ok on smaller languages from my initial tests.
Am I missing something?