I think gemma-3-27b-it-qat-4bit is my new favorite local model - or at least it's right up there with Mistral Small 3.1 24B.
I've been trying it on an M2 64GB via both Ollama and MLX. It's very, very good, and it only uses ~22GB (via Ollama) or ~15GB (MLX), leaving plenty of memory for running other apps.
Last night I had it write me a complete plugin for my LLM tool like this:
llm install llm-mlx
llm mlx download-model mlx-community/gemma-3-27b-it-qat-4bit
llm -m mlx-community/gemma-3-27b-it-qat-4bit \
-f https://raw.githubusercontent.com/simonw/llm-hacker-news/refs/heads/main/llm_hacker_news.py \
-f https://raw.githubusercontent.com/simonw/tools/refs/heads/main/github-issue-to-markdown.html \
-s 'Write a new fragments plugin in Python that registers
issue:org/repo/123 which fetches that issue
number from the specified github repo and uses the same
markdown logic as the HTML page to turn that into a
fragment'
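For context on what such a plugin ends up doing, here is a minimal, self-contained sketch of the issue-to-markdown step (the function name and markdown layout are my own assumptions, not the generated plugin; the field names follow GitHub's REST API for issues, and no network call is made):

```python
def issue_to_markdown(issue: dict) -> str:
    """Render an already-fetched GitHub issue payload (a dict using
    the REST API's field names) into a simple markdown fragment."""
    return "\n".join([
        f"# {issue['title']}",
        "",
        f"*{issue['user']['login']} opened issue #{issue['number']}*",
        "",
        issue.get("body") or "",
    ])

# Hand-written sample payload, standing in for the API response:
sample = {
    "title": "Add fragments support",
    "number": 123,
    "user": {"login": "example-user"},
    "body": "It would be nice to load issues as fragments.",
}
print(issue_to_markdown(sample))
```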
I have a few private “vibe check” questions and the 4-bit QAT 27B model got them all correct. I’m kind of shocked at the information density locked in just 13 GB of weights. If anyone at DeepMind is reading this — Gemma 3 27B is the single most impressive open source model I have ever used. Well done!
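That 13 GB figure is roughly what the arithmetic predicts: 27 billion parameters at 4 bits each is about 13.5 GB of raw weights (ignoring overhead for embeddings and metadata). A quick sanity check:

```python
def quantized_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate on-disk size of a model's weights in decimal GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(quantized_size_gb(27, 4))   # 4-bit QAT: 13.5
print(quantized_size_gb(27, 16))  # native BF16: 54.0
```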
The first graph compares the "Elo Score" of various models at "native" BF16 precision, and the second compares VRAM usage between native BF16 precision and their QAT models. But since this method is about quantizing while maintaining quality, isn't the obvious graph, a quality comparison between BF16 and QAT, missing? The text doesn't seem to discuss it either, yet it's basically the topic of the blog post.
Indeed!! I have swapped out Qwen2.5 for gemma3:27b-it-qat using Ollama for routine work on my 32GB Mac.
gemma3:27b-it-qat with open-codex, running locally, is just amazingly useful, not only for Python dev, but for Haskell and Common Lisp also.
I still like Gemini 2.5 Pro and o3 for brainstorming or working on difficult problems, but for routine work it (simply) makes me feel good to have everything open source/weights running on my own system.
When I bought my 32GB Mac a year ago, I didn't expect I'd be this happy running gemma3:27b-it-qat with open-codex locally.
Gemma 3 is way way better than Llama 4. I think Meta will start to lose its position in LLM mindshare. Another weakness of Llama 4 is its model size that is too large (even though it can run fast with MoE), which greatly limits the applicable users to a small percentage of enthusiasts who have enough GPU VRAM. Meanwhile, Gemma 3 is widely usable across all hardware sizes.
It seems pretty impressive - I'm running it on my CPU (16-core AMD 3950X) and it's very, very good at translation, and the image description is very impressive as well. I'm getting about 2.3 tokens/s (compared to under 1/s on the Calme-3.2 I was previously using).
It does tend to be a bit chatty unless you tell it not to be; it'll give you a 'breakdown' of pretty much everything unless you tell it not to - so for translation my prompt is 'Translate the input to English, only output the translation' to stop it giving a breakdown of the input language.
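For anyone scripting that workflow, the instruction belongs in the system message; a minimal Ollama /api/chat request body might look like this (the model tag and the sample input sentence are assumptions on my part):

```python
import json

# Build an Ollama /api/chat request that suppresses the model's
# habit of wrapping translations in a "breakdown".
payload = {
    "model": "gemma3:27b-it-qat",
    "messages": [
        {"role": "system",
         "content": "Translate the input to English, only output the translation."},
        {"role": "user", "content": "Guten Morgen, wie geht es dir?"},
    ],
    "stream": False,  # return one complete response instead of a token stream
}
body = json.dumps(payload)
print(body)
```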
It is funny that Microsoft has been peddling "AI PCs" and Apple has been peddling "made for Apple Intelligence" for a while now, when in fact usable local models are only barely starting to be a thing, even on extremely high-end consumer GPUs like the 3090.
I am running this on a 16GB AMD Radeon 7900 GRE in a 64GB machine with ROCm and llama.cpp on Windows 11. I can use Open WebUI or the native GUI for the interface. It is made available via an internal IP to all members of my home.
It runs at around 26 tokens/sec at FP16; FP8 is not supported by the Radeon 7900 GRE.
I just love it.
For coding, QwQ 32B is still king. But with a 16GB VRAM card it gives me ~3 tokens/sec, which is unusable.
I tried to make Gemma 3 write a PowerShell script with a terminal GUI interface and it ran into dead ends and finally gave up. QwQ 32B performed a lot better.
But for most general purposes it is great. My kid's been feeding it his school textbooks and asking it questions. It is better than anything else currently.
Somehow it is more "uptight" than Llama or the Chinese models like Qwen. Can't put my finger on it; the Chinese models seem nicer and more talkative.
Just for fun I created a new personal benchmark for vision-enabled LLMs: playing Minecraft. I used JSON structured output in LM Studio to create basic controls for the game. Unfortunately, no matter how hard I proompted, gemma-3-27b QAT is not really able to understand simple Minecraft scenarios. It would say things like "I'm now looking at a stone block. I need to break it" when it is looking out at the horizon in the desert.
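The structured-output side of this is straightforward: constrain the model to a small JSON schema of game actions, then validate what comes back. A stripped-down sketch (the action vocabulary here is invented for illustration, not the schema actually used):

```python
import json

# Hypothetical action vocabulary for a minimal Minecraft controller.
ALLOWED_ACTIONS = {"move_forward", "turn_left", "turn_right", "break_block", "jump"}

def parse_action(raw: str) -> dict:
    """Parse and sanity-check a model response constrained to a
    {"action": ..., "reason": ...} JSON object."""
    obj = json.loads(raw)
    if obj.get("action") not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {obj.get('action')!r}")
    return obj

# A well-formed structured response parses cleanly:
print(parse_action('{"action": "break_block", "reason": "stone block ahead"}'))
```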
What would be the best way to deploy this if you're maximizing for GPU utilization in a multi-user (API) scenario? Structured output support would be a big plus.
We're working with a GPU-poor organization with very strict data residency requirements, and these models might be exactly what we need.
I would normally say vLLM, but the blog post notably does not mention vLLM support.
I don't get the appeal. For LLMs to be useful at all you at least need to be in the dozen-exabit range per token, zettabit/s if you want something usable.
There is really no technological path towards supercomputers that fast on a human timescale, not even in 100 years.
The thing that makes LLMs useful is their ability to translate concepts from one domain to another. Overfitting on choice benchmarks, even a spread of them, will lower their usefulness in every general task by destroying information that is encoded in the weights.
Ask Gemma to write a five-paragraph essay on any niche topic and you will get plenty of statements that have an extremely small likelihood of applying to that topic, but a high likelihood of applying to related, more popular topics. ChatGPT less so, but still at least one per paragraph. I'm not talking about factual errors or common oversimplifications; I'm talking about completely unrelated statements. What you're asking about is largely outside its training data, of which a 27GB model gives you what, a few hundred gigs? Seems like a lot, but you have to remember that there is a lot of stuff you probably don't care about that many other people do. Stainless steel and Kubernetes are going to be well represented; your favorite media, probably not; anything relatively current, definitely not. Which sounds fine, until you realize that the people who care about stainless steel and Kubernetes likely care about some much more specific aspect that isn't going to be represented, and you are back to the same problem of low usability.
This is why I believe that scale is king and that both data and compute are the big walls. Google has YouTube data but they are only using it in Gemini.
Definitely my current fav. Also interesting that for many questions the response is very similar to the Gemini series. Must be sharing training datasets pretty directly.
I am currently using the Q4_K_M quantized version of gemma-3-27b-it locally. I previously assumed that a 27B model with image input support wouldn't be very high quality, but after actually using it, the generated responses feel better than those from my previously used DeepSeek-R1-Distill-Qwen-32B (Q4_K_M), and its recognition of images is also stronger than I expected. (I thought the model could only roughly understand the concepts in the image, but I didn't expect it to be able to recognize text within the image.)
Since this article publishes the optimized Q4 quantized version, it would be great if it included more comparisons between the new version and my currently used unoptimized Q4 version (such as benchmark scores).
(I deliberately wrote this reply in Chinese and had gemma-3-27b-it Q4_K_M translate it into English.)
Assuming this can match Claude's latest, and full-time usage (as in, you have a system that's constantly running code without any user input), you'd probably save $600 to $700 a month. A 4090 is only $2K, and you'll see an ROI within 90 days.
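Taking the midpoint of those figures ($650/month saved against a $2,000 card), the payback arithmetic works out to roughly three months:

```python
gpu_cost = 2000        # USD, RTX 4090 (the commenter's figure)
monthly_savings = 650  # USD, midpoint of the $600-700 estimate

days_to_roi = gpu_cost / (monthly_savings / 30)  # assume a 30-day month
print(round(days_to_roi))  # 92
```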
I can imagine this will serve to drive prices for hosted LLMs lower.
At this level, any company that produces even a nominal amount of code should be running LLMs on-prem (or on AWS if you're on the cloud).
Very excited to see these kinds of techniques, I think getting a 30B level reasoning model usable on consumer hardware is going to be a game changer, especially if it uses less power.
Apart from being lighter than the other models, is there anything else the Gemma model is specifically good at, or better than the other models at doing?
Been using the 27B QAT model for batch processing 50K+ internal documents. The 128K context is game-changing for our legal review pipeline. Though I wish the token generation was faster; at 20 tokens/s it's still too slow for interactive use compared to Claude Opus.
When I see 32B or 70B models performing similarly to 200+B models, I don’t know what to make of it. Either the latter contain more breadth of information and we have managed to distill the latent capabilities down to similar levels, or the larger models are just less efficient, or the tests are not very good.
Anyone packaged one of these in an iPhone app? I am sure it is doable, but I am curious what tokens/sec is possible these days. I would love to ship "private" AI apps if we can get reasonable tokens/sec.
This is my first time trying to locally host a model - gave both the 12B and 27B QAT models a shot.
I was both impressed and disappointed. Setup was piss easy, and the models are great conversationalists. I have a 12 gig card available and the 12B model ran very nice and swift.
However, they're seemingly terrible at actually assisting with stuff. Tried something very basic: asked for a PowerShell one-liner to get the native block size of my disks. It ended up hallucinating fields, then sent me off into the deep end: first elevating to admin, then using WMI, then bringing up IOCTL. Pretty unfortunate. Not sure I'll be able to put it to actual meaningful use as a result.
nice, loving the push with local models lately - always makes me wonder though: do you think privacy wins out over speed and convenience in the long run, or do people just stick with what's quickest?
These have been out for a while; if you follow the HF link you can see, for example, the 27b quant has been downloaded from HF 64,000 times over the last 10 days.
Is there something more to this, or is it just a follow-up blog post?
(Is it just that Ollama finally has partial support, with no image input yet? Or something else?)
Gemma 3 QAT Models: Bringing AI to Consumer GPUs
(developers.googleblog.com) | 600 points by emrah | 20 April 2025 | 273 comments
Comments
Some notes here: https://simonwillison.net/2025/Apr/19/gemma-3-qat-models/
It gave a solid response! https://gist.github.com/simonw/feccff6ce3254556b848c27333f52... - more notes here: https://simonwillison.net/2025/Apr/20/llm-fragments-github/
That said, the first graph is misleading about the number of H100s required to run DeepSeek R1 at FP16: the model is FP8.
Here is the JSON schema: https://pastebin.com/SiEJ6LEz System prompt: https://pastebin.com/R68QkfQu
Shouldn’t it fit a 5060 Ti 16GB, for instance?
Mapping from one JSON document with a lot of plain text into a new structure, and it fails every time.
Ask it to generate SVG, and it’s very simple and almost too dumb.
Nice that it doesn’t need that huge amount of RAM, and perform ok on smaller languages from my initial tests.
Am I missing something?