I run Q4_K_XL. All it takes to get about 6 tk/sec is 512GB of RAM and two 3090 GPUs with llama.cpp -cmoe. I also have crappy DDR4 at 2400MHz; 3200MHz would bring that speed up to about 9 tk/sec. I also have an OK 32-core EPYC CPU; a better 64-core would bring it up to about 11 tk/sec. I did a budget build before the crazy hardware costs and I regret it every day. Nevertheless, it's fantastic being able to run this model at home. It's great for planning, and for one-shot prompting once you have a plan or all the context you need. This entire build cost $2400 when it was put together. If you're willing to be resourceful, you can find ways to run these models at home. I often get the silly question of why, and suggestions about how much I could save using a cloud API, but the Fable drama has opened eyes to why it's good for us to be independent. Thanks team unsloth, Q4_K_XL is solid. If you are going to grab a quant, make sure to get the K_XL variant if it can fit.
DwarfStar work-in-progress numbers: I see 14 tokens/sec generation, which slopes down to 10 t/s at context sizes of 10k or more. Consider that the indexed attention requires evaluating 2048 selected rows, 2x DeepSeek and with less compression, so performance with larger contexts goes south faster here. Prefill ranges from 180 t/s on small contexts down to 150 t/s or less on larger ones. I used DeepSeek v4 PRO under these conditions; it is usable, but it is far from the 35 t/s generation and 400 t/s prefill you get with DeepSeek v4 Flash 2-bit on a MacBook M5 Max. But my implementation is likely not optimized enough yet, so a bit more performance can be obtained. I'm using 4-bit quants. The model is also definitely less sparse than DeepSeek v4, so it activates a bigger percentage of parameters. If it works decently at 2-bit, that would be a win even for machines where 4-bit fits, since this would basically mean 2x the (equivalent) memory bandwidth for the routed experts.
Local inference really needs a system with 1.2-1.5 TB/s of memory bandwidth, 512GB of memory and 2-3 times the GPU compute of a Mac Studio M3 Ultra, at an affordable 10-15k price point. A variant with 1TB of memory would also be welcome at a 20k price point.
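To put rough numbers on the bandwidth argument above, here is a back-of-the-envelope sketch. It assumes decode is purely memory-bandwidth bound, meaning every generated token streams all active weights once. The active parameter count and the bandwidth figures are placeholders I picked for illustration, not GLM 5.2 or DeepSeek specs, and the function name is made up.

```python
def decode_tokens_per_sec(bandwidth_gb_s: float,
                          active_params_b: float,
                          bits_per_weight: float) -> float:
    """Upper bound on decode speed if each token reads every active weight once."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


# Hypothetical active parameter count in billions (an assumption, not a model spec).
ACTIVE_PARAMS_B = 40.0

for bandwidth in (800.0, 1200.0, 1500.0):  # GB/s, illustrative values only
    four_bit = decode_tokens_per_sec(bandwidth, ACTIVE_PARAMS_B, 4.0)
    two_bit = decode_tokens_per_sec(bandwidth, ACTIVE_PARAMS_B, 2.0)
    print(f"{bandwidth:6.0f} GB/s: 4-bit <= {four_bit:5.1f} t/s, "
          f"2-bit <= {two_bit:5.1f} t/s")
```

Under this model, halving the bits per weight doubles the ceiling, which is the "2x equivalent bandwidth" point. Real numbers land below the ceiling because of attention, KV cache reads and compute overhead.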
The most interesting part of this to me is not the benchmark table, but the packaging.
A model like GLM-5.2 being available as GGUF, usable through llama.cpp/Ollama/vLLM/SGLang/LM Studio, and wrapped for local agent workflows changes the category. It stops being "an impressive open model exists" and starts becoming "this is something a small team can actually put into its development stack."
For instance, a company buys an RX6000 setup for, say, $15k total. They could use this for data-heavy sifting that would otherwise cost a lot of Claude tokens.
It doesn't need to be as good as frontier-best. Just good enough.
I could see a business of people packaging this and handing it to companies who want Help Desk bots without any extra setup.
"it can fit" on 256GB of RAM, but it will be heavily quantized and still run very slowly. The headline number is not token generation, it's prompt processing. So if you get 10 tok/s and an API gives you 20-30 tok/s, it doesn't seem that bad on its face, but a Mac Studio or any other machine that's not loading all of it into GPU will do PP 20-50x slower than a purely GPU-based setup, which is what actually makes this unusable without $50k in GPUs.
On top of that, you will still be heavily quantized.
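A quick sketch of why prompt processing dominates, with made-up rates. The GPU prefill speed, the 40x slowdown (picked from inside the 20-50x range above) and the prompt and output sizes are all assumptions for illustration, not measurements.

```python
def response_time_s(prompt_tokens: int, output_tokens: int,
                    prefill_tps: float, decode_tps: float) -> float:
    """Total wall time: process the whole prompt, then generate the reply."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


PROMPT_TOKENS = 50_000   # hypothetical agentic-coding sized context
OUTPUT_TOKENS = 500      # hypothetical reply length

# Hypothetical all-GPU setup: fast prefill, 25 tok/s decode.
gpu = response_time_s(PROMPT_TOKENS, OUTPUT_TOKENS,
                      prefill_tps=2000.0, decode_tps=25.0)

# Hypothetical offloaded setup: prefill 40x slower, decode at 10 tok/s.
offloaded = response_time_s(PROMPT_TOKENS, OUTPUT_TOKENS,
                            prefill_tps=2000.0 / 40, decode_tps=10.0)

print(f"all-GPU:   {gpu:7.1f} s")
print(f"offloaded: {offloaded:7.1f} s")
```

With these placeholder numbers the decode gap is 2.5x, but the end-to-end gap is far larger because almost all of the offloaded time is spent waiting on the prompt.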
There is a push from multiple directions at the same time:
- new AI desktops with GB10s. They are relatively cheap and you can cluster them and load 1TB of VRAM
- Nvidia, AMD, Intel, Cerebras etc. pushing new hardware
- OSS models getting crazy good, like GLM 5.2
- flash models getting very good, like DeepSeek V4 Flash
- quantizations
- harnesses being able to use different models (big for difficult stuff, small for grunt work)
So hopefully soon for the ones who want to break free from APIs, we will be able to host at home a cluster of AI desktops at a reasonable price with Opus-level capabilities, can't wait!!
I feel like the gap is closing to be able to run good enough models locally even for coding and I would assume it could make some companies a bit nervous. Am I wrong about that?
Is this really worth it, though? Throughout the years my experience with quantized models has been that they feel like a lobotomized version of the original. Doesn't matter if it's an LLM, dedicated diffusion model or some other dedicated task. Sure, they get the job done. But a lot worse. The only ones that can somewhat hold up are the ones provided by the vendor directly. Gemma4 comes to mind. However I suspect they have some secret sauce other than just "let's quantize this" since they have the original model and its data at hand.
There should be more models trained natively at 4-bit, 1.25-bit and similar precisions. Those actually work great while being much smaller in comparison. But I guess there is some reason they remain pretty niche.
Can somebody help me understand the Quantization Analysis? It says "dynamic 4-bit UD-Q4_K_XL and dynamic 5-bit UD-Q5_K_XL are generally lossless" while showing a top-1% token agreement on the chart of 97.5%. Not what I would consider "generally lossless". Is this implying that some post-processing is going to account for the 2.5% loss? Beam search?
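For what it's worth, here is a toy sketch of how I would read that metric, assuming it means the fraction of positions where the quantized model's most likely next token matches the full-precision model's. That reading is my assumption, not something the post confirms, and the logits below are invented.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)


def top1_agreement(ref_logits, quant_logits):
    """Fraction of positions where both models pick the same top token."""
    matches = sum(argmax(r) == argmax(q)
                  for r, q in zip(ref_logits, quant_logits))
    return matches / len(ref_logits)


# Invented logits for 4 positions over a 3-token vocabulary.
REF = [[2.0, 0.1, -1.0], [0.3, 1.5, 0.2], [0.90, 0.88, -2.0], [-0.5, 0.0, 3.0]]
QUANT = [[1.9, 0.2, -0.9], [0.4, 1.4, 0.1], [0.87, 0.91, -2.1], [-0.4, 0.1, 2.8]]

print(top1_agreement(REF, QUANT))
```

In this toy data the only disagreement is at the position where the reference model's top two tokens are nearly tied, so a tiny perturbation flips the argmax. Whether the real 2.5% looks like that is exactly the thing worth asking about.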
GLM 5.2 is the first time I'm actually excited about AI! I'm not the most bullish on AI code for several reasons, but the biggest reason is the ownership model. We all know we're near the tail end of the "subsidized pricing" window for AI, and I've been hoping for so long to get an open weight model that is _close enough_ to the SOTA before this window closes - and we actually got it! I'm excited to be able to run GLM locally in the near future, and use these things like a tool instead of living in this for-rent model for the rest of my life. I'm excited to actually enjoy programming again.
I have high respect for unsloth's work, helping millions to get started with local AI, but this post appears to be kind of download bait.
Offloading too many layers to CPU is not going to work at all. I have tried this many times and had to rm -rf on those heavy hf cache folders. Also I doubt 1-bit or 2-bit quants of GLM 5.2, running mostly outside of VRAM can beat Q8_0 of Qwen3.6-27B fully loaded in VRAM - on usefulness.
I really don't think anyone is going to have a good time trying to run it on anything with 256GB of RAM no matter what the post says. 512 is the much more realistic minimum. I'm fortunate enough to have two 512GB RAM dual xeon workstations in my home office that I bought cheap before the price rise to mess around with things...
One advantage of a local LLM: you can serialize the context yourself, without being constrained by APIs. And let's not forget, the Big 2 encrypt their thinking. If you use custom clients, which is a very grey area already, being able to produce the raw context string is a big bonus. It takes away a lot of annoying constraints and needless mystique/obfuscation.
But I don't know how usable GLM 5.2 is vs the Big 2.
I've got access to a 192GB RAM Mac Studio, which is below the stated minimum RAM. Can swapping off fast disk be used to make it work out, especially since it's MoE?
I have up to 1TB of DDR4 in my server but it only has a 12GB VRAM 3060. Would getting a 24GB VRAM card make this a viable system or am I throwing money away?
...a bit of an odd question: how well do LLMs losslessly compress, as in for cold storage?
I definitely don't have the hardware to run this model at any kind of reasonable speed (and I don't want to use a super aggressive quantization that would kill performance). Even so, I think it would be cool to retain an offline copy, in case... I don't really know, a solar flare destroys the internet some day, or maybe a zombie apocalypse. It would just be cool to have.
But 1.5 TB is a bit too much! If it could be compressed down into something semi kind of reasonable, that would be fun!
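A small experiment along those lines, as a sketch only. It uses synthetic Gaussian "weights" rather than a real checkpoint, truncates them to bfloat16, and compresses the exponent bits separately from the sign and mantissa bits with zlib. The Gaussian assumption, the standard deviation and the helper names are all mine, so treat the ratios as illustrative.

```python
import random
import struct
import zlib


def bf16_bits(x: float) -> int:
    """Top 16 bits of the float32 encoding, i.e. bfloat16 by truncation."""
    return int.from_bytes(struct.pack(">f", x)[:2], "big")


def split_streams(weights):
    """Separate each bf16 value into an exponent byte and a sign+mantissa byte."""
    exponents, rest = bytearray(), bytearray()
    for w in weights:
        bits = bf16_bits(w)
        exponents.append((bits >> 7) & 0xFF)               # 8 exponent bits
        rest.append(((bits >> 15) << 7) | (bits & 0x7F))   # sign + 7 mantissa bits
    return bytes(exponents), bytes(rest)


def ratio(data: bytes) -> float:
    """Compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data, 9)) / len(data)


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(20_000)]  # synthetic stand-in
exp_stream, rest_stream = split_streams(weights)
exp_ratio, rest_ratio = ratio(exp_stream), ratio(rest_stream)
print(f"exponent bytes:      {exp_ratio:.2f}")
print(f"sign+mantissa bytes: {rest_ratio:.2f}")
```

On this synthetic data the exponent bytes shrink a lot because only a handful of exponent values occur, while the sign and mantissa bytes look close to random and barely compress. If real weights behave similarly, lossless compression trims a 16-bit checkpoint noticeably but nowhere near the savings of quantizing it.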
Any time I see one of these posts about models of this size a quote comes to mind – "Your Scientists Were So Preoccupied With Whether Or Not They Could, They Didn’t Stop To Think If They Should".
Only a select few have the hardware required to run this to begin with, and even then the forecasted performance makes me wonder if it’s worth it at all.
I wonder if AMD's new AI chip can run this with ease? I'm seriously considering buying it. GLM 5.2 is just shy of GPT 5.4, so I would welcome offloading any grunt work locally.
I am very excited for local LLMs. I think we may have GPT 5.5-xhigh level of performance for under 2000 EUR.
This should put more pressure on the frontier models to avoid sitting on any fancy stuff and lower token prices as a whole.
Nothing beats a local LLM disconnected from the cloud.
GLM-5.2 – How to Run Locally
(unsloth.ai) 571 points by TechTechTech 23 hours ago | 269 comments
Comments
https://unsloth.ai/docs/models/glm-5.2#usage-guide
In a prior thread, someone said it would take $500k in hardware:
https://news.ycombinator.com/item?id=48629970
Kinda shows they have a headstart rather than a magic moat
Do the runes make it smarter or just run faster (or both)?
But I do like Unsloth Studio, quite a lot. It's nicely designed.
And yet Apple won't sell them to you anymore. And I'm not too confident it will even be possible to hand them 10k to get one again.