Nvidia DGX Spark: When benchmark numbers meet production reality

(publish.obsidian.md)

Comments

RyeCatcher 27 October 2025
Author here. I've updated the article based on your feedback. Thank you.

Key corrections:

Ollama GPU usage - I was wrong. It IS using the GPU (verified 96% utilization). My "CPU-optimized backend" claim was incorrect.

FP16 vs BF16 - enum caught the critical gap: I trained with BF16, tested inference with FP16 (broken), but never tested BF16 inference. "GPU inference fundamentally broken" was overclaimed. Should be "FP16 has issues, BF16 untested (likely works)."

llama.cpp - veber-alex's official benchmark link proves it works. My issues were likely version-specific, not representative.

ARM64+CUDA maturity - bradfa was right about Jetson history. ARM64+CUDA is mature. The new combination is Blackwell+ARM64, not ARM64+CUDA itself.
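
For anyone who wants to repeat the GPU-utilization check from the first correction, a minimal sketch (assumes nvidia-smi is on the PATH and that a model is generating in another terminal, e.g. via Ollama):

    # Poll GPU utilization while a model is generating in another terminal
    # (e.g. `ollama run gpt-oss:20b`). Assumes nvidia-smi is on the PATH.
    import subprocess
    import time

    for _ in range(10):
        out = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=utilization.gpu,memory.used",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        util, mem = out.split(", ")
        print(f"GPU util: {util}%  memory used: {mem} MiB")
        time.sleep(1)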

The HN community caught my incomplete testing, overclaimed conclusions, and factual errors.

Ship early, iterate publicly, accept criticism gracefully.

Thanks especially to enum, veber-alex, bradfa, furyofantares, stuckinhell, jasonjmcghee, eadwu, and renaudr. The article is significantly better now.

eitally 26 October 2025
One of my colleagues wrote a first-impressions blog post last week. It's from our company's perspective, but it's a solid overview of the product and its intended capabilities from the POV of an AI developer or data scientist.

https://www.anaconda.com/blog/python-nvidia-dgx-spark-first-...

RyeCatcher 26 October 2025
I absolutely love it. I've been up for days playing with it. But there are some bleeding-edge issues, so I tried to write a balanced article. I would highly recommend it for people who love to get their hands dirty. It blows away any consumer GPU.
eadwu 26 October 2025
There are bleeding-edge issues, but everyone converges on transformers, so that path is generally pain-free.

I haven't exactly bisected the issue, but I'm pretty sure convolutions are broken on sm_121 above a certain size: I'm getting a 20x memory blowup from a convolution after a 2x batch-size increase, _only_ on the DGX Spark.
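
A quick way to look for that kind of blowup is to compare peak CUDA memory for the same convolution at two batch sizes; a rough sketch (the shapes here are purely illustrative, not the ones known to trigger the problem):

    # Compare peak CUDA memory for the same convolution at batch N and 2N.
    # Shapes are illustrative only.
    import torch

    def peak_mem_gib(batch):
        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats()
        conv = torch.nn.Conv2d(64, 64, kernel_size=3, padding=1).cuda().half()
        x = torch.randn(batch, 64, 512, 512, device="cuda", dtype=torch.half)
        y = conv(x)
        torch.cuda.synchronize()
        return torch.cuda.max_memory_allocated() / 2**30

    for b in (8, 16):
        print(f"batch {b}: peak {peak_mem_gib(b):.2f} GiB")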

I haven't had any problems with inference, but I also don't use the transformers library that much.

llama.cpp was working for gpt-oss last time I checked, and at release; not sure if something broke along the way.

I don't know whether memory fragmentation is something fixable on the driver side - it might just be a problem with the kernel's policy and the GPL, which prevent Nvidia from automatically interfering with the memory subsystem at the granularity they'd like - see ZFS and its page table antics - or so my thinking goes.

If you've done stuff on WSL, you'll have seen similar issues, and you can fix it by running a service that periodically compacts and cleans up memory; I have it run every hour. Note that this does impact at the very least CPU performance and memory allocation speeds, but I have not had any issues with long training runs (24hr+) with it in place. (That's assuming fragmentation is even the issue - I have never tried without the service, since I put it in place right after getting the machine based on my WSL experience.)
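
For reference, a minimal sketch of that kind of hourly compaction job (the /proc/sys/vm knobs are standard Linux interfaces and need root; whether this actually helps on the Spark is eadwu's observation, not something verified here):

    # Periodically ask the kernel to compact memory and drop reclaimable
    # caches, in the spirit of the hourly service described above. Run as root.
    import time

    def compact_and_drop():
        with open("/proc/sys/vm/compact_memory", "w") as f:
            f.write("1")   # trigger memory compaction
        with open("/proc/sys/vm/drop_caches", "w") as f:
            f.write("3")   # drop page cache, dentries and inodes

    while True:
        compact_and_drop()
        time.sleep(3600)   # once an hour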

aseipp 26 October 2025
I'm not yet using mine for ML stuff because there are still a lot of issues like the ones this post outlines. But I am using mine as an ARM dev system in the meantime, and as a "workstation" it's actually quite good. The Cortex-X925 cores are Zen 5 class in performance, and it is overall an absolute unit for its size; I'm very impressed that a standard ARM core is pushing this level of performance in a desktop-class machine. I thought about buying a new Linux desktop recently, and this is good enough that I might just plug it into a monitor and use it instead.

It is also a standard UEFI+ACPI system; one Reddit user even reported that they were able to boot Fedora 42 and install the open kernel modules with no problem. The overall delta/number of specific patches in the Canonical 6.17-nvidia tree is pretty small from what I saw (the shipping kernel is 6.11). That, and the likelihood that the consumer variant will support Windows, hopefully bodes well for its upstream Linux compatibility.

To be fair, most of this is also true of Strix Halo from what I can tell (most benchmarks put the DGX furthest ahead at prompt processing and a bit ahead at raw token output; the software is still buggy and Blackwell is still a bumpy ride overall, so that may improve). But I think it's mostly the pricing that is holding it back. I'm curious what the consumer variant will be priced at.

veber-alex 26 October 2025
The llama.cpp issues are strange.

There are official benchmarks of the Spark running multiple models just fine on llama.cpp:

https://github.com/ggml-org/llama.cpp/discussions/16578

MaKey 26 October 2025
Why would you get this when a Ryzen AI Max+ 395 with 128 GB is a fraction of the price?
enum 26 October 2025
- https://publish.obsidian.md/aixplore/Practical+Applications/...

   Does it work if you change to torch.bfloat16?
- https://publish.obsidian.md/aixplore/Practical+Applications/...

  The PyTorch 2.9 wheels do work. You can pip install torch --index-url <whatever-it-is> and it just works. You do need to build flash attention from source, which takes an hour or so.
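
A minimal sketch of the torch.bfloat16 check enum is suggesting, for anyone who wants to try it (the model id and prompt are placeholders, not the ones from the article):

    # Compare BF16 inference against FP16 by swapping the dtype below.
    # Model id and prompt are placeholders.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder small model
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # the change enum suggests; try torch.float16 to compare
        device_map="auto",
    )

    inputs = tok("The DGX Spark is", return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=32)
    print(tok.decode(out[0], skip_special_tokens=True))
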
jsheard 26 October 2025
No mention of the monstrous 200 GbE NIC; it seems like a waste if people aren't finding a use for it.
stuckinhell 26 October 2025
I'm utterly shocked at the article saying GPU inference (PyTorch/Transformers) isn't working: "numerical instability produces bad outputs", "not viable for real-time serving", "wait for driver/CUDA updates"!

My job just got me and our entire team a DGX Spark. I'm impressed by how easy it is to run Ollama models I couldn't run on my laptop. gpt-oss:120b is shockingly better than I expected based on running the 20b model on my laptop.

The DGX has changed my mind about the future being small specialized models.
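
For anyone curious what that ease of use looks like, a minimal sketch with the Ollama Python client (assumes the Ollama server is running and the model has already been pulled):

    # Minimal sketch with the Ollama Python client; assumes `ollama serve`
    # is running and `ollama pull gpt-oss:120b` has completed.
    import ollama

    resp = ollama.chat(
        model="gpt-oss:120b",
        messages=[{"role": "user", "content": "Summarize what a DGX Spark is."}],
    )
    print(resp.message.content)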

semessier 26 October 2025
Nvidia products - including the GPU/CUDA libraries, the NICs, and the switches - frequently tend to feel like MVPs. They work in some cases, hopefully in the end, but they are far from polished products and have plenty of rough edges.
MomsAVoxell 26 October 2025
So, it seems like this makes the DGX a viable ARM-based workstation, for those of us who need/want such a thing, while also offering a relatively decent AI/ML environment.

Two things need to happen for me to get excited about this:

1. It stimulates other manufacturers into building their own DGX-class workstations.

2. This all eventually gets shipped in a decent laptop product.

As much as it pains me, until that happens it still seems like Apple Silicon is the more viable option, if not the most ethical.

pertymcpert 26 October 2025
This article is AI garbage:

  ARM64 Architecture: Not x86_64 (limited ML ecosystem maturity)
  No PyTorch wheels for ARM64+CUDA (must use Docker)
  Most ML tools optimized for x86

There is no evidence for any of this whatsoever. The author just asked Claude/Claude Code to write their article, and it plainly hallucinated some rubbish.

renaudr 26 October 2025
Have you tried running GPT-OSS-120B using TRT-LLM (which, as you hint, is probably what NVIDIA did for their benchmarks)?

https://cookbook.openai.com/articles/gpt-oss/run-nvidia
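
For context, the cookbook route goes through TensorRT-LLM's high-level LLM API; a rough, unverified sketch of what that looks like (the exact supported recipe for the Spark is in the linked cookbook, so treat this as an approximation):

    # Rough sketch of the TensorRT-LLM LLM API; see the linked cookbook for
    # the actual supported recipe for gpt-oss on NVIDIA hardware.
    from tensorrt_llm import LLM, SamplingParams

    llm = LLM(model="openai/gpt-oss-120b")  # assumes the weights fit in unified memory
    params = SamplingParams(max_tokens=128)

    for out in llm.generate(["Explain the DGX Spark in one sentence."], params):
        print(out.outputs[0].text)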

RyeCatcher 26 October 2025
Would love to hear from others using the Spark for model training and development.
amelius 26 October 2025
Kind of weird that (GPU) training works but inference doesn't ...
spwa4 27 October 2025
Am I reading this right? I was expecting much more performance. My 64 GB M1 Max (less than half the price of this machine) gets 40.72 tok/s on ollama/GPT-OSS-20B, and a colleague's 128 GB M4 Max (though 32 GB would be enough) gets about 67 tok/s; apparently the most recent software updates push that to 78 tok/s. The DGX Spark gets 82.74 tok/s.

A Ryzen AI Max+ 395 gets you 55 tok/s [1]

[1] https://www.reddit.com/r/LocalLLaMA/comments/1nabcek/anyone_...

buyucu 27 October 2025
Strix Halo from AMD appears to be a much more consumer-friendly alternative than DGX Spark.
fxtentacle 26 October 2025
"273 GB/sec memory bandwidth"

Really? Less RAM bw than an Epyc CPU? And 4x to 8x less than a consumer GPU?

How come this doesn’t massively limit LLM inference speeds?

suprjami 26 October 2025
So I can spend thousands of dollars to have an unstable training environment and inference performance worse than a US$200 3060.

Wow. Where do I sign up?