Research · 9 min read

Gemma 4 on NVIDIA GB10: Quantization Benchmarks for Local Inference

A hands-on benchmark of Gemma 4 on NVIDIA GB10 comparing 31B dense and 26B-A4B MoE variants across speed, memory, thinking mode, and practical deployment tradeoffs.

Elena Voss · Updated April 8, 2026

Running frontier-class open models locally is no longer a science project. NVIDIA's GB10, the desktop Blackwell platform behind Project DIGITS, changes the operating envelope by combining serious compute with 128GB of unified CPU/GPU memory. That makes it a useful system for evaluating where local inference starts to feel practical instead of aspirational.

We benchmarked five Gemma 4 variants on the same GB10 box to understand a simple question: which quantization gives the best balance of speed, memory pressure, and usable quality for day-to-day AI work?

The short answer is that the biggest story is not Q4 versus Q8; it is dense versus mixture-of-experts. On this hardware, Gemma 4's 26B-A4B MoE variants are dramatically faster than the 31B dense models while matching the quality we observed across reasoning, coding, structured extraction, and general prompt following.

Test Setup

| Spec | Detail |
| --- | --- |
| Hardware | NVIDIA GB10 (Project DIGITS), 128GB unified memory |
| Runtime | Ollama 0.20.3 in Docker |
| GPU utilization | ~95% during generation |
| GPU temperature | ~70°C under load |
| Power draw | ~49W |
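Throughput figures like the ones below can be derived from the `eval_count`, `eval_duration`, `prompt_eval_count`, and `prompt_eval_duration` fields Ollama reports in its final `/api/generate` response (durations are in nanoseconds). A minimal sketch of the arithmetic; the sample values are illustrative, not taken from these runs:

```python
def speeds(resp: dict) -> tuple[float, float]:
    """Compute generation and prompt-processing speed (tok/s) from the
    final Ollama /api/generate response. Durations are nanoseconds."""
    gen = resp["eval_count"] / (resp["eval_duration"] / 1e9)
    prompt = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / 1e9)
    return gen, prompt

# Illustrative response fields, not measurements from this benchmark
resp = {
    "eval_count": 611,
    "eval_duration": 10_000_000_000,        # 10 s of decoding
    "prompt_eval_count": 616,
    "prompt_eval_duration": 1_000_000_000,  # 1 s of prompt ingestion
}
gen, prompt = speeds(resp)
print(f"{gen:.1f} tok/s generation, {prompt:.1f} tok/s prompt processing")
```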

Five Gemma 4 variants were tested:

| Variant | Architecture | Disk Size | Observed Memory During Inference |
| --- | --- | --- | --- |
| 31B BF16 | Dense | 62GB | ~62GB |
| 31B Q8_0 | Dense | 33GB | ~33GB |
| 31B Q4_K_M | Dense | 19GB | ~68GB |
| 26B-A4B MoE Q8_0 | Mixture of experts, 4B active | 28GB | ~28GB |
| 26B-A4B MoE Q4_K_M | Mixture of experts, 4B active | 17GB | ~17GB |

The 31B Q4_K_M run is the outlier. Its runtime footprint was much higher than its on-disk size because KV cache growth and runtime buffers mattered more than the model artifact size alone. That is an important reminder for local deployment planning: storage and live memory are related, but they are not the same constraint.
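The KV cache scales with context length rather than artifact size, which is why it can dominate the live footprint. A rough estimator, using hypothetical architecture values for illustration only (this post does not specify Gemma 4's actual layer or head counts):

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """Rough KV-cache size: 2 tensors (K and V) per layer, one vector of
    head_dim values per KV head per cached token, FP16 by default."""
    total = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return total / 1024**3

# Placeholder config, NOT Gemma 4's real architecture: 48 layers,
# 8 KV heads of dim 128, a 128K-token context, FP16 cache.
print(f"~{kv_cache_gb(48, 8, 128, 131_072):.0f} GB of KV cache alone")
```

Even under these made-up numbers, a full-length context adds tens of gigabytes on top of the weights, which is the shape of the gap seen in the 31B Q4_K_M row.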

Benchmark Summary

The benchmark suite covered seven practical workloads: reasoning, math, coding, JSON extraction, commonsense, creative writing, and long-form throughput. The table below shows the average generation and prompt-processing speeds across those tests.

| Variant | Avg Generation | Avg Prompt Processing | Disk Size |
| --- | --- | --- | --- |
| 26B-A4B MoE Q4_K_M | 61.1 tok/s | 616 tok/s | 17GB |
| 26B-A4B MoE Q8_0 | 45.2 tok/s | 414 tok/s | 28GB |
| 31B Q4_K_M | 10.3 tok/s | 286 tok/s | 19GB |
| 31B Q8_0 | 6.5 tok/s | 214 tok/s | 33GB |
| 31B BF16 | 3.9 tok/s | 143 tok/s | 62GB |

Two conclusions stand out immediately.

  • The 26B-A4B MoE Q4_K_M variant is roughly 6x faster than the 31B dense Q4 model.
  • The 26B-A4B MoE Q4_K_M variant is roughly 15x faster than the 31B BF16 run.
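The multipliers follow directly from the summary table:

```python
# Average generation speeds (tok/s) from the benchmark summary table
moe_q4, dense_q4, dense_bf16 = 61.1, 10.3, 3.9

print(f"{moe_q4 / dense_q4:.1f}x vs 31B dense Q4")    # ~5.9x
print(f"{moe_q4 / dense_bf16:.1f}x vs 31B dense BF16")  # ~15.7x
```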

That gap is large enough to change the product experience. The dense 31B models are usable for batch work and deliberate analysis. The MoE models are fast enough for interactive chat, iterative coding, and agent-style workflows where latency compounds across steps.

Task-Level Performance

The ranking stayed effectively the same across every test category.

All figures are generation speed in tok/s.

| Test | 26B MoE Q4 | 26B MoE Q8 | 31B Q4 | 31B Q8 | 31B BF16 |
| --- | --- | --- | --- | --- | --- |
| Reasoning | 61.5 | 45.4 | 10.3 | 6.5 | 3.8 |
| Math | 60.0 | 44.5 | 10.1 | 6.3 | 3.8 |
| Coding | 60.8 | 44.9 | 10.2 | 6.4 | 3.8 |
| JSON extraction | 61.9 | 45.6 | 10.4 | 6.5 | 3.9 |
| Commonsense | 62.1 | 46.3 | 10.5 | 6.6 | 3.9 |
| Creative writing | 62.4 | 46.3 | 10.6 | 6.7 | 4.0 |
| 1K-token throughput | 59.4 | 44.1 | 10.1 | 6.3 | 3.8 |

Within the same architecture, the quantization tradeoff looked familiar: Q4 was consistently faster than Q8, and Q8 outpaced BF16. More interesting was how little that mattered compared with the architectural jump from dense to MoE. If the goal is better local responsiveness on GB10, switching to MoE delivers a bigger win than chasing one more quantization tier inside the dense family.

Prompt ingestion showed the same pattern. The fastest MoE variant handled long prompts at 600 tok/s and above, which makes large contexts feel nearly instantaneous in practical use.

Thinking Mode Is Mostly a Token Budget Story

Gemma 4 includes a built-in thinking mode that expands the model's internal reasoning before it produces the visible answer. On paper, that sounds like a heavyweight performance feature. In practice, the main cost is not slower per-token decoding: it is that the model emits far more tokens.

Using the 31B Q4_K_M variant, thinking mode barely changed raw generation speed while significantly increasing total completion time:

| Test | Thinking On | Thinking Off | Time (On) | Time (Off) |
| --- | --- | --- | --- | --- |
| Reasoning | 10.4 tok/s | 10.5 tok/s | 27.4s | 11.6s |
| Math | 10.3 tok/s | 10.4 tok/s | 88.7s | 42.9s |
| Code generation | 10.3 tok/s | 10.5 tok/s | 84.4s | 28.0s |
| Creative writing | 10.3 tok/s | 10.6 tok/s | 75.5s | 11.2s |
| Summarization | 10.3 tok/s | 10.6 tok/s | 38.4s | 6.9s |
| JSON extraction | 10.4 tok/s | 10.6 tok/s | 27.7s | 6.8s |
| Multilingual | 10.3 tok/s | 10.6 tok/s | 62.9s | 9.3s |
| Commonsense | 10.3 tok/s | 10.8 tok/s | 44.1s | 3.8s |

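Because decoding speed is nearly flat in both modes, the latency gap maps almost entirely onto extra emitted tokens. A back-of-envelope estimate from the table above, treating the rate as a constant ~10.3 tok/s (so these are approximations, not measured token counts):

```python
RATE = 10.3  # tok/s, roughly flat with thinking on or off

# (test, seconds with thinking on, seconds with thinking off)
runs = [("math", 88.7, 42.9), ("summarization", 38.4, 6.9),
        ("commonsense", 44.1, 3.8)]

for test, on_s, off_s in runs:
    extra = RATE * (on_s - off_s)  # implied extra tokens of deliberation
    print(f"{test}: ~{extra:.0f} extra tokens")
```

Hundreds of extra tokens per answer is the entire story behind the slowdown.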
For straightforward problems, thinking mode was mostly overhead.

  • On the average-speed math problem, both modes returned the correct answer of 48 mph.
  • On code generation, both modes produced a correct expand-around-center palindrome implementation.
  • On commonsense prompts, both modes handled classic traps such as the bat-and-ball problem and the strawberry letter-count task correctly.

The implication is practical: leave thinking mode off for routine workloads. Turn it on selectively for genuinely hard multi-step reasoning or when the extra internal deliberation is worth the latency.

Quality Held Up Under Aggressive Quantization

Across the tasks tested, quantization had almost no visible impact on outcome quality.

  • All five variants answered the average-speed problem correctly using the harmonic-mean logic instead of the naive average.
  • All five variants solved the bat-and-ball reflection test correctly.
  • All five variants counted the three letter r characters in "strawberry" correctly.
  • All five variants returned valid JSON for schema-bound extraction.
  • All five variants produced working code for the palindrome task.
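For reference, an expand-around-center solution of the kind the palindrome task asked for looks roughly like this; it is our own sketch, not any model's output:

```python
def longest_palindrome(s: str) -> str:
    """Longest palindromic substring via expand-around-center, O(n^2) time."""
    if not s:
        return ""
    best = s[0]
    for i in range(len(s)):
        # Two centers per index: odd-length (i, i) and even-length (i, i+1)
        for lo, hi in ((i, i), (i, i + 1)):
            while lo >= 0 and hi < len(s) and s[lo] == s[hi]:
                lo, hi = lo - 1, hi + 1
            cand = s[lo + 1:hi]  # expansion overshoots by one on each side
            if len(cand) > len(best):
                best = cand
    return best

print(longest_palindrome("babad"))  # prints "bab"
```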

That does not prove the variants are identical in every possible domain. The differences are more likely to appear in edge cases such as subtle creative writing, long-horizon reasoning, or highly specialized knowledge retrieval. But for the practical workloads most teams actually care about, the observed quality delta between BF16 and Q4 was negligible.

Comparison Against Other Local Models

On the same GB10 system, Gemma 4's MoE variant compares favorably with several other commonly discussed local models:

| Model | Architecture | Quantization | Generation Speed | Memory |
| --- | --- | --- | --- | --- |
| Gemma 4 26B-A4B MoE | MoE, 4B active | Q4_K_M | 61 tok/s | ~17GB |
| Gemma 4 26B-A4B MoE | MoE, 4B active | Q8_0 | 45 tok/s | ~28GB |
| Qwen 3.5-35B-A3B via vLLM | MoE, 3B active | n/a | ~30+ tok/s | Lower |
| Gemma 4 31B | Dense | Q4_K_M | 10.4 tok/s | ~68GB |
| Gemma 4 31B | Dense | Q8_0 | 6.5 tok/s | ~33GB |
| DeepSeek-R1 70B | Dense | Q4 | ~5 tok/s | ~42GB |
| Gemma 4 31B | Dense | BF16 | 3.9 tok/s | ~62GB |

This is where the GB10 platform starts to make sense as a desktop AI box. The fastest Gemma 4 MoE configuration does not just run locally. It runs fast enough that local usage no longer feels like a compromise.

Recommendations

Best default for interactive local work: 26B-A4B MoE Q4_K_M

  • 61 tok/s is firmly in responsive-chat territory.
  • The 17GB artifact is relatively lightweight for the capability level.
  • Quality held up across the workloads tested.
  • It is the strongest fit for coding assistants, internal copilots, retrieval systems, and agent pipelines where latency matters.
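Reproducing this setup is mostly standard Ollama-in-Docker deployment. A sketch using the official image; note the model tag below is a placeholder we made up, not a confirmed registry name:

```shell
# Launch Ollama in Docker with GPU access (official ollama/ollama image)
docker run -d --gpus=all -v ollama:/root/.ollama \
  -p 11434:11434 --name ollama ollama/ollama

# Pull and chat with the MoE Q4 variant.
# "gemma4:26b-a4b-q4_K_M" is a hypothetical tag -- check the model
# registry for the actual name before running.
docker exec -it ollama ollama run gemma4:26b-a4b-q4_K_M
```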

Best dense option for fuller-context experiments: 31B Q4_K_M

  • At roughly 10 tok/s, it is workable but noticeably slower.
  • It remains a reasonable choice when you explicitly want the dense 31B behavior profile.
  • The larger context window may still matter for some batch or research use cases where latency is secondary.

Lowest-value configuration in this set: 31B BF16

  • The quality gain was not measurable in these tests.
  • The speed penalty was significant.
  • The memory and storage costs are hard to justify unless a specific evaluation requires BF16.

Limits of This Benchmark

These results should be read as field notes, not as a formal benchmark suite.

  • Tests were informal single-run evaluations rather than statistically repeated trials.
  • Vision and tool use were available in the model family but were not evaluated here.
  • The quality conclusions apply to the specific workloads tested, not every domain.
  • The reported memory behavior is operationally useful, but runtime footprints can move with context length, prompt shape, and implementation details.

Even with those limits, the trend is strong enough to guide real deployment choices.

Conclusion

Gemma 4 is a credible local model family on NVIDIA GB10, but the standout configuration is clear. The 26B-A4B MoE Q4_K_M variant offers the best balance of throughput, footprint, and observed quality by a wide margin. It is fast enough to feel interactive, light enough to run comfortably, and accurate enough for the practical tasks most local AI systems need to handle.

For teams evaluating desktop-scale private AI, that is the important threshold. The local model does not need to be theoretically perfect. It needs to be good enough, fast enough, and cheap enough to stay in the workflow. On GB10, Gemma 4's MoE Q4 variant clears that bar.

Tested in April 2026 on NVIDIA GB10 hardware running Ollama 0.20.3 in Docker. Variants tested: 31B BF16, 31B Q8_0, 31B Q4_K_M, 26B-A4B MoE Q8_0, and 26B-A4B MoE Q4_K_M.

gemma 4 · benchmarking · local ai · nvidia gb10 · ollama