Research · 9 min read

Gemma 4 on NVIDIA GB10: Quantization Benchmarks for Local Inference

A hands-on benchmark of Gemma 4 on NVIDIA GB10 comparing 31B dense and 26B-A4B MoE variants across speed, memory, thinking mode, and practical deployment tradeoffs.

Elena Voss · Updated April 8, 2026

Running frontier-class open models locally is no longer a science project. NVIDIA's GB10, the desktop Blackwell platform behind Project DIGITS, changes the operating envelope by combining serious compute with 128GB of unified CPU/GPU memory. That makes it a useful system for evaluating where local inference starts to feel practical instead of aspirational.

We benchmarked five Gemma 4 variants on the same GB10 box to understand a simple question: which quantization gives the best balance of speed, memory pressure, and usable quality for day-to-day AI work?

The short answer is that the biggest story is not Q4 versus Q8; it is dense versus mixture-of-experts. On this hardware, Gemma 4's 26B-A4B MoE variants are dramatically faster than the 31B dense models while matching the quality we observed across reasoning, coding, structured extraction, and general prompt following.

Test Setup

| Spec | Detail |
| --- | --- |
| Hardware | NVIDIA GB10 (Project DIGITS), 128GB unified memory |
| Runtime | Ollama 0.20.3 in Docker |
| GPU utilization | ~95% during generation |
| GPU temperature | ~70°C under load |
| Power draw | ~49W |
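Throughput figures like the ones below can be derived from the `eval_count`, `eval_duration`, `prompt_eval_count`, and `prompt_eval_duration` fields Ollama reports in its final `/api/generate` response (durations are in nanoseconds). A minimal sketch of the arithmetic; the sample values are illustrative, not taken from these runs:

```python
def speeds(resp: dict) -> tuple[float, float]:
    """Compute generation and prompt-processing speed (tok/s) from the
    final Ollama /api/generate response. Durations are nanoseconds."""
    gen = resp["eval_count"] / (resp["eval_duration"] / 1e9)
    prompt = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / 1e9)
    return gen, prompt

# Illustrative response fields, not measurements from this benchmark
resp = {
    "eval_count": 611,
    "eval_duration": 10_000_000_000,        # 10 s of decoding
    "prompt_eval_count": 616,
    "prompt_eval_duration": 1_000_000_000,  # 1 s of prompt ingestion
}
gen, prompt = speeds(resp)
print(f"{gen:.1f} tok/s generation, {prompt:.1f} tok/s prompt processing")
```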

Five Gemma 4 variants were tested:

| Variant | Architecture | Disk Size | Observed Memory During Inference |
| --- | --- | --- | --- |
| 31B BF16 | Dense | 62GB | ~62GB |
| 31B Q8_0 | Dense | 33GB | ~33GB |
| 31B Q4_K_M | Dense | 19GB | ~68GB |
| 26B-A4B MoE Q8_0 | Mixture of experts, 4B active | 28GB | ~28GB |
| 26B-A4B MoE Q4_K_M | Mixture of experts, 4B active | 17GB | ~17GB |

The 31B Q4_K_M run is the outlier. Its runtime footprint was much higher than its on-disk size because KV cache growth and runtime buffers mattered more than the model artifact size alone. That is an important reminder for local deployment planning: storage and live memory are related, but they are not the same constraint.
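The KV cache scales with context length rather than artifact size, which is why it can dominate the live footprint. A rough estimator, using hypothetical architecture values for illustration only (this post does not specify Gemma 4's actual layer or head counts):

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """Rough KV-cache size: 2 tensors (K and V) per layer, one vector of
    head_dim values per KV head per cached token, FP16 by default."""
    total = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return total / 1024**3

# Placeholder config, NOT Gemma 4's real architecture: 48 layers,
# 8 KV heads of dim 128, a 128K-token context, FP16 cache.
print(f"~{kv_cache_gb(48, 8, 128, 131_072):.0f} GB of KV cache alone")
```

Even under these made-up numbers, a full-length context adds tens of gigabytes on top of the weights, which is the shape of the gap seen in the 31B Q4_K_M row.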

Benchmark Summary

The benchmark suite covered seven practical workloads: reasoning, math, coding, JSON extraction, commonsense, creative writing, and long-form throughput. The table below shows the average generation and prompt-processing speeds across those tests.

| Variant | Avg Generation | Avg Prompt Processing | Disk Size |
| --- | --- | --- | --- |
| 26B-A4B MoE Q4_K_M | 61.1 tok/s | 616 tok/s | 17GB |
| 26B-A4B MoE Q8_0 | 45.2 tok/s | 414 tok/s | 28GB |
| 31B Q4_K_M | 10.3 tok/s | 286 tok/s | 19GB |
| 31B Q8_0 | 6.5 tok/s | 214 tok/s | 33GB |
| 31B BF16 | 3.9 tok/s | 143 tok/s | 62GB |

Two conclusions stand out immediately.

  • The 26B-A4B MoE Q4_K_M variant is roughly 6x faster than the 31B dense Q4 model.
  • The 26B-A4B MoE Q4_K_M variant is roughly 15x faster than the 31B BF16 run.
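The multipliers follow directly from the summary table:

```python
# Average generation speeds (tok/s) from the benchmark summary table
moe_q4, dense_q4, dense_bf16 = 61.1, 10.3, 3.9

print(f"{moe_q4 / dense_q4:.1f}x vs 31B dense Q4")    # ~5.9x
print(f"{moe_q4 / dense_bf16:.1f}x vs 31B dense BF16")  # ~15.7x
```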

That gap is large enough to change the product experience. The dense 31B models are usable for batch work and deliberate analysis. The MoE models are fast enough for interactive chat, iterative coding, and agent-style workflows where latency compounds across steps.

Task-Level Performance

The ranking stayed effectively the same across every test category.

All figures are generation speed in tok/s.

| Test | 26B MoE Q4 | 26B MoE Q8 | 31B Q4 | 31B Q8 | 31B BF16 |
| --- | --- | --- | --- | --- | --- |
| Reasoning | 61.5 | 45.4 | 10.3 | 6.5 | 3.8 |
| Math | 60.0 | 44.5 | 10.1 | 6.3 | 3.8 |
| Coding | 60.8 | 44.9 | 10.2 | 6.4 | 3.8 |
| JSON extraction | 61.9 | 45.6 | 10.4 | 6.5 | 3.9 |
| Commonsense | 62.1 | 46.3 | 10.5 | 6.6 | 3.9 |
| Creative writing | 62.4 | 46.3 | 10.6 | 6.7 | 4.0 |
| 1K-token throughput | 59.4 | 44.1 | 10.1 | 6.3 | 3.8 |

Within the same architecture, the quantization tradeoff looked familiar: Q4 was consistently faster than Q8, and Q8 outpaced BF16. More interesting was how little that mattered compared with the architectural jump from dense to MoE. If the goal is better local responsiveness on GB10, switching to MoE delivers a bigger win than chasing one more quantization tier inside the dense family.

Prompt ingestion showed the same pattern. The fastest MoE variant handled long prompts at 600 tok/s and above, which makes large contexts feel nearly instantaneous in practical use.

Thinking Mode Is Mostly a Token Budget Story

Gemma 4 includes a built-in thinking mode that expands the model's internal reasoning before it produces the visible answer. On paper, that sounds like a heavyweight performance feature. In practice, the main cost is not slower per-token decoding: it is that the model emits far more tokens.

Using the 31B Q4_K_M variant, thinking mode barely changed raw generation speed while significantly increasing total completion time:

| Test | Thinking On | Thinking Off | Time (On) | Time (Off) |
| --- | --- | --- | --- | --- |
| Reasoning | 10.4 tok/s | 10.5 tok/s | 27.4s | 11.6s |
| Math | 10.3 tok/s | 10.4 tok/s | 88.7s | 42.9s |
| Code generation | 10.3 tok/s | 10.5 tok/s | 84.4s | 28.0s |
| Creative writing | 10.3 tok/s | 10.6 tok/s | 75.5s | 11.2s |
| Summarization | 10.3 tok/s | 10.6 tok/s | 38.4s | 6.9s |
| JSON extraction | 10.4 tok/s | 10.6 tok/s | 27.7s | 6.8s |
| Multilingual | 10.3 tok/s | 10.6 tok/s | 62.9s | 9.3s |
| Commonsense | 10.3 tok/s | 10.8 tok/s | 44.1s | 3.8s |

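Because decoding speed is nearly flat in both modes, the latency gap maps almost entirely onto extra emitted tokens. A back-of-envelope estimate from the table above, treating the rate as a constant ~10.3 tok/s (so these are approximations, not measured token counts):

```python
RATE = 10.3  # tok/s, roughly flat with thinking on or off

# (test, seconds with thinking on, seconds with thinking off)
runs = [("math", 88.7, 42.9), ("summarization", 38.4, 6.9),
        ("commonsense", 44.1, 3.8)]

for test, on_s, off_s in runs:
    extra = RATE * (on_s - off_s)  # implied extra tokens of deliberation
    print(f"{test}: ~{extra:.0f} extra tokens")
```

Hundreds of extra tokens per answer is the entire story behind the slowdown.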
For straightforward problems, thinking mode was mostly overhead.

  • On the average-speed math problem, both modes returned the correct answer of 48 mph.
  • On code generation, both modes produced a correct expand-around-center palindrome implementation.
  • On commonsense prompts, both modes handled classic traps such as the bat-and-ball problem and the strawberry letter-count task correctly.

The implication is practical: leave thinking mode off for routine workloads. Turn it on selectively for genuinely hard multi-step reasoning or when the extra internal deliberation is worth the latency.

Quality Held Up Under Aggressive Quantization

Across the tasks tested, quantization had almost no visible impact on outcome quality.

  • All five variants answered the average-speed problem correctly using the harmonic-mean logic instead of the naive average.
  • All five variants solved the bat-and-ball reflection test correctly.
  • All five variants counted the three letter r characters in "strawberry" correctly.
  • All five variants returned valid JSON for schema-bound extraction.
  • All five variants produced working code for the palindrome task.
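For reference, an expand-around-center solution of the kind the palindrome task asked for looks roughly like this; it is our own sketch, not any model's output:

```python
def longest_palindrome(s: str) -> str:
    """Longest palindromic substring via expand-around-center, O(n^2) time."""
    if not s:
        return ""
    best = s[0]
    for i in range(len(s)):
        # Two centers per index: odd-length (i, i) and even-length (i, i+1)
        for lo, hi in ((i, i), (i, i + 1)):
            while lo >= 0 and hi < len(s) and s[lo] == s[hi]:
                lo, hi = lo - 1, hi + 1
            cand = s[lo + 1:hi]  # expansion overshoots by one on each side
            if len(cand) > len(best):
                best = cand
    return best

print(longest_palindrome("babad"))  # prints "bab"
```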

That does not prove the variants are identical in every possible domain. The differences are more likely to appear in edge cases such as subtle creative writing, long-horizon reasoning, or highly specialized knowledge retrieval. But for the practical workloads most teams actually care about, the observed quality delta between BF16 and Q4 was negligible.

Comparison Against Other Local Models

On the same GB10 system, Gemma 4's MoE variant compares favorably with several other commonly discussed local models:

| Model | Architecture | Quantization | Generation Speed | Memory |
| --- | --- | --- | --- | --- |
| Gemma 4 26B-A4B MoE | MoE, 4B active | Q4_K_M | 61 tok/s | ~17GB |
| Gemma 4 26B-A4B MoE | MoE, 4B active | Q8_0 | 45 tok/s | ~28GB |
| Qwen 3.5-35B-A3B via vLLM | MoE, 3B active | n/a | ~30+ tok/s | Lower |
| Gemma 4 31B | Dense | Q4_K_M | 10.4 tok/s | ~68GB |
| Gemma 4 31B | Dense | Q8_0 | 6.5 tok/s | ~33GB |
| DeepSeek-R1 70B | Dense | Q4 | ~5 tok/s | ~42GB |
| Gemma 4 31B | Dense | BF16 | 3.9 tok/s | ~62GB |

This is where the GB10 platform starts to make sense as a desktop AI box. The fastest Gemma 4 MoE configuration does not just run locally. It runs fast enough that local usage no longer feels like a compromise.

Recommendations

Best default for interactive local work: 26B-A4B MoE Q4_K_M

  • 61 tok/s is firmly in responsive-chat territory.
  • The 17GB artifact is relatively lightweight for the capability level.
  • Quality held up across the workloads tested.
  • It is the strongest fit for coding assistants, internal copilots, retrieval systems, and agent pipelines where latency matters.
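Reproducing this setup is mostly standard Ollama-in-Docker deployment. A sketch using the official image; note the model tag below is a placeholder we made up, not a confirmed registry name:

```shell
# Launch Ollama in Docker with GPU access (official ollama/ollama image)
docker run -d --gpus=all -v ollama:/root/.ollama \
  -p 11434:11434 --name ollama ollama/ollama

# Pull and chat with the MoE Q4 variant.
# "gemma4:26b-a4b-q4_K_M" is a hypothetical tag -- check the model
# registry for the actual name before running.
docker exec -it ollama ollama run gemma4:26b-a4b-q4_K_M
```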

Best dense option for fuller-context experiments: 31B Q4_K_M

  • At roughly 10 tok/s, it is workable but noticeably slower.
  • It remains a reasonable choice when you explicitly want the dense 31B behavior profile.
  • The larger context window may still matter for some batch or research use cases where latency is secondary.

Lowest-value configuration in this set: 31B BF16

  • The quality gain was not measurable in these tests.
  • The speed penalty was significant.
  • The memory and storage costs are hard to justify unless a specific evaluation requires BF16.

Limits of This Benchmark

These results should be read as field notes, not as a formal benchmark suite.

  • Tests were informal single-run evaluations rather than statistically repeated trials.
  • Vision and tool use were available in the model family but were not evaluated here.
  • The quality conclusions apply to the specific workloads tested, not every domain.
  • The reported memory behavior is operationally useful, but runtime footprints can move with context length, prompt shape, and implementation details.

Even with those limits, the trend is strong enough to guide real deployment choices.

Conclusion

Gemma 4 is a credible local model family on NVIDIA GB10, but the standout configuration is clear. The 26B-A4B MoE Q4_K_M variant offers the best balance of throughput, footprint, and observed quality by a wide margin. It is fast enough to feel interactive, light enough to run comfortably, and accurate enough for the practical tasks most local AI systems need to handle.

For teams evaluating desktop-scale private AI, that is the important threshold. The local model does not need to be theoretically perfect. It needs to be good enough, fast enough, and cheap enough to stay in the workflow. On GB10, Gemma 4's MoE Q4 variant clears that bar.

Tested in April 2026 on NVIDIA GB10 hardware running Ollama 0.20.3 in Docker. Variants tested: 31B BF16, 31B Q8_0, 31B Q4_K_M, 26B-A4B MoE Q8_0, and 26B-A4B MoE Q4_K_M.

gemma 4 · benchmarking · local ai · nvidia gb10 · ollama