Research · 20 min read

Qwen3.6-35B on NVIDIA GB10: 243 llama.cpp Runs to Find the Best Local Quant

A detailed benchmark of Qwen3.6-35B-A3B on NVIDIA GB10 using 243 llama.cpp runs, comparing BF16, Q8, Q5, Q4, and MXFP4 GGUF variants for local AI inference, RAG, agents, and long-context workloads.

Marcus Callahan

Summary

Qwen3.6-35B-A3B runs comfortably on NVIDIA GB10-class hardware. That was not the real question.

The real question was which GGUF variant actually makes sense to use every day.

Raw BF16 fits in memory, but it is not the best daily-driver option. Q8 is larger than Q4 without being faster. Q5 works, but it does not clearly justify its extra size. The real decision comes down to UD-Q4_K_M versus MXFP4_MOE.

After 243 llama.cpp tuning runs and two sample lm-evaluation-harness quality checks, my recommendation is straightforward:

  • Use UD-Q4_K_M for general local chat, coding help, structured extraction, and agent work.
  • Use MXFP4_MOE when prompt ingestion is the bottleneck: RAG, long-context workflows, document processing, and agents carrying a lot of state.
  • Keep raw BF16 as a reference baseline, not a daily driver.
  • Keep flash attention on.
  • Start with f16/f16 KV cache unless memory pressure forces you to experiment.

That is the short version. The rest of this article explains why.


Executive Summary

| Question | Answer |
| --- | --- |
| Best all-around local variant | UD-Q4_K_M |
| Best long-prompt / RAG throughput | MXFP4_MOE |
| Fastest decode result | UD-Q4_K_M — 65.3 tok/s |
| Fastest prompt / prefill result | MXFP4_MOE — 2,496 tok/s |
| Best sample quality result | UD-Q4_K_M |
| Best daily-driver settings for chat | -ngl 99 -fa 1 -c 8192 -b 1024 -ub 256 |
| Best starting settings for RAG / long prompts | -ngl 99 -fa 1 -c 8192 -b 4096 -ub 1024 |
| Safe KV cache default | f16/f16 |
| Variant I would not use daily | BF16 GGUF |

The headline result is not just decode speed. The bigger result is the long-prompt gap.

On the best mixed long-prompt profile:

  • MXFP4 was 7.9x faster than BF16
  • UD-Q4 was 7.7x faster than BF16

That changes what feels practical locally.

For a chatbot, the difference between 30 tok/s and 65 tok/s is obvious. But for RAG and agent systems, prompt processing is often the larger bottleneck. Every retrieved document, tool result, scratchpad entry, JSON schema, and conversation-history chunk has to be ingested before the model can answer.

That is where quantized Qwen3.6 on GB10 starts to look very strong.


Test System

| Spec | Detail |
| --- | --- |
| Hardware | NVIDIA GB10 / Project DIGITS-class desktop |
| GPU reported by llama.cpp | NVIDIA GB10, compute capability 12.1 |
| VRAM / unified memory reported by llama.cpp | 124,546 MiB |
| System memory | 121 GiB |
| CPU | 20-core ARM aarch64: Cortex-X925 + Cortex-A725 |
| OS / kernel | Linux 6.17.0-1014-nvidia, aarch64 |
| CUDA | 13.0, nvcc 13.0.88 |
| NVIDIA driver | 580.126.09 |
| Runtime | llama.cpp built locally with CUDA |
| llama.cpp commit | cce09f0b2b37028caf6f549c976ba16b3e8703d8 |
| Build | GNU 13.3.0 for Linux aarch64 |
| Main offload setting | -ngl 99 (full GPU offload) |
| Benchmark date | 2026-05-12 / 2026-05-13 |

The model family tested was Qwen3.6-35B-A3B MoE in GGUF form. The quantized variants came from the Unsloth GGUF release, and the BF16 GGUF was used as the raw-precision reference.


Variants Tested

| Variant | Type | Approx. Disk Size | Role in Test |
| --- | --- | --- | --- |
| MXFP4_MOE | Unsloth MXFP4 MoE | 21 GB | Aggressive practical quant; expected speed winner |
| UD-Q4_K_M | Unsloth Q4 | 21 GB | Main daily-driver candidate |
| UD-Q5_K_M | Unsloth Q5 | 25 GB | Middle-ground candidate |
| Q8_0 | Unsloth Q8 | 35 GB | Higher-precision candidate |
| BF16 | Raw BF16 GGUF, 2 shards | 65 GB | Reference baseline |

Headline Performance Results

| Variant | Size | Best Decode | Best Prompt / Prefill | Best Mixed Long-Prompt | Best Mixed Setting |
| --- | --- | --- | --- | --- | --- |
| MXFP4 | 21 GB | 65.0 tok/s | 2,496.1 tok/s | 1,646.4 tok/s | prefill_heavy, FA=1, batch=4096, ubatch=1024, KV=f16/f16 |
| UD-Q4 | 21 GB | 65.3 tok/s | 2,311.9 tok/s | 1,595.3 tok/s | prefill_heavy, FA=1, batch=4096, ubatch=1024, KV=f16/f16 |
| UD-Q5 | 25 GB | 61.3 tok/s | 2,129.3 tok/s | 1,499.2 tok/s | prefill_heavy, FA=1, batch=4096, ubatch=1024, KV=f16/f16 |
| Q8_0 | 35 GB | 57.7 tok/s | 1,985.5 tok/s | 1,432.0 tok/s | prefill_heavy, FA=1, batch=4096, ubatch=1024, KV=f16/f16 |
| BF16 | 65 GB | 29.5 tok/s | 784.9 tok/s | 208.6 tok/s | rag_long, FA=1, batch=2048, ubatch=512, KV=f16/f16 |

Compared with BF16, the best Unsloth variants were dramatically faster.

| Variant | Decode vs. BF16 | Prefill vs. BF16 | Mixed Long-Prompt vs. BF16 |
| --- | --- | --- | --- |
| MXFP4 | 2.21x | 3.18x | 7.89x |
| UD-Q4 | 2.22x | 2.95x | 7.65x |
| UD-Q5 | 2.08x | 2.71x | 7.19x |
| Q8_0 | 1.96x | 2.53x | 6.87x |

BF16 fit on the machine. That part was impressive.

But it did not make sense for real work. Its disk footprint was roughly 3x that of Q4 or MXFP4, its decode speed was less than half of theirs, and its long-prompt throughput was far behind.


What the Numbers Mean in Real Use

Decode speed is the number people usually notice first. It controls how fast text appears once the model starts answering.

At roughly 65 tok/s, both Q4 and MXFP4 feel interactive. They are fast enough for local chat, coding loops, short analysis, and tool-using agents where the model may need to produce many short responses.

BF16 at roughly 29 tok/s is not unusable, but it feels like a reference mode. It is the mode I would use to sanity-check quantization behavior, not the mode I would leave running for daily work.

The larger story is prompt ingestion.

RAG and agent systems do not spend all their time generating polished prose. They spend a lot of time reading:

  • Retrieved chunks
  • Instructions
  • Prior conversation turns
  • Tool outputs
  • JSON schemas
  • Scratchpads
  • Agent state

If the prompt is 8K tokens, the model has to ingest that context before it can answer. That is where the tuned Q4 and MXFP4 runs pulled away from BF16.
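
Back-of-the-envelope, using each variant's best prefill and decode rates from the table above, an 8,192-token prompt plus a 256-token answer works out to:

  • MXFP4: 8,192 / 2,496.1 ≈ 3.3 s prefill + 256 / 65.0 ≈ 3.9 s decode ≈ 7.2 s total
  • UD-Q4: 8,192 / 2,311.9 ≈ 3.5 s prefill + 256 / 65.3 ≈ 3.9 s decode ≈ 7.5 s total
  • BF16: 8,192 / 784.9 ≈ 10.4 s prefill + 256 / 29.5 ≈ 8.7 s decode ≈ 19.1 s total

These figures mix best-case rates from different profiles, so treat them as rough bounds rather than promises. The shape of the gap is the point: BF16 spends over ten seconds just reading the prompt before the first answer token appears.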

In plain English:

  • Chat feels better when decode is high.
  • RAG feels better when prompt / prefill is high.
  • Agents feel better when both are high, because every step compounds latency.

That is why I would not pick a local quant based on decode speed alone.


Best Settings by Workload

General Chat and Coding Assistant Use

Start here:

llama-server \
  -m Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
  -ngl 99 \
  -fa 1 \
  -c 8192 \
  -b 1024 \
  -ub 256 \
  --host 127.0.0.1 \
  --port 8080

Why this setting:

UD-Q4 had the best decode result in the sweep at 65.3 tok/s, and it also beat MXFP4 in the sample quality checks. For a general local assistant, that balance matters more than squeezing out the final few percent of prefill throughput.
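
Once the server is up, a quick smoke test is llama-server's OpenAI-compatible chat endpoint. A minimal request looks like this (the model value is a placeholder; llama-server serves whichever model it was launched with):

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6-35b-a3b",
    "messages": [
      {"role": "user", "content": "Explain what -ngl 99 does in one sentence."}
    ],
    "max_tokens": 128
  }'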

Best Decode Rows

| Variant | Best Decode | Setting |
| --- | --- | --- |
| UD-Q4 | 65.3 tok/s | chat_standard, FA=1, batch=1024, ubatch=256, KV=f16/f16 |
| MXFP4 | 65.0 tok/s | chat_short, FA=1, batch=1024, ubatch=256, KV=f16/f16 |
| UD-Q5 | 61.3 tok/s | rag_long, FA=1, batch=2048, ubatch=512, KV=f16/f16 |
| Q8_0 | 57.7 tok/s | prefill_heavy, FA=1, batch=1024, ubatch=256, KV=f16/f16 |
| BF16 | 29.5 tok/s | rag_long, FA=1, batch=2048, ubatch=512, KV=f16/f16 |

RAG, Document Processing, and Agent Workloads

Start here:

llama-server \
  -m Qwen3.6-35B-A3B-MXFP4_MOE.gguf \
  -ngl 99 \
  -fa 1 \
  -c 8192 \
  -b 4096 \
  -ub 1024 \
  --host 127.0.0.1 \
  --port 8080

Why this setting:

MXFP4 produced the best prompt / prefill and mixed long-prompt throughput in the entire sweep. If the workload is mostly long prompts plus short answers, that matters more than a tiny decode difference.
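
One practical note for RAG loops, offered as a hedged sketch since exact behavior depends on the llama.cpp build: llama-server's native /completion endpoint accepts a cache_prompt flag, so consecutive requests that share a long common prefix (system prompt plus static context) can reuse the already-ingested KV cache instead of paying full prefill on every turn:

curl http://127.0.0.1:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "SYSTEM INSTRUCTIONS\n\nRETRIEVED CONTEXT\n\nQuestion: ...",
    "n_predict": 256,
    "cache_prompt": true
  }'

If the orchestrator keeps the stable part of the prompt at the front and appends only the new question, the full prefill cost applies mostly to the first request.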

Best Mixed Long-Prompt Rows

| Variant | Mixed tok/s | Setting |
| --- | --- | --- |
| MXFP4 | 1,646.4 | prefill_heavy, FA=1, batch=4096, ubatch=1024, KV=f16/f16 |
| UD-Q4 | 1,595.3 | prefill_heavy, FA=1, batch=4096, ubatch=1024, KV=f16/f16 |
| UD-Q5 | 1,499.2 | prefill_heavy, FA=1, batch=4096, ubatch=1024, KV=f16/f16 |
| Q8_0 | 1,432.0 | prefill_heavy, FA=1, batch=4096, ubatch=1024, KV=f16/f16 |
| BF16 | 208.6 | rag_long, FA=1, batch=2048, ubatch=512, KV=f16/f16 |

Speed, Size, and Quality Tradeoff

The most important practical result is that Q4 and MXFP4 both sit in the sweet spot:

  • 21 GB on disk
  • Roughly 65 tok/s decode
  • Strongest long-prompt throughput
  • Practical enough for daily local inference

Q8 Did Not Justify Its Extra Footprint

Q8_0 was larger than Q4 and MXFP4, but slower in the performance sweep.

It may still matter for sensitive workloads where a specific task proves that Q8 avoids mistakes made by lower-bit variants. But this benchmark does not justify Q8 as the default.

Q5 Was Squeezed From Both Sides

Q5 worked, but it landed in an awkward middle ground.

It was larger than Q4 and MXFP4, while also being slower than both. If you already have a specific reason to prefer Q5, it is usable. But if you are choosing from scratch, Q4 and MXFP4 are more compelling.

BF16 Is a Reference Mode, Not a Daily Driver

BF16 fitting in memory is useful. It gives you a reference point.

But for practical local use, the cost is too high. It is much larger, much slower, and far behind on long-prompt workloads.


Sample Quality Checks: Q4 Beat MXFP4

The performance sweep pointed to Q4 and MXFP4 as the two serious candidates, so I ran sample generation-based lm-evaluation-harness checks against both variants through llama-server.
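
The harness was pointed at the running server through its local-completions adapter. The GSM8K sample run looked roughly like this (exact model_args vary by lm-eval version; the model name is a placeholder):

lm_eval --model local-completions \
  --model_args model=qwen3.6-35b-a3b,base_url=http://127.0.0.1:8080/v1/completions,num_concurrent=1 \
  --tasks gsm8k \
  --limit 100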

These are not official leaderboard scores. They are local sanity checks with limited sample sizes.

They are still useful because both variants were tested the same way, on the same machine, with the same runtime.

| Variant | GSM8K Flexible Exact Match (limit 100) | IFEval Prompt Strict (limit 25) | IFEval Instruction Strict (limit 25) |
| --- | --- | --- | --- |
| MXFP4 | 0.44 ± 0.0499 | 0.20 ± 0.0816 | 0.405 |
| UD-Q4 | 0.54 ± 0.0501 | 0.24 ± 0.0872 | 0.432 |

Q4 won both sample evals.

The margins are not large enough to make broad claims about model quality, but they are large enough to affect the deployment recommendation. If I had to pick one quant today, I would pick UD-Q4_K_M.


A Useful Failure Mode: Thinking Tokens and Extraction

One issue showed up immediately in the GSM8K samples.

The model sometimes emitted <think>-style reasoning or verbose setup text, and lm-eval’s extractor grabbed the wrong number.

For example, on the classic ducks-and-eggs GSM8K prompt, the expected answer is 18. The response began by restating the problem and included the sale price $2; the flexible extractor picked $2 instead of the final answer.

That does not necessarily mean the model could not solve the problem. It means the local completion format and answer extraction were not aligned.

That matters in production too.

If you use a reasoning model locally, you need to control output format. For math, extraction, and automation tasks, use explicit instructions such as:

Return only the final numeric answer. Do not include reasoning.

Or use a JSON schema if the runtime supports it.
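
With llama-server specifically, one option (a sketch, assuming a reasonably recent build) is the json_schema field on the native /completion endpoint, which constrains decoding to output that matches the schema:

curl http://127.0.0.1:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Janet has 3 apples and buys 5 more. How many apples does she have? Answer as JSON.",
    "n_predict": 64,
    "json_schema": {
      "type": "object",
      "properties": { "answer": { "type": "number" } },
      "required": ["answer"]
    }
  }'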

This is one reason I treat the eval numbers as sample checks, not official scores.


Comparison to the Earlier Gemma 4 GB10 Benchmark

In my earlier Gemma 4 GB10 benchmark, the fastest practical Gemma 4 variant was the 26B-A4B MoE Q4_K_M run at about:

  • 61.1 tok/s generation
  • 616 tok/s prompt processing

The 31B dense BF16 run was much slower, at around:

  • 3.9 tok/s generation
  • 143 tok/s prompt processing

Qwen3.6-35B-A3B changes the local story on both axes: generation speed and prompt processing.

| Model / Run | Generation | Prompt Processing | Disk Size |
| --- | --- | --- | --- |
| Gemma 4 26B-A4B MoE Q4_K_M | 61.1 tok/s | 616 tok/s | 17 GB |
| Qwen3.6-35B-A3B UD-Q4_K_M | 65.3 tok/s (best decode) | 2,311.9 tok/s (best prefill) | 21 GB |
| Qwen3.6-35B-A3B MXFP4_MOE | 65.0 tok/s (best decode) | 2,496.1 tok/s (best prefill) | 21 GB |

This is not an apples-to-apples benchmark suite. The Gemma 4 article used Ollama practical workloads, while this Qwen3.6 run used llama.cpp sweep profiles and lm-eval samples.

Still, the direction is clear: Qwen3.6’s local throughput is strong enough that the bottleneck moves from:

Can I run it?

to:

Which settings match my workload?

The lesson is similar across both articles: mixture-of-experts models are a very good fit for GB10-class local inference. Dense BF16 runs are useful for reference, but the practical daily drivers are quantized MoE variants.


What Surprised Me

BF16 Fit, But I Would Not Use It Day to Day

The machine can load the BF16 GGUF. That is impressive.

But the throughput gap is too large. BF16 was roughly 65 GB on disk and only reached 29.5 tok/s decode. For most local work, I would rather run Q4 or MXFP4 and spend the saved memory on context, parallel services, embeddings, OCR, or other parts of the stack.

Q8 Did Not Earn Its Extra 14 GB

Q8_0 was larger than Q4/MXFP4 and slower in the performance sweep.

Without a stronger quality result, it is hard to recommend. I would only reach for it if a specific task shows Q4 or MXFP4 making mistakes that Q8 avoids.

Q4 Beat MXFP4 in the Quality Samples

I expected MXFP4 to be the speed pick and Q4 to be the conservative pick.

The performance data mostly supports that. What I did not expect was Q4 beating MXFP4 on both GSM8K and IFEval samples while still matching it on decode speed.

That is why Q4 is my default recommendation.

IFEval Is Expensive Locally

GSM8K-100 finished quickly. IFEval was a different story.

A full IFEval-100 attempt had many examples taking 60–120+ seconds, and some took several minutes. The IFEval-25 runs were enough for a sample check, but full IFEval is an overnight job.

That is useful information if you are building your own benchmark loop:

  • Run GSM8K-style checks often.
  • Run IFEval less often.
  • Run the full suite overnight.

Flash Attention, Batch, UBatch, and KV Cache

Flash attention should stay on for this setup. The best rows consistently used:

-fa 1

For chat and decode-heavy usage, smaller batching was usually enough:

-b 1024 -ub 256

For long-prompt workloads, larger batching helped:

-b 4096 -ub 1024

KV cache compression was not a free win.

The q8_0/q8_0 KV settings worked with flash attention on, but many flash-attention-off plus q8_0 KV combinations produced no usable throughput rows in this build.

My default is:

KV=f16/f16

I would only switch KV types after validating the exact workload and runtime build.
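
If memory pressure does force the experiment, the KV types are set with -ctk and -ctv. A sketch, keeping flash attention on since q8_0 KV without it produced no usable rows in this build:

llama-server \
  -m Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
  -ngl 99 \
  -fa 1 \
  -c 8192 \
  -ctk q8_0 \
  -ctv q8_0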


Things That Did Not Work Cleanly

This is the section I always want in benchmark posts and rarely see.

llama-cli prompt mode hung in this environment. llama-bench and llama-server were stable, so I used those instead.

Multiple-choice / loglikelihood tasks such as HellaSwag, ARC-Challenge, TruthfulQA MC2, and Winogrande hit a local API compatibility issue. llama-server’s completion endpoint did not return the exact token_logprobs schema expected by lm-eval’s OpenAI/local-completions adapter in this environment.

Rather than patch the adapter during the run, I kept the quality section to generation-based tasks.

IFEval-100 was too slow for an interactive benchmark loop. I stopped the oversized run and replaced it with GSM8K-100 plus IFEval-25.

Finally, 60 sweep entries produced no usable throughput rows, mostly q8_0 KV cache combinations with flash attention disabled. I excluded those from best-setting comparisons and treated them as invalid runtime settings for this build.


Methodology

The performance suite used llama-bench with three repetitions per run.

Workload Profiles

| Profile | Prompt Tokens | Generated Tokens | What It Approximates |
| --- | --- | --- | --- |
| decode_only | 0 | 256 | Pure generation speed |
| chat_short | 256 | 256 | Short chat turn |
| chat_standard | 512 | 512 | Normal assistant response |
| rag_long | 2048 | 256 | Retrieved context plus concise answer |
| prefill_heavy | 8192 | 128 | Document / RAG / agent state ingestion |

Variables Swept

  • Flash attention on/off
  • Batch size: 1024, 2048, 4096
  • UBatch size: 256, 512, 1024
  • KV cache: f16/f16 and q8_0/q8_0
  • Full GPU offload: -ngl 99
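
Concretely, one sweep entry for the rag_long profile looked roughly like the command below. llama-bench's -pg flag takes a prompt,generation token pair, and -r sets the repetition count; flag spellings can differ slightly across llama.cpp versions:

llama-bench \
  -m Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
  -ngl 99 \
  -fa 1 \
  -pg 2048,256 \
  -b 2048 \
  -ub 512 \
  -ctk f16 \
  -ctv f16 \
  -r 3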

The quality checks used lm-evaluation-harness through local llama-server instances with limited sample sizes for GSM8K and IFEval.


Validity and Limitations

This is a local inference tuning study, not a leaderboard submission.

The performance results are the strongest part of the article. They compare the same model family on the same machine using the same llama.cpp build and controlled runtime settings. Those results are useful for deciding what to run on this GB10 system.

The quality results are narrower.

GSM8K used 100 examples, not the full 1,319-example test set. IFEval used 25 examples, not the full 541. lm-eval itself warns that --limit should be used for testing, not official metrics.

The sample evals are still useful for relative comparison between Q4 and MXFP4, but I would not cite them as official Qwen3.6 scores.

There is also a formatting caveat. GSM8K strict-match was 0.0 for both variants because the local completion format did not consistently emit the canonical #### answer format. I reported flexible exact match because it extracts the numeric answer from free-form generations.


What I Would Run Next

For a follow-up, I would not rerun everything. I would focus on the two serious candidates: Q4 and MXFP4.

The next tests I would run are:

  1. Full GSM8K on Q4 overnight.
  2. Full IFEval on Q4 overnight.
  3. Patch or replace the logprobs adapter so HellaSwag, ARC-Challenge, TruthfulQA MC2, Winogrande, and selected MMLU subjects work cleanly through llama.cpp.
  4. Add HumanEval and MBPP if coding use cases matter.
  5. Compare Q4 and MXFP4 on a real internal RAG workload, not just synthetic prompt profiles.

The open question is not whether GB10 can run this class of model. It can.

The real question is which local setup gives you the best experience for the work you actually do.

For me, after this run, the answer is simple:

Use Q4 by default. Use MXFP4 when long prompts dominate.

gb10 · local inference