Research · 20 min read

Qwen3.6-35B on NVIDIA GB10: 243 llama.cpp Runs to Find the Best Local Quant

A detailed benchmark of Qwen3.6-35B-A3B on NVIDIA GB10 using 243 llama.cpp runs, comparing BF16, Q8, Q5, Q4, and MXFP4 GGUF variants for local AI inference, RAG, agents, and long-context workloads.

Marcus Callahan

Summary

Qwen3.6-35B-A3B runs comfortably on NVIDIA GB10-class hardware. That was not the real question.

The real question was which GGUF variant actually makes sense to use every day.

Raw BF16 fits in memory, but it is not the best daily-driver option. Q8 is larger than Q4 without being faster. Q5 works, but it does not clearly justify its extra size. The real decision comes down to UD-Q4_K_M versus MXFP4_MOE.

After 243 llama.cpp tuning runs and two sample lm-evaluation-harness quality checks, my recommendation is straightforward:

  • Use UD-Q4_K_M for general local chat, coding help, structured extraction, and agent work.
  • Use MXFP4_MOE when prompt ingestion is the bottleneck: RAG, long-context workflows, document processing, and agents carrying a lot of state.
  • Keep raw BF16 as a reference baseline, not a daily driver.
  • Keep flash attention on.
  • Start with f16/f16 KV cache unless memory pressure forces you to experiment.

That is the short version. The rest of this article explains why.


Executive Summary

| Question | Answer |
| --- | --- |
| Best all-around local variant | UD-Q4_K_M |
| Best long-prompt / RAG throughput | MXFP4_MOE |
| Fastest decode result | UD-Q4_K_M — 65.3 tok/s |
| Fastest prompt / prefill result | MXFP4_MOE — 2,496 tok/s |
| Best sample quality result | UD-Q4_K_M |
| Best daily-driver settings for chat | -ngl 99 -fa 1 -c 8192 -b 1024 -ub 256 |
| Best starting settings for RAG / long prompts | -ngl 99 -fa 1 -c 8192 -b 4096 -ub 1024 |
| Safe KV cache default | f16/f16 |
| Variant I would not use daily | BF16 GGUF |

The headline result is not just decode speed. The bigger result is the long-prompt gap.

On the best mixed long-prompt profile:

  • MXFP4 was 7.9x faster than BF16
  • UD-Q4 was 7.7x faster than BF16

That changes what feels practical locally.

For a chatbot, the difference between 30 tok/s and 65 tok/s is obvious. But for RAG and agent systems, prompt processing is often the larger bottleneck. Every retrieved document, tool result, scratchpad entry, JSON schema, and conversation-history chunk has to be ingested before the model can answer.

That is where quantized Qwen3.6 on GB10 starts to look very strong.


Test System

| Spec | Detail |
| --- | --- |
| Hardware | NVIDIA GB10 / Project DIGITS-class desktop |
| GPU reported by llama.cpp | NVIDIA GB10, compute capability 12.1 |
| VRAM / unified memory reported by llama.cpp | 124,546 MiB |
| System memory | 121 GiB |
| CPU | 20-core ARM aarch64: Cortex-X925 + Cortex-A725 |
| OS / kernel | Linux 6.17.0-1014-nvidia, aarch64 |
| CUDA | 13.0, nvcc 13.0.88 |
| NVIDIA driver | 580.126.09 |
| Runtime | llama.cpp built locally with CUDA |
| llama.cpp commit | cce09f0b2b37028caf6f549c976ba16b3e8703d8 |
| Build | GNU 13.3.0 for Linux aarch64 |
| Main offload setting | -ngl 99 (full GPU offload) |
| Benchmark date | 2026-05-12 / 2026-05-13 |

The model family tested was Qwen3.6-35B-A3B MoE in GGUF form. The quantized variants came from the Unsloth GGUF release, and the BF16 GGUF was used as the raw-precision reference.


Variants Tested

| Variant | Type | Approx. Disk Size | Role in Test |
| --- | --- | --- | --- |
| MXFP4_MOE | Unsloth MXFP4 MoE | 21 GB | Aggressive practical quant; expected speed winner |
| UD-Q4_K_M | Unsloth Q4 | 21 GB | Main daily-driver candidate |
| UD-Q5_K_M | Unsloth Q5 | 25 GB | Middle-ground candidate |
| Q8_0 | Unsloth Q8 | 35 GB | Higher-precision candidate |
| BF16 | Raw BF16 GGUF, 2 shards | 65 GB | Reference baseline |

Headline Performance Results

| Variant | Size | Best Decode | Best Prompt / Prefill | Best Mixed Long-Prompt | Best Mixed Setting |
| --- | --- | --- | --- | --- | --- |
| MXFP4 | 21 GB | 65.0 tok/s | 2,496.1 tok/s | 1,646.4 tok/s | prefill_heavy, FA=1, batch=4096, ubatch=1024, KV=f16/f16 |
| UD-Q4 | 21 GB | 65.3 tok/s | 2,311.9 tok/s | 1,595.3 tok/s | prefill_heavy, FA=1, batch=4096, ubatch=1024, KV=f16/f16 |
| UD-Q5 | 25 GB | 61.3 tok/s | 2,129.3 tok/s | 1,499.2 tok/s | prefill_heavy, FA=1, batch=4096, ubatch=1024, KV=f16/f16 |
| Q8_0 | 35 GB | 57.7 tok/s | 1,985.5 tok/s | 1,432.0 tok/s | prefill_heavy, FA=1, batch=4096, ubatch=1024, KV=f16/f16 |
| BF16 | 65 GB | 29.5 tok/s | 784.9 tok/s | 208.6 tok/s | rag_long, FA=1, batch=2048, ubatch=512, KV=f16/f16 |

Compared with BF16, the best Unsloth variants were dramatically faster.

| Variant | Decode vs. BF16 | Prefill vs. BF16 | Mixed Long-Prompt vs. BF16 |
| --- | --- | --- | --- |
| MXFP4 | 2.21x | 3.18x | 7.89x |
| UD-Q4 | 2.22x | 2.95x | 7.65x |
| UD-Q5 | 2.08x | 2.71x | 7.19x |
| Q8_0 | 1.96x | 2.53x | 6.87x |

BF16 fit on the machine. That part was impressive.

But it did not make sense for real work. Its disk footprint was roughly 3x that of Q4 or MXFP4, its decode speed was less than half of theirs, and its long-prompt throughput was far behind.


What the Numbers Mean in Real Use

Decode speed is the number people usually notice first. It controls how fast text appears once the model starts answering.

At roughly 65 tok/s, both Q4 and MXFP4 feel interactive. They are fast enough for local chat, coding loops, short analysis, and tool-using agents where the model may need to produce many short responses.

BF16 at roughly 29 tok/s is not unusable, but it feels like a reference mode. It is the mode I would use to sanity-check quantization behavior, not the mode I would leave running for daily work.

The larger story is prompt ingestion.

RAG and agent systems do not spend all their time generating polished prose. They spend a lot of time reading:

  • Retrieved chunks
  • Instructions
  • Prior conversation turns
  • Tool outputs
  • JSON schemas
  • Scratchpads
  • Agent state

If the prompt is 8K tokens, the model has to ingest that context before it can answer. That is where the tuned Q4 and MXFP4 runs pulled away from BF16.
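
Back-of-the-envelope, using each variant's best prefill and decode rates from the table above, an 8,192-token prompt plus a 256-token answer works out to:

  • MXFP4: 8,192 / 2,496.1 ≈ 3.3 s prefill + 256 / 65.0 ≈ 3.9 s decode ≈ 7.2 s total
  • UD-Q4: 8,192 / 2,311.9 ≈ 3.5 s prefill + 256 / 65.3 ≈ 3.9 s decode ≈ 7.5 s total
  • BF16: 8,192 / 784.9 ≈ 10.4 s prefill + 256 / 29.5 ≈ 8.7 s decode ≈ 19.1 s total

These figures mix best-case rates from different profiles, so treat them as rough bounds rather than promises. The shape of the gap is the point: BF16 spends over ten seconds just reading the prompt before the first answer token appears.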

In plain English:

  • Chat feels better when decode is high.
  • RAG feels better when prompt / prefill is high.
  • Agents feel better when both are high, because every step compounds latency.

That is why I would not pick a local quant based on decode speed alone.


Best Settings by Workload

General Chat and Coding Assistant Use

Start here:

llama-server \
  -m Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
  -ngl 99 \
  -fa 1 \
  -c 8192 \
  -b 1024 \
  -ub 256 \
  --host 127.0.0.1 \
  --port 8080

Why this setting:

UD-Q4 had the best decode result in the sweep at 65.3 tok/s, and it also beat MXFP4 in the sample quality checks. For a general local assistant, that balance matters more than squeezing out the final few percent of prefill throughput.
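
Once the server is up, a quick smoke test is llama-server's OpenAI-compatible chat endpoint. A minimal request looks like this (the model value is a placeholder; llama-server serves whichever model it was launched with):

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6-35b-a3b",
    "messages": [
      {"role": "user", "content": "Explain what -ngl 99 does in one sentence."}
    ],
    "max_tokens": 128
  }'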

Best Decode Rows

| Variant | Best Decode | Setting |
| --- | --- | --- |
| UD-Q4 | 65.3 tok/s | chat_standard, FA=1, batch=1024, ubatch=256, KV=f16/f16 |
| MXFP4 | 65.0 tok/s | chat_short, FA=1, batch=1024, ubatch=256, KV=f16/f16 |
| UD-Q5 | 61.3 tok/s | rag_long, FA=1, batch=2048, ubatch=512, KV=f16/f16 |
| Q8_0 | 57.7 tok/s | prefill_heavy, FA=1, batch=1024, ubatch=256, KV=f16/f16 |
| BF16 | 29.5 tok/s | rag_long, FA=1, batch=2048, ubatch=512, KV=f16/f16 |

RAG, Document Processing, and Agent Workloads

Start here:

llama-server \
  -m Qwen3.6-35B-A3B-MXFP4_MOE.gguf \
  -ngl 99 \
  -fa 1 \
  -c 8192 \
  -b 4096 \
  -ub 1024 \
  --host 127.0.0.1 \
  --port 8080

Why this setting:

MXFP4 produced the best prompt / prefill and mixed long-prompt throughput in the entire sweep. If the workload is mostly long prompts plus short answers, that matters more than a tiny decode difference.
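
One practical note for RAG loops, offered as a hedged sketch since exact behavior depends on the llama.cpp build: llama-server's native /completion endpoint accepts a cache_prompt flag, so consecutive requests that share a long common prefix (system prompt plus static context) can reuse the already-ingested KV cache instead of paying full prefill on every turn:

curl http://127.0.0.1:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "SYSTEM INSTRUCTIONS\n\nRETRIEVED CONTEXT\n\nQuestion: ...",
    "n_predict": 256,
    "cache_prompt": true
  }'

If the orchestrator keeps the stable part of the prompt at the front and appends only the new question, the full prefill cost applies mostly to the first request.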

Best Mixed Long-Prompt Rows

| Variant | Mixed tok/s | Setting |
| --- | --- | --- |
| MXFP4 | 1,646.4 | prefill_heavy, FA=1, batch=4096, ubatch=1024, KV=f16/f16 |
| UD-Q4 | 1,595.3 | prefill_heavy, FA=1, batch=4096, ubatch=1024, KV=f16/f16 |
| UD-Q5 | 1,499.2 | prefill_heavy, FA=1, batch=4096, ubatch=1024, KV=f16/f16 |
| Q8_0 | 1,432.0 | prefill_heavy, FA=1, batch=4096, ubatch=1024, KV=f16/f16 |
| BF16 | 208.6 | rag_long, FA=1, batch=2048, ubatch=512, KV=f16/f16 |

Speed, Size, and Quality Tradeoff

The most important practical result is that Q4 and MXFP4 both sit in the sweet spot:

  • 21 GB on disk
  • Roughly 65 tok/s decode
  • Strongest long-prompt throughput
  • Practical enough for daily local inference

Q8 Did Not Justify Its Extra Footprint

Q8_0 was larger than Q4 and MXFP4, but slower in the performance sweep.

It may still matter for sensitive workloads where a specific task proves that Q8 avoids mistakes made by lower-bit variants. But this benchmark does not justify Q8 as the default.

Q5 Was Squeezed From Both Sides

Q5 worked, but it landed in an awkward middle ground.

It was larger than Q4 and MXFP4, while also being slower than both. If you already have a specific reason to prefer Q5, it is usable. But if you are choosing from scratch, Q4 and MXFP4 are more compelling.

BF16 Is a Reference Mode, Not a Daily Driver

BF16 fitting in memory is useful. It gives you a reference point.

But for practical local use, the cost is too high. It is much larger, much slower, and far behind on long-prompt workloads.


Sample Quality Checks: Q4 Beat MXFP4

The performance sweep pointed to Q4 and MXFP4 as the two serious candidates, so I ran sample generation-based lm-evaluation-harness checks against both variants through llama-server.
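
The harness was pointed at the running server through its local-completions adapter. The GSM8K sample run looked roughly like this (exact model_args vary by lm-eval version; the model name is a placeholder):

lm_eval --model local-completions \
  --model_args model=qwen3.6-35b-a3b,base_url=http://127.0.0.1:8080/v1/completions,num_concurrent=1 \
  --tasks gsm8k \
  --limit 100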

These are not official leaderboard scores. They are local sanity checks with limited sample sizes.

They are still useful because both variants were tested the same way, on the same machine, with the same runtime.

| Variant | GSM8K Flexible Exact Match (limit 100) | IFEval Prompt Strict (limit 25) | IFEval Instruction Strict (limit 25) |
| --- | --- | --- | --- |
| MXFP4 | 0.44 ± 0.0499 | 0.20 ± 0.0816 | 0.405 |
| UD-Q4 | 0.54 ± 0.0501 | 0.24 ± 0.0872 | 0.432 |

Q4 won both sample evals.

The margins are not large enough to make broad claims about model quality, but they are large enough to affect the deployment recommendation. If I had to pick one quant today, I would pick UD-Q4_K_M.


A Useful Failure Mode: Thinking Tokens and Extraction

One issue showed up immediately in the GSM8K samples.

The model sometimes emitted <think>-style reasoning or verbose setup text, and lm-eval’s extractor grabbed the wrong number.

For example, on the classic ducks-and-eggs GSM8K prompt, the expected answer is 18. The response began by restating the problem and included the sale price $2; the flexible extractor picked $2 instead of the final answer.

That does not necessarily mean the model could not solve the problem. It means the local completion format and answer extraction were not aligned.

That matters in production too.

If you use a reasoning model locally, you need to control output format. For math, extraction, and automation tasks, use explicit instructions such as:

Return only the final numeric answer. Do not include reasoning.

Or use a JSON schema if the runtime supports it.
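
With llama-server specifically, one option (a sketch, assuming a reasonably recent build) is the json_schema field on the native /completion endpoint, which constrains decoding to output that matches the schema:

curl http://127.0.0.1:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Janet has 3 apples and buys 5 more. How many apples does she have? Answer as JSON.",
    "n_predict": 64,
    "json_schema": {
      "type": "object",
      "properties": { "answer": { "type": "number" } },
      "required": ["answer"]
    }
  }'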

This is one reason I treat the eval numbers as sample checks, not official scores.


Comparison to the Earlier Gemma 4 GB10 Benchmark

In my earlier Gemma 4 GB10 benchmark, the fastest practical Gemma 4 variant was the 26B-A4B MoE Q4_K_M run at about:

  • 61.1 tok/s generation
  • 616 tok/s prompt processing

The 31B dense BF16 run was much slower, at around:

  • 3.9 tok/s generation
  • 143 tok/s prompt processing

Qwen3.6-35B-A3B changes the local story on both axes: generation speed and prompt processing.

| Model / Run | Generation | Prompt Processing | Disk Size |
| --- | --- | --- | --- |
| Gemma 4 26B-A4B MoE Q4_K_M | 61.1 tok/s | 616 tok/s | 17 GB |
| Qwen3.6-35B-A3B UD-Q4_K_M | 65.3 tok/s (best decode) | 2,311.9 tok/s (best prefill) | 21 GB |
| Qwen3.6-35B-A3B MXFP4_MOE | 65.0 tok/s (best decode) | 2,496.1 tok/s (best prefill) | 21 GB |

This is not an apples-to-apples benchmark suite. The Gemma 4 article used Ollama practical workloads, while this Qwen3.6 run used llama.cpp sweep profiles and lm-eval samples.

Still, the direction is clear: Qwen3.6’s local throughput is strong enough that the bottleneck moves from:

Can I run it?

to:

Which settings match my workload?

The lesson is similar across both articles: mixture-of-experts models are a very good fit for GB10-class local inference. Dense BF16 runs are useful for reference, but the practical daily drivers are quantized MoE variants.


What Surprised Me

BF16 Fit, But I Would Not Use It Day to Day

The machine can load the BF16 GGUF. That is impressive.

But the throughput gap is too large. BF16 was roughly 65 GB on disk and only reached 29.5 tok/s decode. For most local work, I would rather run Q4 or MXFP4 and spend the saved memory on context, parallel services, embeddings, OCR, or other parts of the stack.

Q8 Did Not Earn Its Extra 14 GB

Q8_0 was larger than Q4/MXFP4 and slower in the performance sweep.

Without a stronger quality result, it is hard to recommend. I would only reach for it if a specific task shows Q4 or MXFP4 making mistakes that Q8 avoids.

Q4 Beat MXFP4 in the Quality Samples

I expected MXFP4 to be the speed pick and Q4 to be the conservative pick.

The performance data mostly supports that. What I did not expect was Q4 beating MXFP4 on both GSM8K and IFEval samples while still matching it on decode speed.

That is why Q4 is my default recommendation.

IFEval Is Expensive Locally

GSM8K-100 finished quickly. IFEval was a different story.

A full IFEval-100 attempt had many examples taking 60–120+ seconds, and some took several minutes. The IFEval-25 runs were enough for a sample check, but full IFEval is an overnight job.

That is useful information if you are building your own benchmark loop:

  • Run GSM8K-style checks often.
  • Run IFEval less often.
  • Run the full suite overnight.

Flash Attention, Batch, UBatch, and KV Cache

Flash attention should stay on for this setup. The best rows consistently used:

-fa 1

For chat and decode-heavy usage, smaller batching was usually enough:

-b 1024 -ub 256

For long-prompt workloads, larger batching helped:

-b 4096 -ub 1024

KV cache compression was not a free win.

The q8_0/q8_0 KV settings worked with flash attention on, but many flash-attention-off plus q8_0 KV combinations produced no usable throughput rows in this build.

My default is:

KV=f16/f16

I would only switch KV types after validating the exact workload and runtime build.
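
If memory pressure does force the experiment, the KV types are set with -ctk and -ctv. A sketch, keeping flash attention on since q8_0 KV without it produced no usable rows in this build:

llama-server \
  -m Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
  -ngl 99 \
  -fa 1 \
  -c 8192 \
  -ctk q8_0 \
  -ctv q8_0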


Things That Did Not Work Cleanly

This is the section I always want in benchmark posts and rarely see.

llama-cli prompt mode hung in this environment. llama-bench and llama-server were stable, so I used those instead.

Multiple-choice / loglikelihood tasks such as HellaSwag, ARC-Challenge, TruthfulQA MC2, and Winogrande hit a local API compatibility issue. llama-server’s completion endpoint did not return the exact token_logprobs schema expected by lm-eval’s OpenAI/local-completions adapter in this environment.

Rather than patch the adapter during the run, I kept the quality section to generation-based tasks.

IFEval-100 was too slow for an interactive benchmark loop. I stopped the oversized run and replaced it with GSM8K-100 plus IFEval-25.

Finally, 60 sweep entries produced no usable throughput rows, mostly q8_0 KV cache combinations with flash attention disabled. I excluded those from best-setting comparisons and treated them as invalid runtime settings for this build.


Methodology

The performance suite used llama-bench with three repetitions per run.

Workload Profiles

| Profile | Prompt Tokens | Generated Tokens | What It Approximates |
| --- | --- | --- | --- |
| decode_only | 0 | 256 | Pure generation speed |
| chat_short | 256 | 256 | Short chat turn |
| chat_standard | 512 | 512 | Normal assistant response |
| rag_long | 2048 | 256 | Retrieved context plus concise answer |
| prefill_heavy | 8192 | 128 | Document / RAG / agent state ingestion |

Variables Swept

  • Flash attention on/off
  • Batch size: 1024, 2048, 4096
  • UBatch size: 256, 512, 1024
  • KV cache: f16/f16 and q8_0/q8_0
  • Full GPU offload: -ngl 99
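
Concretely, one sweep entry for the rag_long profile looked roughly like the command below. llama-bench's -pg flag takes a prompt,generation token pair, and -r sets the repetition count; flag spellings can differ slightly across llama.cpp versions:

llama-bench \
  -m Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
  -ngl 99 \
  -fa 1 \
  -pg 2048,256 \
  -b 2048 \
  -ub 512 \
  -ctk f16 \
  -ctv f16 \
  -r 3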

The quality checks used lm-evaluation-harness through local llama-server instances with limited sample sizes for GSM8K and IFEval.


Validity and Limitations

This is a local inference tuning study, not a leaderboard submission.

The performance results are the strongest part of the article. They compare the same model family on the same machine using the same llama.cpp build and controlled runtime settings. Those results are useful for deciding what to run on this GB10 system.

The quality results are narrower.

GSM8K used 100 examples, not the full 1,319-example test set. IFEval used 25 examples, not the full 541. lm-eval itself warns that --limit should be used for testing, not official metrics.

The sample evals are still useful for relative comparison between Q4 and MXFP4, but I would not cite them as official Qwen3.6 scores.

There is also a formatting caveat. GSM8K strict-match was 0.0 for both variants because the local completion format did not consistently emit the canonical #### answer format. I reported flexible exact match because it extracts the numeric answer from free-form generations.


What I Would Run Next

For a follow-up, I would not rerun everything. I would focus on the two serious candidates: Q4 and MXFP4.

The next tests I would run are:

  1. Full GSM8K on Q4 overnight.
  2. Full IFEval on Q4 overnight.
  3. Patch or replace the logprobs adapter so HellaSwag, ARC-Challenge, TruthfulQA MC2, Winogrande, and selected MMLU subjects work cleanly through llama.cpp.
  4. Add HumanEval and MBPP if coding use cases matter.
  5. Compare Q4 and MXFP4 on a real internal RAG workload, not just synthetic prompt profiles.

The open question is not whether GB10 can run this class of model. It can.

The real question is which local setup gives you the best experience for the work you actually do.

For me, after this run, the answer is simple:

Use Q4 by default. Use MXFP4 when long prompts dominate.

gb10 · local inference