Local AI hardware comparisons usually collapse into one number: tokens per second.

That number matters, but it no longer tells the whole story.

Modern local AI workloads are not just simple 7B chatbot demos. They now include long-context coding assistants, RAG pipelines, autonomous agents, document extraction systems, OCR-plus-LLM workflows, and 30B-to-70B class models.

For those workloads, the better question is not simply:

Which GPU is faster?

The better question is:

Which machine gives the better local AI experience for the models, quantization levels, and context sizes you actually want to run?

To test that, I compared two very different local AI systems:

System	Accelerator	Memory Profile
RTX 4090 workstation	NVIDIA GeForce RTX 4090	24GB VRAM
NVIDIA GB10 system	NVIDIA GB10	~124,546 MiB (~122 GiB) CUDA-visible memory reported by llama.cpp

The short version is simple:

The RTX 4090 is brutally fast when the model fits cleanly in 24GB of VRAM.

GB10 is slower on clean-fit workloads, but becomes far more useful once model size, quantization level, or context length pushes beyond the 4090’s memory comfort zone.

That became obvious in the most important result of the test: Qwen3.6 35B Q4 vs Q5.

At Q4, the RTX 4090 crushed GB10.

At Q5, the RTX 4090 technically ran the model, but performance collapsed because it hit the practical VRAM wall. GB10 stayed stable and usable.

That is the real story of local AI hardware in 2026:

The RTX 4090 is fastest when it fits. GB10 is better when fit becomes the problem.

Test Systems

System	GPU / Accelerator	Memory	Notes
GB10	NVIDIA GB10	~124,546 MiB CUDA-visible memory	ARM/aarch64, CUDA 13.0
RTX 4090 workstation	NVIDIA GeForce RTX 4090	24,563 MiB VRAM	Windows 10 Pro, i9-14900K, ~128GB system RAM

The important distinction is that the RTX 4090 workstation has plenty of system RAM, but only 24GB of GPU VRAM.

For local LLM inference, that boundary matters.

If the model, KV cache, and runtime buffers fit cleanly inside VRAM, the 4090 is extremely fast. If they do not, performance can collapse quickly.

GB10 does not match the 4090’s raw throughput on smaller clean-fit workloads, but it has far more accelerator-visible memory. That makes it much more forgiving for larger models, higher-precision quantization levels, long context tests, and memory-heavy experiments.

Software Setup

Both systems used llama.cpp.

GB10

llama.cpp commit: cce09f0b2b37
CUDA: 13.0
Main flags:
-ngl 99
-fa 1 / -fa on
-ctk f16
-ctv f16

RTX 4090

llama.cpp Windows CUDA prebuilt binary
CUDA backend active
Main flags:
-ngl 99
-fa 1 / -fa on
-ctk f16
-ctv f16

The 4090 used a Windows CUDA prebuilt binary, while GB10 used a source build. This should be viewed as a practical workstation comparison, not a formal microarchitecture benchmark.

Still, the differences were large enough that the practical conclusions are clear.

Models Tested

The main model family tested was:

Qwen3.6-35B-A3B

Quantized variants:

Model	Approx. Size	Practical Meaning
Qwen3.6 MXFP4	~20.22 GiB	Fast, 4090-friendly
Qwen3.6 UD-Q4_K_M	~20.61 GiB	Practical 4090 sweet spot
Qwen3.6 UD-Q5_K_M	~24.64 GiB	Crosses the 4090 comfort zone
Qwen3.6 Q8_0	~34.37 GiB	GB10/headroom territory
Qwen3.6 BF16 shards	~64.62 GiB total	GB10/headroom territory

I also tested a larger memory-wall model:

Llama-3.3-70B-Instruct Q4_K_M
Approx. size: ~42.5GB

That dense 70B model is not a clean full-GPU fit for a 24GB RTX 4090. It can be attempted with partial offload, but that is not the same workload class as a model that fits cleanly in VRAM.

Benchmark Workloads

I used three categories of testing:

Synthetic llama-bench throughput tests
Long-context / KV-cache pressure tests
Real llama-server prompt tests

The synthetic matrix used prompt and generation sizes such as:

Workload	Prompt / Generation	What It Approximates
short_chat	256 / 256	Interactive chat
standard_assistant	512 / 512	Normal assistant response
coding_edit	2048 / 512	Coding or document context
rag_long	8192 / 256	RAG prompt ingestion
agent_state	16384 / 256	Agent scratchpad / long tool trace

Additional long-context stress tests used:

32768 / 128
65536 / 128
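
As a rough illustration of how this matrix could be driven, the sketch below builds one llama-bench argument list per workload using the flags listed under Software Setup. The model path, the two long-context labels, and the use of -pg as the source of the "Mixed" column are assumptions, not details confirmed by the test logs.

```python
# Sketch: build llama-bench argument lists for the prompt/generation pairs above.
# "model.gguf" is a placeholder path; long_32k / long_64k are hypothetical labels.
WORKLOADS = {
    "short_chat": (256, 256),
    "standard_assistant": (512, 512),
    "coding_edit": (2048, 512),
    "rag_long": (8192, 256),
    "agent_state": (16384, 256),
    "long_32k": (32768, 128),
    "long_64k": (65536, 128),
}


def bench_command(model_path, prompt_tokens, gen_tokens, binary="llama-bench"):
    """Return an argv list for one prompt/generation pair."""
    return [
        binary,
        "-m", model_path,
        "-ngl", "99",            # offload all layers to the GPU
        "-fa", "1",              # flash attention on
        "-ctk", "f16",           # KV cache key type
        "-ctv", "f16",           # KV cache value type
        "-p", str(prompt_tokens),                 # prefill-only test (PP)
        "-n", str(gen_tokens),                    # decode-only test (TG)
        "-pg", f"{prompt_tokens},{gen_tokens}",   # combined test (assumed "Mixed")
    ]


commands = {
    name: bench_command("model.gguf", p, n) for name, (p, n) in WORKLOADS.items()
}
print(" ".join(commands["rag_long"]))
```

Each list can be passed to subprocess.run as-is once the binary and model paths point at real files.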

Real-world prompt tests used actual llama-server /completion calls with:

temperature: 0
generated tokens: 192
ignore_eos: true
KV cache: f16

The real prompt categories were:

article_synthesis_real_8k
code_review_real_16k
agent_trace_real_32k

The labels describe the workload category. The 8k, 16k, and 32k suffixes are nominal, not measured lengths; the measured prompt token counts are shown later.
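
A minimal sketch of one such request is shown below. The field names follow the llama-server /completion API as I understand it, and the URL and prompt file are assumptions. The f16 KV cache is not part of the request; it is set when the server is launched (-ctk f16 -ctv f16).

```python
import json
import urllib.request


def completion_payload(prompt, n_predict=192):
    """Request body matching the settings listed above."""
    return {
        "prompt": prompt,
        "n_predict": n_predict,   # generated tokens: 192
        "temperature": 0,
        "ignore_eos": True,       # keep generating until n_predict is reached
    }


def run_completion(prompt, url="http://127.0.0.1:8080/completion"):
    """POST one request and return the parsed JSON response.

    The URL is an assumed local default; adjust host and port to your server.
    """
    body = json.dumps(completion_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Example, requires a running llama-server and a hypothetical prompt file:
# result = run_completion(open("code_review_real_16k.txt").read())
# print(result.get("timings"))
```

With ignore_eos enabled, every request produces the same number of generated tokens, which keeps decode timings comparable across prompts.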

Headline Result

For Qwen3.6 35B Q4/MXFP4, the RTX 4090 is dramatically faster than GB10.

Machine	Model	Best Prefill	Best Decode	Best Mixed
GB10	Qwen3.6 MXFP4	2458.4 tok/s	64.6 tok/s	1584.7 tok/s
GB10	Qwen3.6 Q4	2291.7 tok/s	65.1 tok/s	1533.3 tok/s
RTX 4090	Qwen3.6 MXFP4	7508.0 tok/s	168.0 tok/s	4879.8 tok/s
RTX 4090	Qwen3.6 Q4	7429.1 tok/s	168.7 tok/s	4900.9 tok/s

In simple terms:

Model	RTX 4090 Decode	GB10 Decode
Qwen3.6 Q4/MXFP4	~159–169 tok/s	~64–65 tok/s

When Qwen3.6 35B fits cleanly inside 24GB of VRAM, the RTX 4090 wins decisively.

Qwen3.6 MXFP4 Results

Machine	Workload	PP tok/s	TG tok/s	Mixed tok/s
GB10	short_chat	2023.3	64.4	126.4
GB10	standard_assistant	1996.9	63.9	125.4
GB10	coding_edit	2458.1	63.5	293.2
GB10	rag_long	2458.4	64.1	1183.7
GB10	agent_state	2439.4	64.6	1584.7
RTX 4090	short_chat	5150.1	168.0	330.0
RTX 4090	standard_assistant	5171.4	167.7	328.0
RTX 4090	coding_edit	7508.0	164.0	783.1
RTX 4090	rag_long	7204.3	164.8	3487.3
RTX 4090	agent_state	7434.5	159.2	4879.8

RTX 4090 speedup over GB10:

Workload	Decode Speedup	Prefill Speedup	Mixed Speedup
short_chat	2.61x	2.55x	2.61x
standard_assistant	2.63x	2.59x	2.62x
coding_edit	2.58x	3.05x	2.67x
rag_long	2.57x	2.93x	2.95x
agent_state	2.47x	3.05x	3.08x

Qwen3.6 Q4 Results

Machine	Workload	PP tok/s	TG tok/s	Mixed tok/s
GB10	short_chat	1836.1	63.9	125.5
GB10	standard_assistant	1828.6	63.9	125.2
GB10	coding_edit	2288.1	65.1	292.2
GB10	rag_long	2291.7	63.9	1155.1
GB10	agent_state	2284.2	65.0	1533.3
RTX 4090	short_chat	5138.8	160.5	333.6
RTX 4090	standard_assistant	5158.7	160.2	328.0
RTX 4090	coding_edit	7394.2	161.2	792.1
RTX 4090	rag_long	7247.0	168.7	3524.3
RTX 4090	agent_state	7429.1	163.8	4900.9

RTX 4090 speedup over GB10:

Workload	Decode Speedup	Prefill Speedup	Mixed Speedup
short_chat	2.51x	2.80x	2.66x
standard_assistant	2.51x	2.82x	2.62x
coding_edit	2.47x	3.23x	2.71x
rag_long	2.64x	3.16x	3.05x
agent_state	2.52x	3.25x	3.20x

This is the clean-fit 4090 story:

Qwen3.6 35B Q4 fits. The RTX 4090 is roughly 2.5x to 3.25x faster.
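
The speedup columns are plain ratios of the per-machine throughput figures. The sketch below recomputes two Q4 rows from the rounded table values; because the published tables may have been derived from unrounded measurements, the last digit can differ slightly on other rows.

```python
# (PP, TG, Mixed) in tok/s, copied from the Qwen3.6 Q4 results table.
GB10 = {
    "rag_long": (2291.7, 63.9, 1155.1),
    "agent_state": (2284.2, 65.0, 1533.3),
}
RTX4090 = {
    "rag_long": (7247.0, 168.7, 3524.3),
    "agent_state": (7429.1, 163.8, 4900.9),
}


def speedups(fast, slow):
    """Element-wise ratio fast / slow for (prefill, decode, mixed)."""
    return tuple(f / s for f, s in zip(fast, slow))


for workload in GB10:
    prefill, decode, mixed = speedups(RTX4090[workload], GB10[workload])
    print(
        f"{workload}: prefill {prefill:.2f}x, "
        f"decode {decode:.2f}x, mixed {mixed:.2f}x"
    )
```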

Long-Context Stress: 32K and 64K

One important question was whether the RTX 4090 would fall over once context length increased.

Surprisingly, Qwen3.6 Q4 and MXFP4 still ran successfully on the RTX 4090 at 64K prompt length with f16 KV cache.

That matters.

The RTX 4090 is not only fast at short context. For Qwen3.6 Q4/MXFP4, it stayed fast even at 64K.

Machine	Model	Prompt / Gen	PP tok/s	TG tok/s	Mixed tok/s
GB10	Qwen3.6 MXFP4	32768 / 128	2391.5	63.5	2184.4
RTX 4090	Qwen3.6 MXFP4	32768 / 128	7504.6	158.7	6969.9
GB10	Qwen3.6 MXFP4	65536 / 128	2465.5	64.6	2124.0
RTX 4090	Qwen3.6 MXFP4	65536 / 128	7439.4	159.3	6680.4
GB10	Qwen3.6 Q4	32768 / 128	2320.4	62.1	2074.0
RTX 4090	Qwen3.6 Q4	32768 / 128	7276.9	164.2	6975.2
GB10	Qwen3.6 Q4	65536 / 128	2182.2	64.4	2031.0
RTX 4090	Qwen3.6 Q4	65536 / 128	7399.7	161.9	6656.8

Speedup at long context:

Model	Workload	Prefill Speedup	Decode Speedup	Mixed Speedup
Qwen3.6 Q4	32K	3.14x	2.64x	3.36x
Qwen3.6 Q4	64K	3.39x	2.51x	3.28x
Qwen3.6 MXFP4	32K	3.14x	2.50x	3.19x
Qwen3.6 MXFP4	64K	3.02x	2.46x	3.15x

The key nuance:

The RTX 4090 can run Qwen3.6 35B Q4/MXFP4 even at 64K context in this setup, but it is close to the VRAM limit.

During a Qwen3.6 Q4 64K power and memory observation on the RTX 4090:

Metric	RTX 4090
Elapsed time	44.6 sec
Prefill	7451.9 tok/s
Decode	168.9 tok/s
Mixed	6651.0 tok/s
Avg active power	340.7 W
Max power	453.8 W
Max observed GPU memory	23,909 MiB
Mixed tok/s/watt	19.52
Decode tok/s/watt	0.496

That is about 97% of the card’s 24,563 MiB, almost the entire 24GB.
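
For readers who want to reproduce the derived rows, the efficiency figures are throughput divided by average active power, and the memory fraction is observed usage over total VRAM. A small sketch using only the numbers reported above:

```python
# Values copied from the RTX 4090 Qwen3.6 Q4 64K observation table.
AVG_POWER_W = 340.7
MIXED_TPS = 6651.0
DECODE_TPS = 168.9
USED_MIB = 23909
TOTAL_MIB = 24563   # total VRAM from the Test Systems table


def per_watt(tokens_per_second, watts):
    """Throughput per watt of average active power."""
    return tokens_per_second / watts


mixed_eff = per_watt(MIXED_TPS, AVG_POWER_W)
decode_eff = per_watt(DECODE_TPS, AVG_POWER_W)
used_fraction = USED_MIB / TOTAL_MIB

print(f"mixed tok/s per watt:  {mixed_eff:.2f}")
print(f"decode tok/s per watt: {decode_eff:.3f}")
print(f"VRAM in use: {used_fraction:.1%} ({TOTAL_MIB - USED_MIB} MiB free)")
```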

So the correct conclusion is not:

The 4090 runs out of memory on long context.

The better conclusion is:

Qwen3.6 Q4/MXFP4 are excellent RTX 4090 fits, even at 64K context, but they are near the edge of the 24GB envelope.

Real-World Prompt Tests

Synthetic benchmarks are useful, but I also wanted to see real prompt behavior through llama-server.

The real prompt suite used three practical workloads:

Prompt	Measured Prompt Tokens	Task Type
article_synthesis_real_8k	4,831	Article / research-note synthesis
code_review_real_16k	9,659	Backend code review
agent_trace_real_32k	10,383	Long agent trace recovery

Each request generated 192 tokens.

Qwen3.6 Q4 Real-Prompt Results

Machine	Prompt	Prompt Tokens	Elapsed	Prompt tok/s	Decode tok/s
GB10	Article synthesis	4,831	5.1s	2566.5	63.1
RTX 4090	Article synthesis	4,831	2.2s	5245.6	151.7
GB10	Code review	9,659	7.1s	2523.6	60.8
RTX 4090	Code review	9,659	2.6s	8457.2	147.5
GB10	Agent trace	10,383	7.5s	2557.6	60.3
RTX 4090	Agent trace	10,383	2.7s	8492.8	150.8

RTX 4090 speedup over GB10 on Q4:

Prompt	Prompt-Ingest Speedup	Decode Speedup	Wall-Clock Speedup
Article synthesis	2.04x	2.41x	2.28x
Code review	3.35x	2.43x	2.80x
Agent trace	3.32x	2.50x	2.82x

This confirms the synthetic benchmarks in a more realistic serving path.

Qwen3.6 Q4 is a very strong RTX 4090 daily-driver model.

The RTX 4090 ingested real long prompts at up to roughly 8.5K tok/s and decoded around 148–152 tok/s.

GB10 handled the same real prompts at around 2.5K–2.6K tok/s prefill and 60–63 tok/s decode.

The VRAM Cliff: Qwen3.6 Q4 vs Q5

This was the most important test.

Qwen3.6 Q4 is about 20.61 GiB.

Qwen3.6 Q5 is about 24.64 GiB.

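That sounds like a small difference, but the Q5 weights alone (about 25,231 MiB) exceed the card’s 24,563 MiB of VRAM before any KV cache or runtime buffers are allocated. For a 24GB RTX 4090, it is the difference between clean-fit performance and VRAM-cliff behavior.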
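
To make the gap concrete, the sketch below subtracts each quant's approximate file size from the 4090's total VRAM. It deliberately ignores KV cache and runtime buffers, so a positive number is an upper bound on headroom, not a guarantee of a clean fit.

```python
VRAM_MIB = 24563  # RTX 4090 total VRAM from the Test Systems table

# Approximate model sizes in GiB, copied from the Models Tested table.
MODELS_GIB = {
    "MXFP4": 20.22,
    "UD-Q4_K_M": 20.61,
    "UD-Q5_K_M": 24.64,
    "Q8_0": 34.37,
    "BF16": 64.62,
}


def headroom_mib(model_gib, vram_mib=VRAM_MIB):
    """VRAM left after loading weights only (no KV cache, no buffers)."""
    return vram_mib - model_gib * 1024


for name, size_gib in MODELS_GIB.items():
    left = headroom_mib(size_gib)
    status = "weights fit" if left > 0 else "weights alone exceed VRAM"
    print(f"{name:10s} {left:10.0f} MiB  {status}")
```

By this arithmetic Q4 leaves roughly 3.4 GiB for everything else, while Q5 is already several hundred MiB over before the first token is processed.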

Qwen3.6 Q5 Real-Prompt Results

Machine	Prompt	Prompt Tokens	Elapsed	Prompt tok/s	Decode tok/s
GB10	Article synthesis	4,831	5.5s	2139.9	58.9
RTX 4090	Article synthesis	4,831	56.9s	120.1	11.6
GB10	Code review	9,659	7.5s	2393.4	57.0
RTX 4090	Code review	9,659	97.9s	114.6	14.2
GB10	Agent trace	10,383	7.8s	2434.6	56.6
RTX 4090	Agent trace	10,383	105.3s	114.0	13.7

This flips the benchmark completely.

At Q4:

The RTX 4090 beats GB10 by roughly 2x to 3.3x.

At Q5:

The RTX 4090 becomes dramatically slower than GB10.

RTX 4090 Q5 vs GB10 Q5:

Prompt	RTX 4090 Prompt Speed Relative to GB10	RTX 4090 Decode Speed Relative to GB10	Wall-Clock Relative
Article synthesis	0.06x	0.20x	0.10x
Code review	0.05x	0.25x	0.08x
Agent trace	0.05x	0.24x	0.07x

In plain English:

On Q5, the RTX 4090 was about 10x to 13.5x slower in wall-clock time than GB10.
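
The elapsed times follow closely from the two throughput columns. As a rough check, the sketch below estimates wall-clock time as prompt tokens divided by prompt speed plus generated tokens divided by decode speed, using the Q5 code review rows. It ignores request overhead, so it slightly underestimates the measured values.

```python
def estimated_seconds(prompt_tokens, prompt_tps, gen_tokens, decode_tps):
    """Prefill time plus decode time, ignoring HTTP and scheduling overhead."""
    return prompt_tokens / prompt_tps + gen_tokens / decode_tps


# Q5 code review: 9,659 prompt tokens, 192 generated tokens.
rtx_4090 = estimated_seconds(9659, 114.6, 192, 14.2)   # measured: 97.9s
gb10 = estimated_seconds(9659, 2393.4, 192, 57.0)      # measured: 7.5s

print(f"RTX 4090 estimate: {rtx_4090:.1f}s")
print(f"GB10 estimate:     {gb10:.1f}s")
print(f"GB10 advantage:    {rtx_4090 / gb10:.1f}x")
```

Most of the 4090's Q5 time is spent in prefill, which is why longer prompts widen the wall-clock gap.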

After the RTX 4090 Q5 prompt suite, nvidia-smi showed:

GPU: NVIDIA GeForce RTX 4090
Memory used: 24,079 MiB
Memory free: 60 MiB
GPU utilization: 99%
Power draw: 101 W
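
A snapshot like this can be captured programmatically. The sketch below is one possible way to do it, not necessarily how the figures above were collected: it asks nvidia-smi for the same four fields in CSV form and parses each line. The parser assumes numeric values in every field.

```python
import subprocess

QUERY = "memory.used,memory.free,utilization.gpu,power.draw"


def parse_gpu_line(line):
    """Parse one 'used, free, util, power' CSV line (no header, no units)."""
    used, free, util, power = (field.strip() for field in line.split(","))
    return {
        "used_mib": int(used),
        "free_mib": int(free),
        "util_pct": int(util),
        "power_w": float(power),
    }


def query_gpus():
    """Run nvidia-smi and return one dict per GPU. Requires an NVIDIA driver."""
    output = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return [parse_gpu_line(line) for line in output.strip().splitlines()]


# Parsing a line shaped like the snapshot above:
sample = parse_gpu_line("24079, 60, 99, 101.00")
print(sample)
```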

The Q5 model technically ran, so it would be inaccurate to say it simply failed.

But it was not a good daily-driver configuration.

It hit the practical VRAM boundary and performance collapsed:

Configuration	Decode Speed
RTX 4090 Q4	~148–152 tok/s
RTX 4090 Q5	~11.6–14.2 tok/s
GB10 Q5	~56.6–58.9 tok/s

That is the clearest result in the entire comparison.

Same model family. Same machines. Different quantization level.

Q4 fits the RTX 4090 cleanly, so the 4090 wins.

Q5 crosses the practical VRAM boundary, so GB10 wins.

That is the local AI VRAM cliff.

Dense 70B: Where GB10’s Memory Matters

I also tested Llama-3.3-70B-Instruct Q4_K_M on GB10.

That model is about 42.5GB on disk. It is not a clean full-GPU fit for a 24GB RTX 4090 before even considering KV cache and runtime overhead.

GB10 Llama 3.3 70B Q4 results:

Workload	PP tok/s	TG tok/s	Mixed tok/s
short_chat	365.0	4.6	9.1
standard_assistant	356.9	4.7	9.1
coding_edit	363.1	4.7	21.9
rag_long	363.3	4.7	105.5
agent_state	365.8	4.7	151.3

This demonstrates the memory advantage, but it also exposes another practical truth:

Just because a machine can run a 70B model does not mean that model is the best daily-driver choice.

GB10 can run dense 70B Q4, but decode was only around 4.6–4.7 tok/s.

That is usable for experiments, batch jobs, and quality comparisons, but it is not a fast interactive experience.

On GB10, Qwen3.6 35B Q4 was far more practical:

Model on GB10	Decode Speed
Qwen3.6 Q4	~64–65 tok/s
Llama 3.3 70B Q4	~4.6–4.7 tok/s

GB10’s memory is valuable, but bigger is not automatically better for daily use.

Practical Interpretation

This comparison breaks into three regimes.

Regime 1: The model fits cleanly in 24GB VRAM

Winner: RTX 4090

This is Qwen3.6 Q4/MXFP4 territory.

The RTX 4090 is dramatically faster:

~2.5x faster decode
~3x faster prompt ingestion
~2.3x–2.8x faster wall-clock real prompt serving

For chat, coding, RAG, and agent workloads built around Qwen3.6 Q4/MXFP4, the RTX 4090 is the better performance machine.

Regime 2: The model technically starts but hits the VRAM cliff

Winner: GB10

This is Qwen3.6 Q5 territory.

The RTX 4090 technically served Q5, but performance collapsed:

Q4 on RTX 4090:
~150 tok/s decode

Q5 on RTX 4090:
~12–14 tok/s decode

GB10 stayed usable:

Q5 on GB10:
~57 tok/s decode

This is the strongest buyer-relevant finding.

The RTX 4090 is not just slightly worse when it crosses the memory boundary. It can go from excellent to poor very quickly.

Regime 3: The model is clearly beyond 24GB VRAM

Winner: GB10 for fit, not necessarily speed

This is dense 70B Q4, Qwen3.6 Q8, and BF16 territory.

The RTX 4090 cannot keep these workloads fully inside 24GB VRAM.

GB10 can run them, but speed depends heavily on the model architecture and quantization.

Dense 70B Q4 on GB10 was only around 4.7 tok/s decode, so that result is more about capability than daily-driver performance.
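
The three regimes can be expressed as a simple rule of thumb. The sketch below is illustrative only: the 2,800 MiB reserve is a rough figure inferred from the Q4 64K observation (23,909 MiB used minus about 21,105 MiB of weights), and the 10% overflow cutoff between regimes 2 and 3 is an arbitrary assumption, not a measured threshold.

```python
def classify(weights_mib, vram_mib=24563, reserve_mib=2800, cliff_ratio=1.10):
    """Rough regime label for a model on a given card.

    reserve_mib and cliff_ratio are illustrative assumptions.
    """
    if weights_mib + reserve_mib <= vram_mib:
        return "regime 1: clean fit"
    if weights_mib <= vram_mib * cliff_ratio:
        return "regime 2: at or just past the VRAM limit"
    return "regime 3: clearly beyond VRAM"


GIB = 1024  # MiB per GiB
examples = {
    "Qwen3.6 UD-Q4_K_M": 20.61 * GIB,
    "Qwen3.6 UD-Q5_K_M": 24.64 * GIB,
    "Qwen3.6 Q8_0": 34.37 * GIB,
    "Llama 3.3 70B Q4_K_M": 42.5e9 / 2**20,  # 42.5GB read as decimal gigabytes
}

for name, weights in examples.items():
    print(f"{name:22s} {classify(weights)}")
```

The useful part is the first test: a model only counts as a clean fit if weights plus a realistic reserve for KV cache and buffers stay under total VRAM.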

Recommended Model Choices

Best Qwen3.6 quant for RTX 4090

Qwen3.6-35B-A3B Q4 or MXFP4

These fit cleanly and perform extremely well.

Qwen3.6 Q4 on RTX 4090:

~148–168 tok/s decode depending on test
~7K–8.5K tok/s long-prompt ingestion

This is a strong daily-driver local model setup.

Avoid Qwen3.6 Q5 as a 4090 daily driver

Q5 technically ran, but the result was poor:

~114–120 tok/s prompt ingestion
~11.6–14.2 tok/s decode
24,079 MiB VRAM used
60 MiB VRAM free

That is the VRAM cliff.

For an RTX 4090, Q5 is not worth it unless the specific goal is to test offload behavior.

Best Qwen3.6 quant for GB10

GB10 can run both Q4 and Q5 cleanly.

Q4 is faster:

~60–65 tok/s decode
~2.5K tok/s real prompt ingestion

Q5 is slower but still usable:

~56–59 tok/s decode
~2.1K–2.4K tok/s real prompt ingestion

On GB10, the choice is more about quality-versus-speed tradeoff. Unlike the 4090, Q5 does not trigger a catastrophic performance cliff.

Dense 70B on GB10

GB10 can run dense 70B, but expect slow decode:

~4.6–4.7 tok/s

Useful for:

Local 70B experiments
Batch jobs
Quality comparisons
Memory-bound testing

Not ideal for:

Fast interactive chat
Rapid coding loops
Latency-sensitive agents

Buyer Recommendations

Choose RTX 4090 if:

Your target models fit cleanly in 24GB VRAM
You care about maximum tokens per second
You want fast chat, coding, RAG, and agent loops
You are happy with Q4/MXFP4 quantization
You want strong performance per dollar

For Qwen3.6 35B Q4/MXFP4, the RTX 4090 is the clear winner.

Choose GB10 if:

You need memory headroom more than peak throughput
You want to run larger quants like Q5 or Q8, or unquantized BF16
You want to test dense 70B-class models locally
You care about avoiding VRAM-cliff behavior
You run memory-heavy experiments, long contexts, or multiple local AI components

GB10 is not faster on Qwen3.6 Q4, but it is much more forgiving when the workload gets bigger.

The simplest rule:

If it fits cleanly in 24GB VRAM, the RTX 4090 is faster. If it does not fit cleanly, GB10 becomes much more attractive.

The sharper version from this test:

Qwen3.6 Q4 is a 4090 model. Qwen3.6 Q5 is a GB10 model.

Final Comparison Table

Scenario	GB10	RTX 4090	Winner
Qwen3.6 Q4 short chat decode	~63.9 tok/s	~160.5 tok/s	RTX 4090
Qwen3.6 Q4 code review real prompt prefill	2523.6 tok/s	8457.2 tok/s	RTX 4090
Qwen3.6 Q4 code review real prompt decode	60.8 tok/s	147.5 tok/s	RTX 4090
Qwen3.6 Q4 64K mixed throughput	2031.0 tok/s	6656.8 tok/s	RTX 4090
Qwen3.6 Q5 article real prompt prefill	2139.9 tok/s	120.1 tok/s	GB10
Qwen3.6 Q5 article real prompt decode	58.9 tok/s	11.6 tok/s	GB10
Qwen3.6 Q5 code review real prompt decode	57.0 tok/s	14.2 tok/s	GB10
Qwen3.6 Q5 memory state	Lots of headroom	24,079 MiB used / 60 MiB free	GB10
Llama 3.3 70B Q4	Runs, ~4.7 tok/s decode	Not a clean 24GB fit	GB10 for fit

Caveats

This is a practical local inference comparison, not an official benchmark submission.

Important caveats:

The two machines used different llama.cpp builds. GB10 used a local source build. RTX 4090 used a Windows CUDA prebuilt binary.
The benchmark focuses on throughput and fit, not model quality. This does not prove Q4, Q5, MXFP4, or 70B are better or worse in reasoning quality. It measures practical serving behavior.
Qwen3.6 Q5 did not fail on the RTX 4090. It technically ran, but performance collapsed and VRAM was effectively exhausted. The correct framing is not “impossible.” The correct framing is “not a good clean-fit daily-driver configuration.”
Real prompt labels are approximate workload names. The measured prompt tokens were about 4.8K, 9.7K, and 10.4K. They are realistic long-prompt tasks, not exact 8K, 16K, and 32K prompts.
Dense 70B was not treated as a fair full-GPU RTX 4090 benchmark. A 42.5GB Q4 model is beyond the 4090’s 24GB VRAM. Partial offload is a different workload class.

Conclusion

The RTX 4090 is the better local AI machine when the model fits cleanly in 24GB of VRAM.

For Qwen3.6-35B-A3B Q4 and MXFP4, it is not close. The 4090 delivered roughly 2.5x faster decode, around 3x faster long-prompt ingestion, and more than 2x faster real-world wall-clock prompt serving than GB10.

It also handled Qwen3.6 Q4/MXFP4 at 64K context in this setup, which is an important result. The 4090 is not just good for short prompts. With the right quantization, it is a very strong long-context local inference box.

But the 4090’s advantage depends on staying inside the VRAM envelope.

Qwen3.6 Q5 exposed the cliff. The model technically ran on the 4090, but with only about 60 MiB of VRAM free, performance collapsed to around 12–14 tok/s decode. GB10 ran the same Q5 workload at around 57 tok/s decode with plenty of memory headroom.

That is the core lesson:

The RTX 4090 is fastest when it fits. GB10 is better when fit becomes the problem.

For daily local AI on a 4090, Qwen3.6 Q4 or MXFP4 is the practical choice.

For larger quants, dense 70B experiments, Q8/BF16 runs, and memory-heavy workflows, GB10 becomes much more compelling.

The best local AI machine is not the one with the biggest memory or the fastest GPU in isolation.

It is the one whose memory envelope matches the model you actually want to run.