
RTX 4090 vs NVIDIA GB10 for Local AI: Speed When It Fits, Headroom When It Doesn’t

A practical RTX 4090 vs NVIDIA GB10 local AI benchmark comparing Qwen3.6 35B Q4, Q5, MXFP4, long-context workloads, and dense 70B inference. The results show where the RTX 4090 dominates, where GB10’s memory headroom matters, and why the VRAM cliff is one of the most important limits in local AI hardware.

Elena Voss · Updated May 18, 2026

Local AI hardware comparisons usually collapse into one number: tokens per second.

That number matters, but it no longer tells the whole story.

Modern local AI workloads are not just simple 7B chatbot demos. They now include long-context coding assistants, RAG pipelines, autonomous agents, document extraction systems, OCR-plus-LLM workflows, and 30B-to-70B class models.

For those workloads, the better question is not simply:

Which GPU is faster?

The better question is:

Which machine gives the better local AI experience for the models, quantization levels, and context sizes you actually want to run?

To test that, I compared two very different local AI systems:

| System | Accelerator | Memory Profile |
| --- | --- | --- |
| RTX 4090 workstation | NVIDIA GeForce RTX 4090 | 24GB VRAM |
| NVIDIA GB10 system | NVIDIA GB10 | ~122GB (124,546 MiB) CUDA-visible memory reported by llama.cpp |

The short version is simple:

The RTX 4090 is brutally fast when the model fits cleanly in 24GB of VRAM.

GB10 is slower on clean-fit workloads, but becomes far more useful once model size, quantization level, or context length pushes beyond the 4090’s memory comfort zone.

That became obvious in the most important result of the test: Qwen3.6 35B Q4 vs Q5.

At Q4, the RTX 4090 crushed GB10.

At Q5, the RTX 4090 technically ran the model, but performance collapsed because it hit the practical VRAM wall. GB10 stayed stable and usable.

That is the real story of local AI hardware in 2026:

The RTX 4090 is fastest when it fits. GB10 is better when fit becomes the problem.


Test Systems

| System | GPU / Accelerator | Memory | Notes |
| --- | --- | --- | --- |
| GB10 | NVIDIA GB10 | ~124,546 MiB CUDA-visible memory | ARM/aarch64, CUDA 13.0 |
| RTX 4090 workstation | NVIDIA GeForce RTX 4090 | 24,563 MiB VRAM | Windows 10 Pro, i9-14900K, ~128GB system RAM |

The important distinction is that the RTX 4090 workstation has plenty of system RAM, but only 24GB of GPU VRAM.

For local LLM inference, that boundary matters.

If the model, KV cache, and runtime buffers fit cleanly inside VRAM, the 4090 is extremely fast. If they do not, performance can collapse quickly.

GB10 does not match the 4090’s raw throughput on smaller clean-fit workloads, but it has far more accelerator-visible memory. That makes it much more forgiving for larger models, heavier quantization levels, long context tests, and memory-heavy experiments.


Software Setup

Both systems used llama.cpp.

GB10

llama.cpp commit: cce09f0b2b37
CUDA: 13.0
Main flags:
-ngl 99
-fa 1 / -fa on
-ctk f16
-ctv f16

RTX 4090

llama.cpp Windows CUDA prebuilt binary
CUDA backend active
Main flags:
-ngl 99
-fa 1 / -fa on
-ctk f16
-ctv f16

The 4090 used a Windows CUDA prebuilt binary, while GB10 used a source build. This should be viewed as a practical workstation comparison, not a formal microarchitecture benchmark.

Still, the differences were large enough that the practical conclusions are clear.


Models Tested

The main model family tested was:

Qwen3.6-35B-A3B

Quantized variants:

| Model | Approx. Size | Practical Meaning |
| --- | --- | --- |
| Qwen3.6 MXFP4 | ~20.22 GiB | Fast, 4090-friendly |
| Qwen3.6 UD-Q4_K_M | ~20.61 GiB | Practical 4090 sweet spot |
| Qwen3.6 UD-Q5_K_M | ~24.64 GiB | Crosses the 4090 comfort zone |
| Qwen3.6 Q8_0 | ~34.37 GiB | GB10/headroom territory |
| Qwen3.6 BF16 shards | ~64.62 GiB total | GB10/headroom territory |

I also tested a larger memory-wall model:

Llama-3.3-70B-Instruct Q4_K_M
Approx. size: 42.5GB

That dense 70B model is not a clean full-GPU fit for a 24GB RTX 4090. It can be attempted with partial offload, but that is not the same workload class as a model that fits cleanly in VRAM.


Benchmark Workloads

I used three categories of testing:

  1. Synthetic llama-bench throughput tests
  2. Long-context / KV-cache pressure tests
  3. Real llama-server prompt tests

The synthetic matrix used prompt and generation sizes such as:

| Workload | Prompt / Generation | What It Approximates |
| --- | --- | --- |
| short_chat | 256 / 256 | Interactive chat |
| standard_assistant | 512 / 512 | Normal assistant response |
| coding_edit | 2048 / 512 | Coding or document context |
| rag_long | 8192 / 256 | RAG prompt ingestion |
| agent_state | 16384 / 256 | Agent scratchpad / long tool trace |

Additional long-context stress tests used:

32768 / 128
65536 / 128
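
The workload matrix maps naturally onto llama-bench invocations. The sketch below builds one argument list per workload. It assumes the mixed numbers come from llama-bench's `-pg` (prompt plus generation) mode, which the write-up does not state. The two `stress_*` labels, the binary name, and the model path are placeholders, and the commands are only constructed here, not executed.

```python
# Workload matrix: name -> (prompt tokens, generated tokens).
WORKLOADS = {
    "short_chat": (256, 256),
    "standard_assistant": (512, 512),
    "coding_edit": (2048, 512),
    "rag_long": (8192, 256),
    "agent_state": (16384, 256),
    "stress_32k": (32768, 128),  # hypothetical label for the 32K stress test
    "stress_64k": (65536, 128),  # hypothetical label for the 64K stress test
}


def bench_args(model_path, prompt_tokens, gen_tokens, binary="llama-bench"):
    """Build a llama-bench argv using the flags listed under Software Setup."""
    return [
        binary,
        "-m", model_path,
        "-ngl", "99",
        "-fa", "1",
        "-ctk", "f16",
        "-ctv", "f16",
        "-p", str(prompt_tokens),                  # prefill-only test
        "-n", str(gen_tokens),                     # decode-only test
        "-pg", f"{prompt_tokens},{gen_tokens}",    # mixed prompt+generation test
    ]


def all_commands(model_path):
    """Return {workload name: argv list} for every workload in the matrix."""
    return {
        name: bench_args(model_path, pp, tg)
        for name, (pp, tg) in WORKLOADS.items()
    }


# "models/qwen3.6-q4.gguf" is a placeholder path.
for name, argv in all_commands("models/qwen3.6-q4.gguf").items():
    print(name, " ".join(argv))
```

Each list can be handed to `subprocess.run` on a machine where llama-bench is on the path.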

Real-world prompt tests used actual llama-server /completion calls with:

temperature: 0
generated tokens: 192
ignore_eos: true
KV cache: f16

The real prompt categories were:

article_synthesis_real_8k
code_review_real_16k
agent_trace_real_32k

The labels describe the workload category. The measured prompt token counts are shown later.
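
A minimal sketch of one such request follows. The URL and port are assumptions, and the `timings` field names are the ones llama-server returned at the time of writing, so check them against the build in use. The f16 KV cache is a server start-up setting (`-ctk f16 -ctv f16`), not a per-request field. Whether the reported elapsed times were measured client-side or taken from `timings` is not stated, so `elapsed_s` here is only one plausible definition.

```python
import json
import urllib.request


def build_payload(prompt_text):
    """Request body matching the sampling settings listed above."""
    return {
        "prompt": prompt_text,
        "n_predict": 192,     # generated tokens
        "temperature": 0,
        "ignore_eos": True,   # keep generating until n_predict is reached
    }


def summarize(response):
    """Pull throughput figures out of the server's `timings` object."""
    t = response["timings"]
    return {
        "prompt_tokens": t["prompt_n"],
        "prompt_tok_s": t["prompt_per_second"],
        "decode_tok_s": t["predicted_per_second"],
        "elapsed_s": (t["prompt_ms"] + t["predicted_ms"]) / 1000.0,
    }


def run(prompt_text, url="http://127.0.0.1:8080/completion"):
    """POST one prompt to a running llama-server (URL is an assumption)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt_text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return summarize(json.load(resp))
```

Calling `run(open("prompt.txt").read())` against a live server would return one row of the real-prompt tables; `prompt.txt` is a placeholder file name.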


Headline Result

For Qwen3.6 35B Q4/MXFP4, the RTX 4090 is dramatically faster than GB10.

| Machine | Model | Best Prefill | Best Decode | Best Mixed |
| --- | --- | --- | --- | --- |
| GB10 | Qwen3.6 MXFP4 | 2458.4 tok/s | 64.6 tok/s | 1584.7 tok/s |
| GB10 | Qwen3.6 Q4 | 2291.7 tok/s | 65.1 tok/s | 1533.3 tok/s |
| RTX 4090 | Qwen3.6 MXFP4 | 7508.0 tok/s | 168.0 tok/s | 4879.8 tok/s |
| RTX 4090 | Qwen3.6 Q4 | 7429.1 tok/s | 168.7 tok/s | 4900.9 tok/s |

In simple terms:

| Model | RTX 4090 Decode | GB10 Decode |
| --- | --- | --- |
| Qwen3.6 Q4/MXFP4 | ~160–168 tok/s | ~64–65 tok/s |

When Qwen3.6 35B fits cleanly inside 24GB of VRAM, the RTX 4090 wins decisively.


Qwen3.6 MXFP4 Results

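| Machine | Workload | PP tok/s | TG tok/s | Mixed tok/s |
| --- | --- | --- | --- | --- |
| GB10 | short_chat | 2023.3 | 64.4 | 126.4 |
| GB10 | standard_assistant | 1996.9 | 63.9 | 125.4 |
| GB10 | coding_edit | 2458.1 | 63.5 | 293.2 |
| GB10 | rag_long | 2458.4 | 64.1 | 1183.7 |
| GB10 | agent_state | 2439.4 | 64.6 | 1584.7 |
| RTX 4090 | short_chat | 5150.1 | 168.0 | 330.0 |
| RTX 4090 | standard_assistant | 5171.4 | 167.7 | 328.0 |
| RTX 4090 | coding_edit | 7508.0 | 164.0 | 783.1 |
| RTX 4090 | rag_long | 7204.3 | 164.8 | 3487.3 |
| RTX 4090 | agent_state | 7434.5 | 159.2 | 4879.8 |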

RTX 4090 speedup over GB10:

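| Workload | Decode Speedup | Prefill Speedup | Mixed Speedup |
| --- | --- | --- | --- |
| short_chat | 2.61x | 2.55x | 2.61x |
| standard_assistant | 2.63x | 2.59x | 2.62x |
| coding_edit | 2.58x | 3.05x | 2.67x |
| rag_long | 2.57x | 2.93x | 2.95x |
| agent_state | 2.47x | 3.05x | 3.08x |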

Qwen3.6 Q4 Results

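| Machine | Workload | PP tok/s | TG tok/s | Mixed tok/s |
| --- | --- | --- | --- | --- |
| GB10 | short_chat | 1836.1 | 63.9 | 125.5 |
| GB10 | standard_assistant | 1828.6 | 63.9 | 125.2 |
| GB10 | coding_edit | 2288.1 | 65.1 | 292.2 |
| GB10 | rag_long | 2291.7 | 63.9 | 1155.1 |
| GB10 | agent_state | 2284.2 | 65.0 | 1533.3 |
| RTX 4090 | short_chat | 5138.8 | 160.5 | 333.6 |
| RTX 4090 | standard_assistant | 5158.7 | 160.2 | 328.0 |
| RTX 4090 | coding_edit | 7394.2 | 161.2 | 792.1 |
| RTX 4090 | rag_long | 7247.0 | 168.7 | 3524.3 |
| RTX 4090 | agent_state | 7429.1 | 163.8 | 4900.9 |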

RTX 4090 speedup over GB10:

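| Workload | Decode Speedup | Prefill Speedup | Mixed Speedup |
| --- | --- | --- | --- |
| short_chat | 2.51x | 2.80x | 2.66x |
| standard_assistant | 2.51x | 2.82x | 2.62x |
| coding_edit | 2.47x | 3.23x | 2.71x |
| rag_long | 2.64x | 3.16x | 3.05x |
| agent_state | 2.52x | 3.25x | 3.20x |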

This is the clean-fit 4090 story:

Qwen3.6 35B Q4 fits. The RTX 4090 is 2.5x to 3.25x faster.
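
For readers who want to check the ratios, the sketch below recomputes the Q4 speedups from the per-workload figures. It divides the rounded table values, so a result can differ from the published ratio by 0.01 where the original used unrounded measurements.

```python
# (PP tok/s, TG tok/s, mixed tok/s) for Qwen3.6 Q4, copied from the results table.
GB10 = {
    "short_chat": (1836.1, 63.9, 125.5),
    "standard_assistant": (1828.6, 63.9, 125.2),
    "coding_edit": (2288.1, 65.1, 292.2),
    "rag_long": (2291.7, 63.9, 1155.1),
    "agent_state": (2284.2, 65.0, 1533.3),
}
RTX4090 = {
    "short_chat": (5138.8, 160.5, 333.6),
    "standard_assistant": (5158.7, 160.2, 328.0),
    "coding_edit": (7394.2, 161.2, 792.1),
    "rag_long": (7247.0, 168.7, 3524.3),
    "agent_state": (7429.1, 163.8, 4900.9),
}


def speedups(fast, slow):
    """Return {workload: (prefill, decode, mixed) speedup of `fast` over `slow`}."""
    return {
        workload: tuple(round(f / s, 2) for f, s in zip(fast[workload], slow[workload]))
        for workload in fast
    }


for workload, (pp, tg, mixed) in speedups(RTX4090, GB10).items():
    print(f"{workload:20s} prefill {pp:.2f}x  decode {tg:.2f}x  mixed {mixed:.2f}x")
```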


Long-Context Stress: 32K and 64K

One important question was whether the RTX 4090 would fall over once context length increased.

Surprisingly, Qwen3.6 Q4 and MXFP4 still ran successfully on the RTX 4090 at 64K prompt length with f16 KV cache.

That matters.

The RTX 4090 is not only fast at short context. For Qwen3.6 Q4/MXFP4, it stayed fast even at 64K.

| Machine | Model | Prompt / Gen | PP tok/s | TG tok/s | Mixed tok/s |
| --- | --- | --- | --- | --- | --- |
| GB10 | Qwen3.6 MXFP4 | 32768 / 128 | 2391.5 | 63.5 | 2184.4 |
| RTX 4090 | Qwen3.6 MXFP4 | 32768 / 128 | 7504.6 | 158.7 | 6969.9 |
| GB10 | Qwen3.6 MXFP4 | 65536 / 128 | 2465.5 | 64.6 | 2124.0 |
| RTX 4090 | Qwen3.6 MXFP4 | 65536 / 128 | 7439.4 | 159.3 | 6680.4 |
| GB10 | Qwen3.6 Q4 | 32768 / 128 | 2320.4 | 62.1 | 2074.0 |
| RTX 4090 | Qwen3.6 Q4 | 32768 / 128 | 7276.9 | 164.2 | 6975.2 |
| GB10 | Qwen3.6 Q4 | 65536 / 128 | 2182.2 | 64.4 | 2031.0 |
| RTX 4090 | Qwen3.6 Q4 | 65536 / 128 | 7399.7 | 161.9 | 6656.8 |

Speedup at long context:

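| Model | Workload | Prefill Speedup | Decode Speedup | Mixed Speedup |
| --- | --- | --- | --- | --- |
| Qwen3.6 Q4 | 32K | 3.14x | 2.64x | 3.36x |
| Qwen3.6 Q4 | 64K | 3.39x | 2.51x | 3.28x |
| Qwen3.6 MXFP4 | 32K | 3.14x | 2.50x | 3.19x |
| Qwen3.6 MXFP4 | 64K | 3.02x | 2.46x | 3.15x |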

The key nuance:

The RTX 4090 can run Qwen3.6 35B Q4/MXFP4 even at 64K context in this setup, but it is close to the VRAM limit.

A separate Qwen3.6 Q4 64K run on the RTX 4090, monitored for power and memory, recorded:

| Metric | RTX 4090 |
| --- | --- |
| Elapsed time | 44.6 sec |
| Prefill | 7451.9 tok/s |
| Decode | 168.9 tok/s |
| Mixed | 6651.0 tok/s |
| Avg active power | 340.7 W |
| Max power | 453.8 W |
| Max observed GPU memory | 23,909 MiB |
| Mixed tok/s/watt | 19.52 |
| Decode tok/s/watt | 0.496 |

That is almost the entire 24GB card: 23,909 MiB of the 24,563 MiB the card reports, leaving roughly 650 MiB free.
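
The efficiency figures are plain divisions of throughput by average active power. The sketch below reproduces them and also shows the inverse view, joules per token. Applying the run-wide average power to the decode phase alone is an approximation carried over from the table, not a separate measurement.

```python
# Values from the Qwen3.6 Q4 64K observation on the RTX 4090.
avg_power_w = 340.7
mixed_tok_s = 6651.0
decode_tok_s = 168.9

# Throughput per watt, as reported in the table.
mixed_per_watt = mixed_tok_s / avg_power_w
decode_per_watt = decode_tok_s / avg_power_w

# Inverse view: energy per token. One watt is one joule per second,
# so watts divided by tokens per second gives joules per token.
joules_per_mixed_token = avg_power_w / mixed_tok_s
joules_per_decoded_token = avg_power_w / decode_tok_s

print(f"mixed:  {mixed_per_watt:.2f} tok/s/W, {joules_per_mixed_token:.3f} J/token")
print(f"decode: {decode_per_watt:.3f} tok/s/W, {joules_per_decoded_token:.2f} J/token")
```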

So the correct conclusion is not:

The 4090 runs out of memory on long context.

The better conclusion is:

Qwen3.6 Q4/MXFP4 are excellent RTX 4090 fits, even at 64K context, but they are near the edge of the 24GB envelope.


Real-World Prompt Tests

Synthetic benchmarks are useful, but I also wanted to see real prompt behavior through llama-server.

The real prompt suite used three practical workloads:

| Prompt | Measured Prompt Tokens | Task Type |
| --- | --- | --- |
| article_synthesis_real_8k | 4,831 | Article / research-note synthesis |
| code_review_real_16k | 9,659 | Backend code review |
| agent_trace_real_32k | 10,383 | Long agent trace recovery |

Each request generated 192 tokens.

Qwen3.6 Q4 Real-Prompt Results

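| Machine | Prompt | Prompt Tokens | Elapsed | Prompt tok/s | Decode tok/s |
| --- | --- | --- | --- | --- | --- |
| GB10 | Article synthesis | 4,831 | 5.1s | 2566.5 | 63.1 |
| RTX 4090 | Article synthesis | 4,831 | 2.2s | 5245.6 | 151.7 |
| GB10 | Code review | 9,659 | 7.1s | 2523.6 | 60.8 |
| RTX 4090 | Code review | 9,659 | 2.6s | 8457.2 | 147.5 |
| GB10 | Agent trace | 10,383 | 7.5s | 2557.6 | 60.3 |
| RTX 4090 | Agent trace | 10,383 | 2.7s | 8492.8 | 150.8 |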

RTX 4090 speedup over GB10 on Q4:

| Prompt | Prompt-Ingest Speedup | Decode Speedup | Wall-Clock Speedup |
| --- | --- | --- | --- |
| Article synthesis | 2.04x | 2.41x | 2.28x |
| Code review | 3.35x | 2.43x | 2.80x |
| Agent trace | 3.32x | 2.50x | 2.82x |

This confirms the synthetic benchmarks in a more realistic serving path.

Qwen3.6 Q4 is a very strong RTX 4090 daily-driver model.

The RTX 4090 ingested real long prompts at up to roughly 8.5K tok/s and decoded around 148–152 tok/s.

GB10 handled the same real prompts at around 2.5K–2.6K tok/s prefill and 60–63 tok/s decode.


The VRAM Cliff: Qwen3.6 Q4 vs Q5

This was the most important test.

Qwen3.6 Q4 is about 20.61 GiB.

Qwen3.6 Q5 is about 24.64 GiB.

That sounds like a small difference, but for a 24GB RTX 4090, it is the difference between clean-fit performance and VRAM-cliff behavior.
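
The arithmetic behind that statement is short. The sketch below treats each GGUF file size as the on-GPU weight footprint, which is an approximation, and subtracts it from the 24,563 MiB the card reports. Whatever remains has to hold the KV cache and runtime buffers.

```python
MIB_PER_GIB = 1024

# RTX 4090 VRAM from the Test Systems table, converted to GiB (about 23.99).
vram_gib = 24563 / MIB_PER_GIB

# Approximate file sizes in GiB from the Models Tested table.
weights_gib = {"UD-Q4_K_M": 20.61, "UD-Q5_K_M": 24.64}


def headroom_gib(weights, vram=vram_gib):
    """VRAM left after loading the weights; a negative value means spill."""
    return vram - weights


for quant, size in weights_gib.items():
    print(f"{quant}: {headroom_gib(size):+.2f} GiB left for KV cache and buffers")
```

Under that assumption Q4 leaves a little over 3 GiB, while the Q5 weights alone exceed the card before any KV cache is allocated.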

Qwen3.6 Q5 Real-Prompt Results

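| Machine | Prompt | Prompt Tokens | Elapsed | Prompt tok/s | Decode tok/s |
| --- | --- | --- | --- | --- | --- |
| GB10 | Article synthesis | 4,831 | 5.5s | 2139.9 | 58.9 |
| RTX 4090 | Article synthesis | 4,831 | 56.9s | 120.1 | 11.6 |
| GB10 | Code review | 9,659 | 7.5s | 2393.4 | 57.0 |
| RTX 4090 | Code review | 9,659 | 97.9s | 114.6 | 14.2 |
| GB10 | Agent trace | 10,383 | 7.8s | 2434.6 | 56.6 |
| RTX 4090 | Agent trace | 10,383 | 105.3s | 114.0 | 13.7 |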

This flips the benchmark completely.

At Q4:

The RTX 4090 beats GB10 by roughly 2x to 3.3x.

At Q5:

The RTX 4090 becomes dramatically slower than GB10.

RTX 4090 Q5 vs GB10 Q5:

| Prompt | RTX 4090 Prompt Speed Relative to GB10 | RTX 4090 Decode Speed Relative to GB10 | Wall-Clock Relative |
| --- | --- | --- | --- |
| Article synthesis | 0.06x | 0.20x | 0.10x |
| Code review | 0.05x | 0.25x | 0.08x |
| Agent trace | 0.05x | 0.24x | 0.07x |

In plain English:

On Q5, the RTX 4090 was about 10x to 13.5x slower in wall-clock time than GB10, going by the elapsed times above.
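
The elapsed times follow almost directly from the two throughput columns. A minimal model, assuming wall-clock time is just prompt ingestion plus generation of 192 tokens with no other overhead, lands close to the measured article-synthesis figures:

```python
def estimated_elapsed(prompt_tokens, prefill_tok_s, decode_tok_s, gen_tokens=192):
    """Seconds to ingest the prompt plus seconds to generate the reply."""
    return prompt_tokens / prefill_tok_s + gen_tokens / decode_tok_s


# Qwen3.6 Q5, article synthesis prompt (4,831 tokens), rates from the table.
gb10_s = estimated_elapsed(4831, 2139.9, 58.9)
rtx4090_s = estimated_elapsed(4831, 120.1, 11.6)

print(f"GB10 estimate:     {gb10_s:.1f} s (measured 5.5 s)")
print(f"RTX 4090 estimate: {rtx4090_s:.1f} s (measured 56.9 s)")
print(f"Slowdown:          {rtx4090_s / gb10_s:.1f}x")
```

On the 4090, most of the estimated time is prompt ingestion (about 40 of the 57 seconds), so the collapse in prefill speed costs more wall-clock time than the collapse in decode speed.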

After the RTX 4090 Q5 prompt suite, nvidia-smi showed:

GPU: NVIDIA GeForce RTX 4090
Memory used: 24,079 MiB
Memory free: 60 MiB
GPU utilization: 99%
Power draw: 101 W
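
A snapshot like this can be collected programmatically. The sketch below shells out to nvidia-smi with its CSV query mode and parses one line per GPU. It assumes every queried field comes back numeric; a system that reports a field as not available (for example a unified-memory device) would need extra handling.

```python
import subprocess

# Fields to query, in the order parse_line() expects them.
QUERY = "memory.used,memory.free,utilization.gpu,power.draw"


def parse_line(line):
    """Parse one CSV line of nvidia-smi output (no header, no units)."""
    used, free, util, power = (float(field) for field in line.split(","))
    return {"used_mib": used, "free_mib": free, "util_pct": util, "power_w": power}


def snapshot():
    """Return one dict per GPU; requires nvidia-smi on the PATH."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return [parse_line(line) for line in out.strip().splitlines()]
```

Polling `snapshot()` once a second during a run is one way to capture peak memory and average power figures like the ones reported here.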

The Q5 model technically ran, so it would be inaccurate to say it simply failed.

But it was not a good daily-driver configuration.

It hit the practical VRAM boundary and performance collapsed:

| Configuration | Decode Speed |
| --- | --- |
| RTX 4090 Q4 | ~148–152 tok/s |
| RTX 4090 Q5 | ~11.6–14.2 tok/s |
| GB10 Q5 | ~56.6–58.9 tok/s |

That is the clearest result in the entire comparison.

Same model family. Same machines. Different quantization level.

Q4 fits the RTX 4090 cleanly, so the 4090 wins.

Q5 crosses the practical VRAM boundary, so GB10 wins.

That is the local AI VRAM cliff.


Dense 70B: Where GB10’s Memory Matters

I also tested Llama-3.3-70B-Instruct Q4_K_M on GB10.

That model is about 42.5GB on disk. It is not a clean full-GPU fit for a 24GB RTX 4090 before even considering KV cache and runtime overhead.

GB10 Llama 3.3 70B Q4 results:

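| Workload | PP tok/s | TG tok/s | Mixed tok/s |
| --- | --- | --- | --- |
| short_chat | 365.0 | 4.6 | 9.1 |
| standard_assistant | 356.9 | 4.7 | 9.1 |
| coding_edit | 363.1 | 4.7 | 21.9 |
| rag_long | 363.3 | 4.7 | 105.5 |
| agent_state | 365.8 | 4.7 | 151.3 |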

This confirms that GB10 has the memory to hold a dense 70B Q4 model, but it also exposes another practical truth:

Just because a machine can run a 70B model does not mean that model is the best daily-driver choice.

GB10 can run dense 70B Q4, but decode was only around 4.6–4.7 tok/s.

That is usable for experiments, batch jobs, and quality comparisons, but it is not a fast interactive experience.

On GB10, Qwen3.6 35B Q4 was far more practical:

| Model on GB10 | Decode Speed |
| --- | --- |
| Qwen3.6 Q4 | ~64–65 tok/s |
| Llama 3.3 70B Q4 | ~4.6–4.7 tok/s |

GB10’s memory is valuable, but bigger is not automatically better for daily use.
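
To put those decode rates in interactive terms, the sketch below converts them into the time needed to stream one reply. The 512-token reply length is an illustrative choice, matching the standard_assistant generation size, and prompt ingestion is ignored.

```python
ANSWER_TOKENS = 512  # illustrative reply length (standard_assistant generation size)


def seconds_to_generate(tokens, decode_tok_s):
    """Time to stream `tokens` at a steady decode rate, ignoring prefill."""
    return tokens / decode_tok_s


# Decode rates measured on GB10.
llama_70b_s = seconds_to_generate(ANSWER_TOKENS, 4.7)
qwen_q4_s = seconds_to_generate(ANSWER_TOKENS, 65.0)

print(f"Llama 3.3 70B Q4: {llama_70b_s:.0f} s per {ANSWER_TOKENS}-token reply")
print(f"Qwen3.6 Q4:       {qwen_q4_s:.1f} s per {ANSWER_TOKENS}-token reply")
```

That works out to nearly two minutes per reply for the dense 70B model against roughly eight seconds for Qwen3.6 Q4.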


Practical Interpretation

This comparison breaks into three regimes.

Regime 1: The model fits cleanly in 24GB VRAM

Winner: RTX 4090

This is Qwen3.6 Q4/MXFP4 territory.

The RTX 4090 is dramatically faster:

~2.5x faster decode
~3x faster prompt ingestion
~2.3x–2.8x faster wall-clock real prompt serving

For chat, coding, RAG, and agent workloads built around Qwen3.6 Q4/MXFP4, the RTX 4090 is the better performance machine.


Regime 2: The model technically starts but hits the VRAM cliff

Winner: GB10

This is Qwen3.6 Q5 territory.

The RTX 4090 technically served Q5, but performance collapsed:

Q4 on RTX 4090:
~150 tok/s decode

Q5 on RTX 4090:
~12–14 tok/s decode

GB10 stayed usable:

Q5 on GB10:
~57 tok/s decode

This is the strongest buyer-relevant finding.

The RTX 4090 is not just slightly worse when it crosses the memory boundary. It can go from excellent to poor very quickly.


Regime 3: The model is clearly beyond 24GB VRAM

Winner: GB10 for fit, not necessarily speed

This is dense 70B Q4, Qwen3.6 Q8, and BF16 territory.

The RTX 4090 cannot keep these workloads fully inside 24GB VRAM.

GB10 can run them, but speed depends heavily on the model architecture and quantization.

Dense 70B Q4 on GB10 was only around 4.7 tok/s decode, so that result is more about capability than daily-driver performance.
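
The three regimes can be expressed as a rough classifier on weight size. Both thresholds below are assumptions chosen only to reproduce the outcomes seen in this test: `reserve_gib` is loosely based on the Q4 64K run peaking about 2.7 GiB above its weight size, and `spill_limit` is a hypothetical cutoff between "starts but spills" and "clearly beyond". Neither is a measured boundary.

```python
RTX4090_VRAM_GIB = 24563 / 1024  # about 23.99 GiB, from the Test Systems table


def regime(weights_gib, vram_gib=RTX4090_VRAM_GIB, reserve_gib=3.0, spill_limit=1.1):
    """Classify a model into regime 1, 2, or 3 for a given VRAM size.

    reserve_gib: assumed room needed for KV cache and buffers (hypothetical).
    spill_limit: assumed ratio to VRAM below which a model still starts
                 but spills (hypothetical).
    """
    if weights_gib + reserve_gib <= vram_gib:
        return 1  # fits cleanly
    if weights_gib <= vram_gib * spill_limit:
        return 2  # technically starts, hits the VRAM cliff
    return 3      # clearly beyond VRAM


MODELS_GIB = {
    "Qwen3.6 MXFP4": 20.22,
    "Qwen3.6 UD-Q4_K_M": 20.61,
    "Qwen3.6 UD-Q5_K_M": 24.64,
    "Qwen3.6 Q8_0": 34.37,
    "Qwen3.6 BF16": 64.62,
    "Llama 3.3 70B Q4_K_M": 42.5e9 / 2**30,  # 42.5GB on disk is about 39.6 GiB
}

for name, size in MODELS_GIB.items():
    print(f"{name:24s} regime {regime(size)}")
```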


Recommended Model Choices

Best Qwen3.6 quant for RTX 4090

Qwen3.6-35B-A3B Q4 or MXFP4

These fit cleanly and perform extremely well.

Qwen3.6 Q4 on RTX 4090:

~148–168 tok/s decode depending on test
~7K–8.5K tok/s long-prompt ingestion

This is a strong daily-driver local model setup.

Avoid Qwen3.6 Q5 as a 4090 daily driver

Q5 technically ran, but the result was poor:

~114–120 tok/s prompt ingestion
~11.6–14.2 tok/s decode
24,079 MiB VRAM used
60 MiB VRAM free

That is the VRAM cliff.

For an RTX 4090, Q5 is not worth it unless the specific goal is to test offload behavior.

Best Qwen3.6 quant for GB10

GB10 can run both Q4 and Q5 cleanly.

Q4 is faster:

~60–65 tok/s decode
~2.5K tok/s real prompt ingestion

Q5 is slower but still usable:

~56–59 tok/s decode
~2.1K–2.4K tok/s real prompt ingestion

On GB10, the choice is more about a quality-versus-speed tradeoff. Unlike on the 4090, Q5 does not trigger a catastrophic performance cliff.

Dense 70B on GB10

GB10 can run dense 70B, but expect slow decode:

~4.6–4.7 tok/s

Useful for:

  • Local 70B experiments
  • Batch jobs
  • Quality comparisons
  • Memory-bound testing

Not ideal for:

  • Fast interactive chat
  • Rapid coding loops
  • Latency-sensitive agents

Buyer Recommendations

Choose RTX 4090 if:

  • Your target models fit cleanly in 24GB VRAM
  • You care about maximum tokens per second
  • You want fast chat, coding, RAG, and agent loops
  • You are happy with Q4/MXFP4 quantization
  • You want strong performance per dollar

For Qwen3.6 35B Q4/MXFP4, the RTX 4090 is the clear winner.

Choose GB10 if:

  • You need memory headroom more than peak throughput
  • You want to run larger quants like Q5, Q8, or BF16
  • You want to test dense 70B-class models locally
  • You care about avoiding VRAM-cliff behavior
  • You run memory-heavy experiments, long contexts, or multiple local AI components

GB10 is not faster on Qwen3.6 Q4, but it is much more forgiving when the workload gets bigger.

The simplest rule:

If it fits cleanly in 24GB VRAM, the RTX 4090 is faster. If it does not fit cleanly, GB10 becomes much more attractive.

The sharper version from this test:

Qwen3.6 Q4 is a 4090 model. Qwen3.6 Q5 is a GB10 model.


Final Comparison Table

| Scenario | GB10 | RTX 4090 | Winner |
| --- | --- | --- | --- |
| Qwen3.6 Q4 short chat decode | ~63.9 tok/s | ~160.5 tok/s | RTX 4090 |
| Qwen3.6 Q4 code review real prompt prefill | 2523.6 tok/s | 8457.2 tok/s | RTX 4090 |
| Qwen3.6 Q4 code review real prompt decode | 60.8 tok/s | 147.5 tok/s | RTX 4090 |
| Qwen3.6 Q4 64K mixed throughput | 2031.0 tok/s | 6656.8 tok/s | RTX 4090 |
| Qwen3.6 Q5 article real prompt prefill | 2139.9 tok/s | 120.1 tok/s | GB10 |
| Qwen3.6 Q5 article real prompt decode | 58.9 tok/s | 11.6 tok/s | GB10 |
| Qwen3.6 Q5 code review real prompt decode | 57.0 tok/s | 14.2 tok/s | GB10 |
| Qwen3.6 Q5 memory state | Lots of headroom | 24,079 MiB used / 60 MiB free | GB10 |
| Llama 3.3 70B Q4 | Runs, ~4.7 tok/s | Not a clean 24GB fit | GB10 for fit |

Caveats

This is a practical local inference comparison, not an official benchmark submission.

Important caveats:

  1. The two machines used different llama.cpp builds. GB10 used a local source build. RTX 4090 used a Windows CUDA prebuilt binary.

  2. The benchmark focuses on throughput and fit, not model quality. This does not prove Q4, Q5, MXFP4, or 70B are better or worse in reasoning quality. It measures practical serving behavior.

  3. Qwen3.6 Q5 did not fail on the RTX 4090. It technically ran, but performance collapsed and VRAM was effectively exhausted. The correct framing is not “impossible.” The correct framing is “not a good clean-fit daily-driver configuration.”

  4. Real prompt labels are approximate workload names. The measured prompt tokens were about 4.8K, 9.7K, and 10.4K. They are realistic long-prompt tasks, not exact 8K, 16K, and 32K prompts.

  5. Dense 70B was not treated as a fair full-GPU RTX 4090 benchmark. A 42.5GB Q4 model is beyond the 4090’s 24GB VRAM. Partial offload is a different workload class.


Conclusion

The RTX 4090 is the better local AI machine when the model fits cleanly in 24GB of VRAM.

For Qwen3.6-35B-A3B Q4 and MXFP4, it is not close. The 4090 delivered roughly 2.5x faster decode, around 3x faster long-prompt ingestion, and more than 2x faster real-world wall-clock prompt serving than GB10.

It also handled Qwen3.6 Q4/MXFP4 at 64K context in this setup, which is an important result. The 4090 is not just good for short prompts. With the right quantization, it is a very strong long-context local inference box.

But the 4090’s advantage depends on staying inside the VRAM envelope.

Qwen3.6 Q5 exposed the cliff. The model technically ran on the 4090, but with only about 60 MiB of VRAM free, performance collapsed to around 12–14 tok/s decode. GB10 ran the same Q5 workload at around 57 tok/s decode with plenty of memory headroom.

That is the core lesson:

The RTX 4090 is fastest when it fits. GB10 is better when fit becomes the problem.

For daily local AI on a 4090, Qwen3.6 Q4 or MXFP4 is the practical choice.

For larger quants, dense 70B experiments, Q8/BF16 runs, and memory-heavy workflows, GB10 becomes much more compelling.

The best local AI machine is not the one with the biggest memory or the fastest GPU in isolation.

It is the one whose memory envelope matches the model you actually want to run.
