RTX 4090 vs NVIDIA GB10 for Local AI: Speed When It Fits, Headroom When It Doesn’t
A practical RTX 4090 vs NVIDIA GB10 local AI benchmark comparing Qwen3.6 35B Q4, Q5, MXFP4, long-context workloads, and dense 70B inference. The results show where the RTX 4090 dominates, where GB10’s memory headroom matters, and why the VRAM cliff is one of the most important limits in local AI hardware.

Local AI hardware comparisons usually collapse into one number: tokens per second.
That number matters, but it no longer tells the whole story.
Modern local AI workloads are not just simple 7B chatbot demos. They now include long-context coding assistants, RAG pipelines, autonomous agents, document extraction systems, OCR-plus-LLM workflows, and 30B-to-70B class models.
For those workloads, the better question is not simply:
Which GPU is faster?
The better question is:
Which machine gives the better local AI experience for the models, quantization levels, and context sizes you actually want to run?
To test that, I compared two very different local AI systems:
| System | Accelerator | Memory Profile |
|---|---|---|
| RTX 4090 workstation | NVIDIA GeForce RTX 4090 | 24GB VRAM |
| NVIDIA GB10 system | NVIDIA GB10 | ~121.6 GiB (124,546 MiB) CUDA-visible memory reported by llama.cpp |
The short version is simple:
The RTX 4090 is brutally fast when the model fits cleanly in 24GB of VRAM.
GB10 is slower on clean-fit workloads, but becomes far more useful once model size, quantization level, or context length pushes beyond the 4090’s memory comfort zone.
That became obvious in the most important result of the test: Qwen3.6 35B Q4 vs Q5.
At Q4, the RTX 4090 crushed GB10.
At Q5, the RTX 4090 technically ran the model, but performance collapsed because it hit the practical VRAM wall. GB10 stayed stable and usable.
That is the real story of local AI hardware in 2026:
The RTX 4090 is fastest when it fits. GB10 is better when fit becomes the problem.
Test Systems
| System | GPU / Accelerator | Memory | Notes |
|---|---|---|---|
| GB10 | NVIDIA GB10 | ~124,546 MiB CUDA-visible memory | ARM/aarch64, CUDA 13.0 |
| RTX 4090 workstation | NVIDIA GeForce RTX 4090 | 24,563 MiB VRAM | Windows 10 Pro, i9-14900K, ~128GB system RAM |
The important distinction is that the RTX 4090 workstation has plenty of system RAM, but only 24GB of GPU VRAM.
For local LLM inference, that boundary matters.
If the model, KV cache, and runtime buffers fit cleanly inside VRAM, the 4090 is extremely fast. If they do not, performance can collapse quickly.
GB10 does not match the 4090’s raw throughput on smaller clean-fit workloads, but it has far more accelerator-visible memory. That makes it much more forgiving for larger models, higher-precision quants, long-context tests, and memory-heavy experiments.
Software Setup
Both systems used llama.cpp.
GB10
llama.cpp commit: cce09f0b2b37
CUDA: 13.0
Main flags:
-ngl 99
-fa 1 / -fa on
-ctk f16
-ctv f16
RTX 4090
llama.cpp Windows CUDA prebuilt binary
CUDA backend active
Main flags:
-ngl 99
-fa 1 / -fa on
-ctk f16
-ctv f16
The 4090 used a Windows CUDA prebuilt binary, while GB10 used a source build. This should be viewed as a practical workstation comparison, not a formal microarchitecture benchmark.
Still, the differences were large enough that the practical conclusions are clear.
Models Tested
The main model family tested was:
Qwen3.6-35B-A3B
Quantized variants:
| Model | Approx. Size | Practical Meaning |
|---|---|---|
| Qwen3.6 MXFP4 | ~20.22 GiB | Fast, 4090-friendly |
| Qwen3.6 UD-Q4_K_M | ~20.61 GiB | Practical 4090 sweet spot |
| Qwen3.6 UD-Q5_K_M | ~24.64 GiB | Crosses the 4090 comfort zone |
| Qwen3.6 Q8_0 | ~34.37 GiB | GB10/headroom territory |
| Qwen3.6 BF16 shards | ~64.62 GiB total | GB10/headroom territory |
I also tested a larger memory-wall model:
Llama-3.3-70B-Instruct Q4_K_M
Approx. size: ~42.5GB
That dense 70B model is not a clean full-GPU fit for a 24GB RTX 4090. It can be attempted with partial offload, but that is not the same workload class as a model that fits cleanly in VRAM.
Benchmark Workloads
I used three categories of testing:
- Synthetic llama-bench throughput tests
- Long-context / KV-cache pressure tests
- Real llama-server prompt tests
The synthetic matrix used prompt and generation sizes such as:
| Workload | Prompt / Generation | What It Approximates |
|---|---|---|
| short_chat | 256 / 256 | Interactive chat |
| standard_assistant | 512 / 512 | Normal assistant response |
| coding_edit | 2048 / 512 | Coding or document context |
| rag_long | 8192 / 256 | RAG prompt ingestion |
| agent_state | 16384 / 256 | Agent scratchpad / long tool trace |
Additional long-context stress tests used:
32768 / 128
65536 / 128
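The exact command lines are not listed here, so the following is only a minimal sketch of how this matrix could be turned into llama-bench invocations. It assumes llama-bench's `-p` (prefill), `-n` (decode), and `-pg` (combined prompt and generation) options alongside the flags from the Software Setup section. The model path and the `stress_32k` / `stress_64k` names are hypothetical.

```python
import shlex

# Flags taken from the Software Setup section.
COMMON_FLAGS = ["-ngl", "99", "-fa", "1", "-ctk", "f16", "-ctv", "f16"]

# (name, prompt tokens, generated tokens) from the workload table and stress tests.
# The two stress_* names are placeholders, not labels used in the article.
WORKLOADS = [
    ("short_chat", 256, 256),
    ("standard_assistant", 512, 512),
    ("coding_edit", 2048, 512),
    ("rag_long", 8192, 256),
    ("agent_state", 16384, 256),
    ("stress_32k", 32768, 128),
    ("stress_64k", 65536, 128),
]


def bench_command(model_path, prompt_tokens, gen_tokens, binary="llama-bench"):
    """Build an argv list that requests prefill, decode, and mixed tests."""
    return [
        binary, "-m", model_path,
        "-p", str(prompt_tokens),
        "-n", str(gen_tokens),
        "-pg", f"{prompt_tokens},{gen_tokens}",
        *COMMON_FLAGS,
    ]


if __name__ == "__main__":
    model = "models/qwen3.6-35b-a3b-q4_k_m.gguf"  # hypothetical path
    for name, pp, tg in WORKLOADS:
        print(f"# {name}")
        print(shlex.join(bench_command(model, pp, tg)))
```

Printing the commands rather than running them keeps the sketch safe to execute on a machine without llama.cpp installed.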
Real-world prompt tests used actual llama-server /completion calls with:
temperature: 0
generated tokens: 192
ignore_eos: true
KV cache: f16
The real prompt categories were:
article_synthesis_real_8k
code_review_real_16k
agent_trace_real_32k
The labels describe the workload category. The measured prompt token counts are shown later.
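For readers who want to reproduce this kind of measurement, here is a hedged sketch of a single request against llama-server's /completion endpoint using the settings listed above. The server address is an assumption, and the `timings` field names (`prompt_n`, `prompt_per_second`, `predicted_per_second`) reflect my understanding of the server's response, so verify them against your build.

```python
import json
import urllib.request


def build_payload(prompt, n_predict=192):
    """Request body mirroring the settings listed above."""
    return {
        "prompt": prompt,
        "n_predict": n_predict,  # generated tokens: 192
        "temperature": 0,
        "ignore_eos": True,      # keep generating up to n_predict
    }


def extract_rates(response):
    """Pull prompt size and throughput from the response's timings object."""
    timings = response["timings"]
    return {
        "prompt_tokens": timings["prompt_n"],
        "prompt_tok_s": timings["prompt_per_second"],
        "decode_tok_s": timings["predicted_per_second"],
    }


def run_prompt(prompt, base_url="http://127.0.0.1:8080"):  # assumed address
    """Send one completion request and return the measured rates."""
    request = urllib.request.Request(
        f"{base_url}/completion",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return extract_rates(json.load(resp))
```

Calling `run_prompt(open("prompt.txt").read())` with a hypothetical prompt file would return the prompt token count plus prefill and decode rates for that request.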
Headline Result
For Qwen3.6 35B Q4/MXFP4, the RTX 4090 is dramatically faster than GB10.
| Machine | Model | Best Prefill | Best Decode | Best Mixed |
|---|---|---|---|---|
| GB10 | Qwen3.6 MXFP4 | 2458.4 tok/s | 64.6 tok/s | 1584.7 tok/s |
| GB10 | Qwen3.6 Q4 | 2291.7 tok/s | 65.1 tok/s | 1533.3 tok/s |
| RTX 4090 | Qwen3.6 MXFP4 | 7508.0 tok/s | 168.0 tok/s | 4879.8 tok/s |
| RTX 4090 | Qwen3.6 Q4 | 7429.1 tok/s | 168.7 tok/s | 4900.9 tok/s |
In simple terms:
| Model | RTX 4090 Decode | GB10 Decode |
|---|---|---|
| Qwen3.6 Q4/MXFP4 | ~159–169 tok/s | ~63–65 tok/s |
When Qwen3.6 35B fits cleanly inside 24GB of VRAM, the RTX 4090 wins decisively.
Qwen3.6 MXFP4 Results
| Machine | Workload | PP tok/s | TG tok/s | Mixed tok/s |
|---|---|---|---|---|
| GB10 | short_chat | 2023.3 | 64.4 | 126.4 |
| GB10 | standard_assistant | 1996.9 | 63.9 | 125.4 |
| GB10 | coding_edit | 2458.1 | 63.5 | 293.2 |
| GB10 | rag_long | 2458.4 | 64.1 | 1183.7 |
| GB10 | agent_state | 2439.4 | 64.6 | 1584.7 |
| RTX 4090 | short_chat | 5150.1 | 168.0 | 330.0 |
| RTX 4090 | standard_assistant | 5171.4 | 167.7 | 328.0 |
| RTX 4090 | coding_edit | 7508.0 | 164.0 | 783.1 |
| RTX 4090 | rag_long | 7204.3 | 164.8 | 3487.3 |
| RTX 4090 | agent_state | 7434.5 | 159.2 | 4879.8 |
RTX 4090 speedup over GB10:
| Workload | Decode Speedup | Prefill Speedup | Mixed Speedup |
|---|---|---|---|
| short_chat | 2.61x | 2.55x | 2.61x |
| standard_assistant | 2.63x | 2.59x | 2.62x |
| coding_edit | 2.58x | 3.05x | 2.67x |
| rag_long | 2.57x | 2.93x | 2.95x |
| agent_state | 2.47x | 3.05x | 3.08x |
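Each speedup is simply the RTX 4090 figure divided by the GB10 figure for the same workload. The sketch below recomputes two rows from the rounded MXFP4 values in the results table. Ratios derived from rounded numbers can differ from the published ones in the last digit.

```python
# (PP, TG, Mixed) tok/s copied from two rows of the MXFP4 results table.
RESULTS = {
    "short_chat": {
        "GB10": (2023.3, 64.4, 126.4),
        "RTX 4090": (5150.1, 168.0, 330.0),
    },
    "coding_edit": {
        "GB10": (2458.1, 63.5, 293.2),
        "RTX 4090": (7508.0, 164.0, 783.1),
    },
}


def speedups(workload):
    """Return RTX 4090 / GB10 ratios for prefill, decode, and mixed throughput."""
    gb10 = RESULTS[workload]["GB10"]
    rtx = RESULTS[workload]["RTX 4090"]
    prefill, decode, mixed = (round(r / g, 2) for r, g in zip(rtx, gb10))
    return {"prefill": prefill, "decode": decode, "mixed": mixed}


for name in RESULTS:
    print(name, speedups(name))
```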
Qwen3.6 Q4 Results
| Machine | Workload | PP tok/s | TG tok/s | Mixed tok/s |
|---|---|---|---|---|
| GB10 | short_chat | 1836.1 | 63.9 | 125.5 |
| GB10 | standard_assistant | 1828.6 | 63.9 | 125.2 |
| GB10 | coding_edit | 2288.1 | 65.1 | 292.2 |
| GB10 | rag_long | 2291.7 | 63.9 | 1155.1 |
| GB10 | agent_state | 2284.2 | 65.0 | 1533.3 |
| RTX 4090 | short_chat | 5138.8 | 160.5 | 333.6 |
| RTX 4090 | standard_assistant | 5158.7 | 160.2 | 328.0 |
| RTX 4090 | coding_edit | 7394.2 | 161.2 | 792.1 |
| RTX 4090 | rag_long | 7247.0 | 168.7 | 3524.3 |
| RTX 4090 | agent_state | 7429.1 | 163.8 | 4900.9 |
RTX 4090 speedup over GB10:
| Workload | Decode Speedup | Prefill Speedup | Mixed Speedup |
|---|---|---|---|
| short_chat | 2.51x | 2.80x | 2.66x |
| standard_assistant | 2.51x | 2.82x | 2.62x |
| coding_edit | 2.47x | 3.23x | 2.71x |
| rag_long | 2.64x | 3.16x | 3.05x |
| agent_state | 2.52x | 3.25x | 3.20x |
This is the clean-fit 4090 story:
Qwen3.6 35B Q4 fits. The RTX 4090 is 2.5x to 3.25x faster.
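The Mixed column grows sharply with prompt length, and a rough model explains why. Assuming the mixed test reports total tokens over total elapsed time, it can be approximated from the separately measured PP and TG rates. This is a back-of-the-envelope sketch that tracks the measured values only loosely, so treat it as an aid to intuition rather than a reconstruction of the benchmark.

```python
def approx_mixed(prompt_tokens, gen_tokens, pp_tok_s, tg_tok_s):
    """Total tokens divided by estimated prefill time plus decode time."""
    total_time = prompt_tokens / pp_tok_s + gen_tokens / tg_tok_s
    return (prompt_tokens + gen_tokens) / total_time


def decode_time_share(prompt_tokens, gen_tokens, pp_tok_s, tg_tok_s):
    """Fraction of the estimated run time spent generating tokens."""
    decode_time = gen_tokens / tg_tok_s
    return decode_time / (prompt_tokens / pp_tok_s + decode_time)


# GB10, Qwen3.6 Q4 rows: (prompt tokens, generated tokens, PP tok/s, TG tok/s)
rows = {
    "short_chat": (256, 256, 1836.1, 63.9),
    "agent_state": (16384, 256, 2284.2, 65.0),
}

for name, row in rows.items():
    print(name, round(approx_mixed(*row), 1), round(decode_time_share(*row), 2))
```

For short_chat the estimate is about 123 tok/s against a measured 125.5, with nearly all of the time spent decoding. For agent_state it is about 1,498 tok/s against a measured 1,533.3, with only about a third of the time spent decoding. Long-prompt workloads are therefore dominated by prefill speed, which is where the 4090's lead is largest.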
Long-Context Stress: 32K and 64K
One important question was whether the RTX 4090 would fall over once context length increased.
Surprisingly, Qwen3.6 Q4 and MXFP4 still ran successfully on the RTX 4090 at 64K prompt length with f16 KV cache.
That matters.
The RTX 4090 is not only fast at short context. For Qwen3.6 Q4/MXFP4, it stayed fast even at 64K.
| Machine | Model | Prompt / Gen | PP tok/s | TG tok/s | Mixed tok/s |
|---|---|---|---|---|---|
| GB10 | Qwen3.6 MXFP4 | 32768 / 128 | 2391.5 | 63.5 | 2184.4 |
| RTX 4090 | Qwen3.6 MXFP4 | 32768 / 128 | 7504.6 | 158.7 | 6969.9 |
| GB10 | Qwen3.6 MXFP4 | 65536 / 128 | 2465.5 | 64.6 | 2124.0 |
| RTX 4090 | Qwen3.6 MXFP4 | 65536 / 128 | 7439.4 | 159.3 | 6680.4 |
| GB10 | Qwen3.6 Q4 | 32768 / 128 | 2320.4 | 62.1 | 2074.0 |
| RTX 4090 | Qwen3.6 Q4 | 32768 / 128 | 7276.9 | 164.2 | 6975.2 |
| GB10 | Qwen3.6 Q4 | 65536 / 128 | 2182.2 | 64.4 | 2031.0 |
| RTX 4090 | Qwen3.6 Q4 | 65536 / 128 | 7399.7 | 161.9 | 6656.8 |
Speedup at long context:
| Model | Workload | Prefill Speedup | Decode Speedup | Mixed Speedup |
|---|---|---|---|---|
| Qwen3.6 Q4 | 32K | 3.14x | 2.64x | 3.36x |
| Qwen3.6 Q4 | 64K | 3.39x | 2.51x | 3.28x |
| Qwen3.6 MXFP4 | 32K | 3.14x | 2.50x | 3.19x |
| Qwen3.6 MXFP4 | 64K | 3.02x | 2.46x | 3.15x |
The key nuance:
The RTX 4090 can run Qwen3.6 35B Q4/MXFP4 even at 64K context in this setup, but it is close to the VRAM limit.
During a Qwen3.6 Q4 64K power and memory observation on the RTX 4090:
| Metric | RTX 4090 |
|---|---|
| Elapsed time | 44.6 sec |
| Prefill | 7451.9 tok/s |
| Decode | 168.9 tok/s |
| Mixed | 6651.0 tok/s |
| Avg active power | 340.7 W |
| Max power | 453.8 W |
| Max observed GPU memory | 23,909 MiB |
| Mixed tok/s/watt | 19.52 |
| Decode tok/s/watt | 0.496 |
That is almost the entire 24GB card: 23,909 MiB of the 24,563 MiB the 4090 reports, or about 97%.
So the correct conclusion is not:
The 4090 runs out of memory on long context.
The better conclusion is:
Qwen3.6 Q4/MXFP4 are excellent RTX 4090 fits, even at 64K context, but they are near the edge of the 24GB envelope.
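The per-watt figures in the power table are throughput divided by average active power. Since a watt is a joule per second, tok/s per watt is the same thing as tokens per joule. A minimal sketch of that arithmetic, using only the RTX 4090 numbers reported above (no GB10 power figure is given in this comparison):

```python
def tokens_per_watt(tok_per_s, avg_power_w):
    """Throughput per watt, equivalent to tokens per joule."""
    return tok_per_s / avg_power_w


AVG_ACTIVE_POWER_W = 340.7  # RTX 4090, Qwen3.6 Q4, 64K observation

mixed_efficiency = round(tokens_per_watt(6651.0, AVG_ACTIVE_POWER_W), 2)
decode_efficiency = round(tokens_per_watt(168.9, AVG_ACTIVE_POWER_W), 3)

print(mixed_efficiency)   # 19.52
print(decode_efficiency)  # 0.496
```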
Real-World Prompt Tests
Synthetic benchmarks are useful, but I also wanted to see real prompt behavior through llama-server.
The real prompt suite used three practical workloads:
| Prompt | Measured Prompt Tokens | Task Type |
|---|---|---|
| article_synthesis_real_8k | 4,831 | Article / research-note synthesis |
| code_review_real_16k | 9,659 | Backend code review |
| agent_trace_real_32k | 10,383 | Long agent trace recovery |
Each request generated 192 tokens.
Qwen3.6 Q4 Real-Prompt Results
| Machine | Prompt | Prompt Tokens | Elapsed | Prompt tok/s | Decode tok/s |
|---|---|---|---|---|---|
| GB10 | Article synthesis | 4,831 | 5.1s | 2566.5 | 63.1 |
| RTX 4090 | Article synthesis | 4,831 | 2.2s | 5245.6 | 151.7 |
| GB10 | Code review | 9,659 | 7.1s | 2523.6 | 60.8 |
| RTX 4090 | Code review | 9,659 | 2.6s | 8457.2 | 147.5 |
| GB10 | Agent trace | 10,383 | 7.5s | 2557.6 | 60.3 |
| RTX 4090 | Agent trace | 10,383 | 2.7s | 8492.8 | 150.8 |
RTX 4090 speedup over GB10 on Q4:
| Prompt | Prompt-Ingest Speedup | Decode Speedup | Wall-Clock Speedup |
|---|---|---|---|
| Article synthesis | 2.04x | 2.41x | 2.28x |
| Code review | 3.35x | 2.43x | 2.80x |
| Agent trace | 3.32x | 2.50x | 2.82x |
This confirms the synthetic benchmarks in a more realistic serving path.
Qwen3.6 Q4 is a very strong RTX 4090 daily-driver model.
The RTX 4090 ingested real long prompts at up to roughly 8.5K tok/s and decoded around 148–152 tok/s.
GB10 handled the same real prompts at around 2.5K–2.6K tok/s prefill and 60–63 tok/s decode.
The VRAM Cliff: Qwen3.6 Q4 vs Q5
This was the most important test.
Qwen3.6 Q4 is about 20.61 GiB.
Qwen3.6 Q5 is about 24.64 GiB.
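That sounds like a small difference, but the RTX 4090 reports 24,563 MiB of VRAM, which is just under 24 GiB. At roughly 24.64 GiB, the Q5 weights alone exceed that before any KV cache or runtime buffers are allocated, while Q4 leaves a few GiB to spare. For a 24GB RTX 4090, that is the difference between clean-fit performance and VRAM-cliff behavior.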
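The arithmetic behind that boundary can be sketched as follows. It uses the approximate model sizes and the memory figures from the Test Systems table, and it counts weights only. KV cache and runtime buffers come on top, so real headroom is smaller than these numbers suggest.

```python
MIB_PER_GIB = 1024

RTX_4090_VRAM_MIB = 24_563  # from the Test Systems table
GB10_VISIBLE_MIB = 124_546  # from the Test Systems table


def headroom_mib(model_gib, memory_mib):
    """Memory left after the weights alone, before KV cache and buffers."""
    return memory_mib - model_gib * MIB_PER_GIB


for name, size_gib in [("Q4", 20.61), ("Q5", 24.64)]:
    print(
        name,
        round(headroom_mib(size_gib, RTX_4090_VRAM_MIB)),
        round(headroom_mib(size_gib, GB10_VISIBLE_MIB)),
    )
```

On the 4090, Q4 leaves about 3,458 MiB for everything else, while Q5 comes out roughly 668 MiB negative before a single token of context is cached. On GB10 both quants leave close to 100,000 MiB free.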
Qwen3.6 Q5 Real-Prompt Results
| Machine | Prompt | Prompt Tokens | Elapsed | Prompt tok/s | Decode tok/s |
|---|---|---|---|---|---|
| GB10 | Article synthesis | 4,831 | 5.5s | 2139.9 | 58.9 |
| RTX 4090 | Article synthesis | 4,831 | 56.9s | 120.1 | 11.6 |
| GB10 | Code review | 9,659 | 7.5s | 2393.4 | 57.0 |
| RTX 4090 | Code review | 9,659 | 97.9s | 114.6 | 14.2 |
| GB10 | Agent trace | 10,383 | 7.8s | 2434.6 | 56.6 |
| RTX 4090 | Agent trace | 10,383 | 105.3s | 114.0 | 13.7 |
This flips the benchmark completely.
At Q4:
The RTX 4090 beats GB10 by roughly 2x to 3.3x.
At Q5:
The RTX 4090 becomes dramatically slower than GB10.
RTX 4090 Q5 vs GB10 Q5:
| Prompt | RTX 4090 Prompt Speed Relative to GB10 | RTX 4090 Decode Speed Relative to GB10 | Wall-Clock Relative |
|---|---|---|---|
| Article synthesis | 0.06x | 0.20x | 0.10x |
| Code review | 0.05x | 0.25x | 0.08x |
| Agent trace | 0.05x | 0.24x | 0.07x |
In plain English:
On Q5, the RTX 4090 was about 10x to 13.5x slower in wall-clock time than GB10.
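The wall-clock numbers follow almost directly from the two rates in the table: time to ingest the prompt plus time to generate 192 tokens. A small sketch for the article synthesis prompt, ignoring request overhead:

```python
GEN_TOKENS = 192  # every real-prompt request generated 192 tokens


def estimated_elapsed(prompt_tokens, prompt_tok_s, decode_tok_s,
                      gen_tokens=GEN_TOKENS):
    """Prompt ingestion time plus generation time, in seconds."""
    return prompt_tokens / prompt_tok_s + gen_tokens / decode_tok_s


# Article synthesis on Q5: 4,831 prompt tokens, rates from the Q5 table.
rtx = estimated_elapsed(4831, 120.1, 11.6)
gb10 = estimated_elapsed(4831, 2139.9, 58.9)

print(round(rtx, 1), round(gb10, 1), round(rtx / gb10, 1))
```

The estimates come to about 56.8 seconds and 5.5 seconds, against measured times of 56.9 and 5.5. Roughly 40 of the 4090's 57 seconds go to prompt ingestion, so on this prompt the collapse in prefill speed hurts more than the collapse in decode speed.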
After the RTX 4090 Q5 prompt suite, nvidia-smi showed:
GPU: NVIDIA GeForce RTX 4090
Memory used: 24,079 MiB
Memory free: 60 MiB
GPU utilization: 99%
Power draw: 101 W
The Q5 model technically ran, so it would be inaccurate to say it simply failed.
But it was not a good daily-driver configuration.
It hit the practical VRAM boundary and performance collapsed:
| Configuration | Decode Speed |
|---|---|
| RTX 4090 Q4 | ~148–152 tok/s |
| RTX 4090 Q5 | ~11.6–14.2 tok/s |
| GB10 Q5 | ~56.6–58.9 tok/s |
That is the clearest result in the entire comparison.
Same model family. Same machines. Different quantization level.
Q4 fits the RTX 4090 cleanly, so the 4090 wins.
Q5 crosses the practical VRAM boundary, so GB10 wins.
That is the local AI VRAM cliff.
Dense 70B: Where GB10’s Memory Matters
I also tested Llama-3.3-70B-Instruct Q4_K_M on GB10.
That model is about 42.5GB on disk. It is not a clean full-GPU fit for a 24GB RTX 4090 before even considering KV cache and runtime overhead.
GB10 Llama 3.3 70B Q4 results:
| Workload | PP tok/s | TG tok/s | Mixed tok/s |
|---|---|---|---|
| short_chat | 365.0 | 4.6 | 9.1 |
| standard_assistant | 356.9 | 4.7 | 9.1 |
| coding_edit | 363.1 | 4.7 | 21.9 |
| rag_long | 363.3 | 4.7 | 105.5 |
| agent_state | 365.8 | 4.7 | 151.3 |
This demonstrates GB10’s memory advantage, but it also exposes another practical truth:
Just because a machine can run a 70B model does not mean that model is the best daily-driver choice.
GB10 can run dense 70B Q4, but decode was only around 4.6–4.7 tok/s.
That is usable for experiments, batch jobs, and quality comparisons, but it is not a fast interactive experience: at 4.7 tok/s, a 192-token reply takes about 41 seconds to generate.
On GB10, Qwen3.6 35B Q4 was far more practical:
| Model on GB10 | Decode Speed |
|---|---|
| Qwen3.6 Q4 | ~64–65 tok/s |
| Llama 3.3 70B Q4 | ~4.6–4.7 tok/s |
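GB10’s memory is valuable, but bigger is not automatically better for daily use. Part of the gap is architectural: the "A3B" in Qwen3.6-35B-A3B indicates a mixture-of-experts model with roughly 3B parameters active per token, whereas Llama 3.3 70B is dense and uses all of its weights for every generated token.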
Practical Interpretation
This comparison breaks into three regimes.
Regime 1: The model fits cleanly in 24GB VRAM
Winner: RTX 4090
This is Qwen3.6 Q4/MXFP4 territory.
The RTX 4090 is dramatically faster:
~2.5x faster decode
~3x faster prompt ingestion
~2.3x–2.8x faster wall-clock real prompt serving
For chat, coding, RAG, and agent workloads built around Qwen3.6 Q4/MXFP4, the RTX 4090 is the better performance machine.
Regime 2: The model technically starts but hits the VRAM cliff
Winner: GB10
This is Qwen3.6 Q5 territory.
The RTX 4090 technically served Q5, but performance collapsed:
Q4 on RTX 4090:
~150 tok/s decode
Q5 on RTX 4090:
~12–14 tok/s decode
GB10 stayed usable:
Q5 on GB10:
~57 tok/s decode
This is the strongest buyer-relevant finding.
The RTX 4090 is not just slightly worse when it crosses the memory boundary. It can go from excellent to poor very quickly.
Regime 3: The model is clearly beyond 24GB VRAM
Winner: GB10 for fit, not necessarily speed
This is dense 70B Q4, Qwen3.6 Q8, and BF16 territory.
The RTX 4090 cannot keep these workloads fully inside 24GB VRAM.
GB10 can run them, but speed depends heavily on the model architecture and quantization.
Dense 70B Q4 on GB10 was only around 4.7 tok/s decode, so that result is more about capability than daily-driver performance.
Recommended Model Choices
Best Qwen3.6 quant for RTX 4090
Qwen3.6-35B-A3B Q4 or MXFP4
These fit cleanly and perform extremely well.
Qwen3.6 Q4 on RTX 4090:
~148–168 tok/s decode depending on test
~7K–8.5K tok/s long-prompt ingestion
This is a strong daily-driver local model setup.
Avoid Qwen3.6 Q5 as a 4090 daily driver
Q5 technically ran, but the result was poor:
~114–120 tok/s prompt ingestion
~11.6–14.2 tok/s decode
24,079 MiB VRAM used
60 MiB VRAM free
That is the VRAM cliff.
For an RTX 4090, Q5 is not worth it unless the specific goal is to test offload behavior.
Best Qwen3.6 quant for GB10
GB10 can run both Q4 and Q5 cleanly.
Q4 is faster:
~60–65 tok/s decode
~2.5K tok/s real prompt ingestion
Q5 is slower but still usable:
~56–59 tok/s decode
~2.1K–2.4K tok/s real prompt ingestion
On GB10, the choice is a quality-versus-speed tradeoff. Unlike on the 4090, Q5 does not trigger a catastrophic performance cliff.
Dense 70B on GB10
GB10 can run dense 70B, but expect slow decode:
~4.6–4.7 tok/s
Useful for:
- Local 70B experiments
- Batch jobs
- Quality comparisons
- Memory-bound testing
Not ideal for:
- Fast interactive chat
- Rapid coding loops
- Latency-sensitive agents
Buyer Recommendations
Choose RTX 4090 if:
- Your target models fit cleanly in 24GB VRAM
- You care about maximum tokens per second
- You want fast chat, coding, RAG, and agent loops
- You are happy with Q4/MXFP4 quantization
- You want strong performance per dollar
For Qwen3.6 35B Q4/MXFP4, the RTX 4090 is the clear winner.
Choose GB10 if:
- You need memory headroom more than peak throughput
- You want to run larger, higher-precision formats such as Q5, Q8, or BF16
- You want to test dense 70B-class models locally
- You care about avoiding VRAM-cliff behavior
- You run memory-heavy experiments, long contexts, or multiple local AI components
GB10 is not faster on Qwen3.6 Q4, but it is much more forgiving when the workload gets bigger.
The simplest rule:
If it fits cleanly in 24GB VRAM, the RTX 4090 is faster. If it does not fit cleanly, GB10 becomes much more attractive.
The sharper version from this test:
Qwen3.6 Q4 is a 4090 model. Qwen3.6 Q5 is a GB10 model.
Final Comparison Table
| Scenario | GB10 | RTX 4090 | Winner |
|---|---|---|---|
| Qwen3.6 Q4 short chat decode | ~63.9 tok/s | ~160.5 tok/s | RTX 4090 |
| Qwen3.6 Q4 code review real prompt prefill | 2523.6 tok/s | 8457.2 tok/s | RTX 4090 |
| Qwen3.6 Q4 code review real prompt decode | 60.8 tok/s | 147.5 tok/s | RTX 4090 |
| Qwen3.6 Q4 64K mixed throughput | 2031.0 tok/s | 6656.8 tok/s | RTX 4090 |
| Qwen3.6 Q5 article real prompt prefill | 2139.9 tok/s | 120.1 tok/s | GB10 |
| Qwen3.6 Q5 article real prompt decode | 58.9 tok/s | 11.6 tok/s | GB10 |
| Qwen3.6 Q5 code review real prompt decode | 57.0 tok/s | 14.2 tok/s | GB10 |
| Qwen3.6 Q5 memory state | Lots of headroom | 24,079 MiB used / 60 MiB free | GB10 |
| Llama 3.3 70B Q4 | Runs, ~4.7 tok/s | Not a clean 24GB fit | GB10 for fit |
Caveats
This is a practical local inference comparison, not an official benchmark submission.
Important caveats:
- The two machines used different llama.cpp builds. GB10 used a local source build. RTX 4090 used a Windows CUDA prebuilt binary.
- The benchmark focuses on throughput and fit, not model quality. This does not prove Q4, Q5, MXFP4, or 70B are better or worse in reasoning quality. It measures practical serving behavior.
- Qwen3.6 Q5 did not fail on the RTX 4090. It technically ran, but performance collapsed and VRAM was effectively exhausted. The correct framing is not “impossible.” The correct framing is “not a good clean-fit daily-driver configuration.”
- Real prompt labels are approximate workload names. The measured prompt tokens were about 4.8K, 9.7K, and 10.4K. They are realistic long-prompt tasks, not exact 8K, 16K, and 32K prompts.
- Dense 70B was not treated as a fair full-GPU RTX 4090 benchmark. A 42.5GB Q4 model is beyond the 4090’s 24GB VRAM. Partial offload is a different workload class.
Conclusion
The RTX 4090 is the better local AI machine when the model fits cleanly in 24GB of VRAM.
For Qwen3.6-35B-A3B Q4 and MXFP4, it is not close. The 4090 delivered roughly 2.5x faster decode, around 3x faster long-prompt ingestion, and more than 2x faster real-world wall-clock prompt serving than GB10.
It also handled Qwen3.6 Q4/MXFP4 at 64K context in this setup, which is an important result. The 4090 is not just good for short prompts. With the right quantization, it is a very strong long-context local inference box.
But the 4090’s advantage depends on staying inside the VRAM envelope.
Qwen3.6 Q5 exposed the cliff. The model technically ran on the 4090, but with only about 60 MiB of VRAM free, performance collapsed to around 12–14 tok/s decode. GB10 ran the same Q5 workload at around 57 tok/s decode with plenty of memory headroom.
That is the core lesson:
The RTX 4090 is fastest when it fits. GB10 is better when fit becomes the problem.
For daily local AI on a 4090, Qwen3.6 Q4 or MXFP4 is the practical choice.
For larger quants, dense 70B experiments, Q8/BF16 runs, and memory-heavy workflows, GB10 becomes much more compelling.
The best local AI machine is not the one with the biggest memory or the fastest GPU in isolation.
It is the one whose memory envelope matches the model you actually want to run.