Research · 25 min read

Can Local AI Agents Do Real Office Work?

A practical benchmark comparing a local NVIDIA GB10 AI agent running Qwen3.6 35B through Hermes against GPT-5.5 on real office workflows, including code repair, document synthesis, web edits, travel research, and concurrency limits.

Marcus Callahan · Updated May 27, 2026

A practical local model + Hermes benchmark against GPT-5.5

Most local AI benchmarks answer a narrow question: can a model produce the right response to a prompt?

That matters, but it is not the question operators care about most.

If a company is considering local AI hardware, the real question is more practical:

Can a private local AI agent complete the ordinary work that lands on a team during the week?

Not a clean coding puzzle. Not a synthetic chat prompt. Realistic office work: fixing a small business-rule bug, writing a postmortem, cleaning up a weak draft, updating a web page, reading internal notes, citing sources, avoiding sensitive data leaks, and leaving behind files that a human can inspect.

So I built a benchmark around that.

This was not a leaderboard. It was not an official model evaluation. It was a practical workflow trial: a local Qwen3.6 35B model running through Hermes Agent, compared against GPT-5.5 using the same prompts, fixtures, validators, and artifact review process.

The result was clear:

The local machine passed every sequential task. GPT-5.5 was faster and slightly cleaner, but the local model crossed the line from “interesting offline AI demo” to “useful private workflow pilot.”

The concurrency result was more restrictive. One local agent worked reliably. Two concurrent local agents also worked, with higher latency. Four concurrent local agents failed half the jobs in this run.

That matters because buyers do not only ask, “Can it work?” They ask, “How many people can use this system before reliability breaks down?”

That is where this benchmark became useful. It did not just test whether local AI can complete a task. It started to show the practical operating boundary.


Why I ran this test

I have seen plenty of local AI demos that look impressive for five minutes and then fall apart when asked to behave like a worker.

A worker has to use tools. A worker has to read files. A worker has to deal with a half-broken repo, not a perfect prompt. A worker has to write the output in the right place. A worker has to run the validator and revise when it fails. A worker has to know when a source is uncertain. A worker has to avoid exposing private information. A worker has to leave a reviewer with a clean trail.

That is the difference between a model giving a good answer and an agent completing a job.

The local AI question is not just about raw intelligence. It is about control.

If a company can run useful agent workflows locally, it gets a different privacy and cost profile. Internal documents can stay inside the building. Routine work can run without sending every prompt to a cloud provider. The organization can route jobs by sensitivity instead of treating every request the same way.

But privacy alone does not make a failed workflow useful.

So the benchmark had to test full workflow completion, not just model output.


The setup

The local stack was:

  • Qwen3.6 35B GGUF model
  • NVIDIA GB10-class machine used as the local inference host
  • llama.cpp / llama-server behind an OpenAI-compatible API
  • 65,536-token runtime context on the local server
  • Hermes Agent as the tool-using agent framework
  • Hermes skills for planning, test-driven development, debugging, code review, prose cleanup, and flight research

The cloud comparison was:

  • GPT-5.5 through the same Hermes harness
  • Same prompts
  • Same fixture copies
  • Same validators
  • Same scoring structure

The goal was not to prove that the local model could beat GPT-5.5. It did not.

The goal was to test whether a local 35B-class model, when placed inside a real agent harness, could complete bounded company workflows privately, repeatably, and well enough that a business owner should take it seriously.

The underlying run outputs were preserved for internal auditability, including transcripts, diffs, changed-file lists, validation results, and timing summaries.

That matters because benchmark articles can become vague quickly. I wanted the headline results tied back to actual run artifacts, not just copied into a chart. The specific local file paths are not important because they will vary by setup.


Local inference configuration

This is the section I usually look for first in local model comparison posts, so here are the configuration details captured during the run.

The local Hermes profile was gb10-local. It pointed at a custom OpenAI-compatible endpoint:

model:
  default: qwen3.6-35b-gb10
  provider: custom
  base_url: http://127.0.0.1:8087/v1

The model alias qwen3.6-35b-gb10 mapped to:

Qwen3.6-35B-A3B-UD-Q4_K_M.gguf

The earlier pilot notes recorded the local server as llama-server on 127.0.0.1:8087 with these flags:

-ngl 99 -fa on -c 65536 -b 4096 -ub 1024 -ctk f16 -ctv f16

In plain terms:

| Setting | Value used |
| --- | --- |
| Model | Qwen3.6-35B-A3B |
| File | Qwen3.6-35B-A3B-UD-Q4_K_M.gguf |
| Quantization | Q4_K_M |
| Serving stack | llama.cpp / llama-server |
| API shape | OpenAI-compatible /v1 endpoint |
| Context configured | 65,536 tokens |
| GPU layers | 99 |
| Flash attention | On |
| Batch size | 4096 |
| Ubatch size | 1024 |
| KV cache | f16 keys, f16 values |
| Hermes local reasoning setting | Low |
| Hermes local max turns | 90 profile default, with per-task caps of 45–60 turns |
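For readers who want to script the server launch, here is a minimal sketch that assembles a llama-server command line from the flags recorded above. The `-m`, `--host`, and `--port` arguments and the model path are assumptions on my part; the pilot notes only recorded the address and the tuning flags, not the full command.

```python
import shlex


def build_llama_server_cmd(model_path: str, host: str = "127.0.0.1", port: int = 8087) -> list[str]:
    """Assemble an argv list for llama-server using the flags from the pilot notes."""
    return [
        "llama-server",
        "-m", model_path,        # assumption: GGUF location varies by setup
        "--host", host,          # assumption: matches the 127.0.0.1:8087 endpoint
        "--port", str(port),
        "-ngl", "99",            # GPU layers
        "-fa", "on",             # flash attention
        "-c", "65536",           # context size in tokens
        "-b", "4096",            # batch size
        "-ub", "1024",           # ubatch size
        "-ctk", "f16",           # KV cache key type
        "-ctv", "f16",           # KV cache value type
    ]


if __name__ == "__main__":
    # Hypothetical model path, for illustration only.
    print(shlex.join(build_llama_server_cmd("models/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf")))
```

Keeping the flags in one function makes it easier to change a single setting, such as context size or batch size, and rerun the same benchmark.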

The Qwen GGUF came from the Unsloth Qwen3.6-35B-A3B GGUF set. The benchmark project also had MXFP4, Q4_K_M, Q5_K_M, Q8_0, and BF16 files downloaded for other testing, but this article focuses on the Q4_K_M local profile.

One caveat: I did not capture clean per-request llama-server telemetry for the final 24-job agent run. I have wall-clock task timings, pass/fail results, changed files, transcripts, diffs, and validator outputs. I do not have clean prompt-eval tokens/sec and decode tokens/sec per job.

So the timing numbers below should be treated as end-to-end agent wall-clock, not pure inference speed. They include model time, Hermes tool loops, file reads, shell commands, validation scripts, builds, and web access where allowed.


Hardware and system details

The machine reported itself as an NVIDIA GB10 system running Linux on ARM64.

Snapshot from the test environment:

OS: Ubuntu/Linux, kernel 6.17.0-1014-nvidia
Architecture: aarch64
CPU: 20 cores reported by lscpu
CPU family shown: Cortex-X925 and Cortex-A725
System memory: 121 GiB total
GPU: NVIDIA GB10
GPU architecture: Blackwell
NVIDIA driver: 580.126.09
CUDA version reported by nvidia-smi: 13.0

nvidia-smi did not report normal discrete-GPU memory totals for this platform, so I am treating the GB10 as the host platform for the local model rather than making the article primarily a hardware benchmark.

During the concurrency run, GPU utilization was observed around the mid-90% range. I did not collect reliable memory or KV-cache telemetry.

That limits what I can claim. I can say the box was under heavy compute load during concurrency. I cannot say exactly how much memory headroom remained or whether the four-way failure was caused by memory pressure, server scheduling, model behavior, agent timeouts, or some combination of those factors.


What I measured

I ran two sequential benchmark suites, with two repeats per agent/task.

The first suite was a FieldOps-style internal operations benchmark. It tested work that resembles what happens around a small operations platform:

  • Repair failing business logic
  • Write an incident postmortem
  • Add or repair regression behavior
  • Synthesize private account notes into an executive digest
  • Cite source files
  • Avoid leaking fake PII

The second suite was broader office work:

  • Update a small TypeScript-style microsite
  • Rewrite a weak public blog draft
  • Research MSY to AUS flights for a customer workshop

Together, that produced 24 sequential jobs:

  • 2 agents
  • 6 task types
  • 2 repeats per task
  • 12 local jobs
  • 12 cloud jobs

The task mix was intentional. A real employee does not spend all day on one type of task. They switch between code, writing, research, judgment, cleanup, and explanation. That is where agents become useful, or where they break.


How I scored it

I used two layers of scoring.

The first layer was objective validation:

  • Did the required files exist?
  • Did the tests pass?
  • Did task-specific validators pass?
  • Did the artifact include the required sections?
  • Did the output avoid banned privacy leaks or prohibited behavior?

The second layer was manual artifact quality.

I scored outputs on a 1–5 reviewer scale based on usefulness, source care, caveats, privacy behavior, and how much cleanup a human would need before using the result.

Those two layers are separate on purpose.

A validator can tell you whether the artifact satisfied the contract. It cannot tell you whether the prose has taste. It cannot tell you whether the recommendation feels careful enough for a customer. It cannot tell you whether the agent technically passed while leaving a reviewer with a messy artifact.

That distinction matters for local AI. Passing a validator means the system can be operationally useful. Manual quality tells you how much trust and review it still needs.


What the validators checked

The validators were deliberately simple, but they were not vibes-based. They checked whether the agent produced the required artifact and avoided obvious failure modes.

For the FieldOps code repair task, validation ran the Python unittest suite and checked that the required implementation plan existed at:

docs/plans/privacy_aware_ops_digest_plan.md

For the incident/postmortem task, validation ran the unittest suite and checked that the agent wrote:

docs/incidents/INC-1042-postmortem.md

For the private-docs digest task, validation checked that:

artifacts/ops_executive_digest.md

existed, cited at least 10 docs/ source paths, and did not contain raw @example.com, 555-, or 555 contact details.
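A check of this kind can be sketched in a few lines. This is not the benchmark's actual validator. The regular expressions, the choice to count distinct paths, and the function name are my assumptions about how such a check could look.

```python
import re

# Assumption: a citation is any token that starts with "docs/".
DOC_PATH = re.compile(r"docs/[\w./-]+")

# Assumption: these patterns approximate the synthetic contact details being watched for.
LEAK_PATTERNS = {
    "example.com email": re.compile(r"@example\.com", re.IGNORECASE),
    "555 phone number": re.compile(r"\b555\b"),
}


def validate_digest(text: str, min_citations: int = 10) -> list[str]:
    """Return a list of problems; an empty list means the digest passes."""
    # Strip trailing punctuation so "docs/a.md." and "docs/a.md" count as one path.
    cited = {match.rstrip(".,;:)") for match in DOC_PATH.findall(text)}
    problems = []
    if len(cited) < min_citations:
        problems.append(f"only {len(cited)} distinct docs/ citations, need {min_citations}")
    for label, pattern in LEAK_PATTERNS.items():
        if pattern.search(text):
            problems.append(f"possible leak: {label}")
    return problems
```

Returning a list of problems instead of a single boolean gives the agent something concrete to fix when it reruns the validator.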

For the website task, validation rebuilt the site and checked the generated HTML/CSS for required content and behavior:

  • Required hero text
  • Exactly three cards: Privacy, Throughput, Auditability
  • Correct labor-savings calculation: 42 minus 18 equals 24 weekly hours saved
  • Correct annual savings calculation: 24 × $85 × 52 = $106,080
  • No raw synthetic contact leaks
  • Required call to action
  • Accessibility language about keyboard navigation and focus states
  • Responsive CSS using grid/flex and a media query
  • No overclaims such as “official leaderboard” or “replace whole teams”
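The two savings figures in that list can be rechecked with a few lines of arithmetic. The variable names reflect my reading of the numbers (weekly hours before and after, an $85 hourly rate, 52 weeks); the fixture may label them differently.

```python
HOURS_BEFORE = 42       # weekly hours before the change (my reading of "42")
HOURS_AFTER = 18        # weekly hours after the change (my reading of "18")
HOURLY_RATE_USD = 85
WEEKS_PER_YEAR = 52

weekly_hours_saved = HOURS_BEFORE - HOURS_AFTER
annual_savings_usd = weekly_hours_saved * HOURLY_RATE_USD * WEEKS_PER_YEAR

print(weekly_hours_saved)           # 24
print(f"${annual_savings_usd:,}")   # $106,080
```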

For the editorial task, validation checked that artifacts/final_blog.md existed, avoided banned hype phrases, included sections for “what we measured” and “where cloud still wins,” mentioned web tasks, editing tasks, and travel-research tasks, included a practical recommendation, and had at least 450 words.
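As a rough illustration of how simple these editorial checks can be, here is a sketch of a validator with the same shape. It is not the contents of tests/validate_editorial.py. The banned-phrase list is borrowed from the website overclaim examples, and the "recommend" keyword test is my assumption.

```python
REQUIRED_SECTIONS = ("what we measured", "where cloud still wins")
REQUIRED_TOPICS = ("web", "editing", "travel")
# Assumption: illustrative phrases only; the real banned list is not reproduced here.
BANNED_PHRASES = ("official leaderboard", "replace whole teams")


def validate_editorial(text: str, min_words: int = 450) -> list[str]:
    """Return a list of problems; an empty list means the draft passes."""
    lowered = text.lower()
    problems = []
    for section in REQUIRED_SECTIONS:
        if section not in lowered:
            problems.append(f"missing section: {section}")
    for topic in REQUIRED_TOPICS:
        if topic not in lowered:
            problems.append(f"missing topic: {topic}")
    for phrase in BANNED_PHRASES:
        if phrase in lowered:
            problems.append(f"banned phrase: {phrase}")
    # Assumption: a crude keyword test stands in for "includes a practical recommendation".
    if "recommend" not in lowered:
        problems.append("missing practical recommendation")
    word_count = len(text.split())
    if word_count < min_words:
        problems.append(f"too short: {word_count} words")
    return problems
```

Keyword checks like these are easy to satisfy without writing well, which is why the manual quality score is kept separate.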

For the travel task, validation checked that artifacts/customer_workshop_flights.md existed, covered MSY to AUS, cited a search source, discussed prices or explained unavailable fare data, included at least three options, identified “best overall” and “cheapest acceptable,” included an avoid/backup option, mentioned fare volatility, and did not claim that travel was booked.

That is still not production correctness. A validator can miss subtle errors. But it is much better than reading a transcript and saying, “looks good.”


What counted as failure

A job failed if the agent process failed, timed out, missed the required artifact, or failed the validator after the run.
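Those four failure conditions are easy to encode so that every run is judged the same way. The sketch below is one possible encoding. The `JobOutcome` type and its field names are hypothetical, not taken from the harness.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobOutcome:
    agent_exit: Optional[int]      # None if the process never returned
    timed_out: bool
    artifact_exists: bool
    validator_exit: Optional[int]  # None if the validator never ran


def failure_reasons(outcome: JobOutcome) -> list[str]:
    """Return every reason the job failed; an empty list means it passed."""
    reasons = []
    if outcome.timed_out:
        reasons.append("agent timed out")
    elif outcome.agent_exit != 0:
        reasons.append("agent process exited non-zero")
    if not outcome.artifact_exists:
        reasons.append("required artifact missing")
    if outcome.validator_exit != 0:
        reasons.append("validator failed")
    return reasons
```

Collecting all reasons, instead of stopping at the first, keeps the record useful when a job fails in more than one way at once.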

Weak prose alone was not automatically an objective failure. If the editorial output passed the validator but needed polish, that showed up in the manual quality score rather than the pass/fail table.

Privacy leaks were objective failures when the validator checked for them. Booking travel, entering personal data, or claiming a booking would have been a failure for the travel task. Editing generated dist/ output instead of source would have been treated as the wrong kind of solution for the website task.

I do not want to pretend the validators measured everything. They measured task compliance. The manual review measured artifact usefulness.


Headline result

Across the two sequential suites, both agents passed every objective validator.

| Agent | Objective pass rate | Avg wall-clock/job | Total measured time | Avg manual quality |
| --- | --- | --- | --- | --- |
| GB10 local Qwen3.6 35B | 12/12 | 221.5s | 44.3 min | 4.27/5 |
| GPT-5.5 | 12/12 | 147.6s | 29.5 min | 4.42/5 |

That is the result in one table.

The local model was slower. It was rougher in a few places. GPT-5.5 wrote cleaner editorial copy and handled synthesis with more confidence. But the local model did the work. It used tools, read files, changed the repo, wrote artifacts, ran checks, and passed.

That is the part I would not have said confidently a year ago.

This does not mean the local setup is better than GPT-5.5. It means the local setup was good enough to complete bounded private workflows in this harness.

That is a smaller claim, but it is also the more useful one.


Task-level results

Here is the per-task timing summary across the repeated sequential suites.

| Suite | Task | GB10 local avg | GPT-5.5 avg | Result |
| --- | --- | --- | --- | --- |
| FieldOps | Code repair | 149.2s | 122.9s | Both passed |
| FieldOps | Incident postmortem + regression | 151.4s | 97.4s | Both passed |
| FieldOps | Private docs executive digest | 247.1s | 114.2s | Both passed |
| Office work | Website/microsite repair | 171.3s | 152.3s | Both passed |
| Office work | Editorial rewrite | 111.6s | 53.6s | Both passed |
| Office work | Flight research memo | 498.6s | 345.4s | Both passed |
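The per-task averages can be cross-checked against the headline numbers with a short script. The timings are copied from the task-level results; the dictionary layout and variable names are mine.

```python
from statistics import mean

# (GB10 local avg seconds, GPT-5.5 avg seconds) per task, from the task-level results.
TIMINGS = {
    "Code repair": (149.2, 122.9),
    "Incident postmortem + regression": (151.4, 97.4),
    "Private docs executive digest": (247.1, 114.2),
    "Website/microsite repair": (171.3, 152.3),
    "Editorial rewrite": (111.6, 53.6),
    "Flight research memo": (498.6, 345.4),
}
REPEATS = 2  # each task ran twice per agent

local_avg = mean(local for local, _ in TIMINGS.values())
cloud_avg = mean(cloud for _, cloud in TIMINGS.values())
local_total_min = sum(local for local, _ in TIMINGS.values()) * REPEATS / 60
cloud_total_min = sum(cloud for _, cloud in TIMINGS.values()) * REPEATS / 60
ratios = {task: local / cloud for task, (local, cloud) in TIMINGS.items()}

print(f"local avg {local_avg:.1f}s, cloud avg {cloud_avg:.1f}s")          # 221.5s, 147.6s
print(f"local total {local_total_min:.1f} min, cloud total {cloud_total_min:.1f} min")  # 44.3, 29.5
for task, ratio in sorted(ratios.items(), key=lambda item: -item[1]):
    print(f"{task}: local took {ratio:.2f}x the cloud time")
```

Sorting by ratio puts the private docs digest and the editorial rewrite at the top, both at a little over 2x, while the code repair and website tasks stay close to parity.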

The pattern is straightforward.

The local model was competitive on code-like tasks and bounded web edits. It lagged more on synthesis and writing. The largest relative gap was private document synthesis, where GPT-5.5 finished in less than half the time (114.2s vs. 247.1s) and produced stronger citations. The editorial rewrite showed a similar ratio (53.6s vs. 111.6s).

The flight task took the longest for both systems. That fits the job. Live travel research is messy. Sources block scraping. Prices move. Return-leg details can be less complete than outbound details. A useful answer has to say what it found without pretending the result is checkout-confirmed.

That last point matters. I would rather have an agent say, “this is planning-time fare information and should be rechecked,” than confidently invent a perfect itinerary.


FieldOps benchmark: private operations work

The FieldOps benchmark was the most important suite for comparing the local model against GPT-5.5 in private-AI workflows.

Public web tasks are useful, but companies do not buy local hardware only to rewrite public blog posts. They buy it because they want an agent close to private code, internal documents, customer context, policies, support notes, and operational history.

The FieldOps tasks tested that kind of work.

In the code repair task, the agent had to inspect a small operations codebase, understand failing behavior, make the right change, and pass the validator. This is the core use case for an internal agent: take a bounded bug, touch the repo, and prove the fix.

In the incident/postmortem task, the agent had to do more than patch code. It had to explain what happened, describe the regression, and produce a postmortem artifact. That combines engineering work with operational communication, which is exactly the type of task that consumes time on small teams.

In the private-docs digest task, the agent had to synthesize internal notes and cite source paths while avoiding fake PII leaks. This is where local AI becomes interesting. The cloud model was faster and cleaner, but the local model still produced a useful digest and passed the privacy checks.

The local agent was not perfect. GPT-5.5 tended to produce more granular citations and cleaner synthesis. But the local agent stayed inside the rails. For many internal workflows, that is the first bar.


Office-work benchmark: the messy work people actually ask for

The office-work suite was designed to feel less like a coding benchmark and more like an actual day at a small company.

The website task tested whether the agent could make a small product/marketing change without wrecking the project. It had to work in source files, build the site, and pass validation. Both agents passed. The local model was slower, but it did the right kind of work.

The editorial task tested taste and restraint. The input was a weak draft with hype and overclaiming. The agent had to turn it into something more publishable without making it generic. GPT-5.5 was better here. It produced cleaner structure and stronger caveats in less time. The local model produced usable drafts, but they needed more human editing.

That is not a dealbreaker. It is exactly the type of boundary worth knowing. A local agent can be useful for first drafts and internal writing, but I would not publish local-agent prose untouched.

The travel task tested live research and uncertainty handling. The agent had to search realistic MSY to AUS flight options for two adults within a date window, rank candidates, include prices or explain price limitations, and avoid booking anything. Both agents passed. GPT-5.5 was better at caveating source limitations. The local model was adequate, but a human reviewer would need to recheck return legs and prices before acting.

That is realistic. Travel research is not just retrieval. It is a judgment task under uncertainty.


What the local model did well

The local model was strongest when the task had rails:

  • Read these files
  • Make this artifact
  • Modify this repo
  • Run this validator
  • Cite these source paths
  • Avoid these privacy leaks
  • Do not use network unless allowed
  • Report what changed

That is not a toy category. A lot of valuable company work looks exactly like this.

Inside the Hermes agent loop, the local model repaired code, wrote planning artifacts, created postmortems, produced executive digests, edited web source files instead of patching generated output, and passed the external checks. It also handled the private-doc task without leaking the synthetic customer emails or phone numbers the validator was watching for.

It benefited from the agent harness. This was not a raw chat box. Hermes gave the model tools, files, skills, validation loops, and procedural memory. The model still had to reason, but the harness shaped the work.

That is one of the biggest takeaways. Local AI should not be judged only as a standalone model. In practice, companies will use it as part of an agent loop: skills, tools, policies, validators, approvals, logs, and routing. The base model matters, but the worker system matters too.


Where GPT-5.5 still won

GPT-5.5 was faster overall:

  • 147.6 seconds average per job for GPT-5.5
  • 221.5 seconds average per job for GB10 local

That is roughly a 1.5× speed advantage for GPT-5.5 across this benchmark.

It also scored slightly higher on manual quality:

  • 4.42/5 for GPT-5.5
  • 4.27/5 for GB10 local

The quality gap was not huge, but it was visible.

GPT-5.5 was better at editorial polish. It handled caveats better in the flight memo. It cited more granular sources in the private document digest. It tended to reach the useful shape of an answer with less wandering.

The local model was good enough. The cloud model was more comfortable.

That is probably the most honest split right now.

If the job is private, bounded, and reviewable, local looks attractive. If the job is time-sensitive, ambiguous, customer-facing, or dependent on polished synthesis, the cloud model still has the advantage.


The concurrency test

Sequential benchmarks are useful, but they dodge a business question: can this local model stack support more than one worker at a time?

That matters because the pitch for local AI hardware often drifts into “AI employees.” One assistant is useful. A small pool of private agents is a different value proposition. But you do not get that from a single-agent benchmark.

So I added a local concurrency stress test.

This test used deterministic local tasks only: the website repair and the editorial rewrite. I left out live flight research because network variance would muddy the result. The test ran one, two, and four concurrent local Hermes jobs against the locally hosted Qwen3.6 35B model.

| Local concurrency | Passed | Wall clock | Notes |
| --- | --- | --- | --- |
| 1 job | 1/1 | 163.5s | Website task passed |
| 2 jobs | 2/2 | 276.7s | Website and editorial both passed |
| 4 jobs | 2/4 | 373.5s | Both website jobs passed; both editorial jobs failed |

The four-job result is important. The failed editorial jobs did not merely produce weak prose. They produced no final artifact, so the validator reported artifacts/final_blog.md missing along with every required section failure.

The details were useful:

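| Parallelism | Job | Task | Agent exit | Validator exit | Result |
| --- | --- | --- | --- | --- | --- |
| 4 | 1 | Website | 0 | 0 | Pass |
| 4 | 2 | Editorial | 0 | 1 | Fail: missing final artifact |
| 4 | 3 | Website | 0 | 0 | Pass |
| 4 | 4 | Editorial | 1 | 1 | Fail: missing final artifact |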
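A stress test with this shape does not need much machinery. The sketch below is a simplified stand-in for the real runner, which is not reproduced here. It launches N jobs at once, runs each job's validator afterward, and records both exit codes plus total wall-clock time. The command lists are placeholders for whatever starts your agent and validator.

```python
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor


def run_job(agent_cmd: list[str], validator_cmd: list[str], timeout_s: float = 900) -> dict:
    """Run one agent job, then its validator, and report both exit codes."""
    try:
        agent_exit = subprocess.run(agent_cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        agent_exit = -1  # assumption: a timeout is recorded as an agent failure
    # The validator runs even after an agent failure, so both columns are always filled.
    validator_exit = subprocess.run(validator_cmd).returncode
    return {
        "agent_exit": agent_exit,
        "validator_exit": validator_exit,
        "passed": agent_exit == 0 and validator_exit == 0,
    }


def run_concurrent(jobs: list[tuple[list[str], list[str]]]) -> tuple[list[dict], float]:
    """Start all jobs at once and return their results plus total wall-clock seconds."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        results = list(pool.map(lambda job: run_job(*job), jobs))
    return results, time.monotonic() - start


if __name__ == "__main__":
    # Stand-in commands so the sketch runs anywhere; swap in real agent and validator commands.
    ok = [sys.executable, "-c", "raise SystemExit(0)"]
    bad = [sys.executable, "-c", "raise SystemExit(1)"]
    print(run_concurrent([(ok, ok), (ok, bad)]))
```

Recording the agent exit and the validator exit separately is what makes a case like job 2 visible: the agent reported success but the artifact was never written.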

That split matters. The local system did not simply collapse under four jobs. It still completed both website tasks. The failures concentrated on the editorial jobs, and both failures had the same visible symptom: no artifacts/final_blog.md.

That makes the failure more interesting than a generic “too slow” result. It may have been model behavior under load. It may have been server queueing. It may have been agent-loop fragility. It may have been timeout or context pressure. The run did not collect enough low-level telemetry to prove the root cause.

But from an operator’s perspective, the visible symptom is enough: at four concurrent local workers, useful artifacts started going missing.

My read is simple: this setup is comfortable as one local worker and plausible as two concurrent workers if slower responses are acceptable. Four concurrent workers is too aggressive for this configuration without better scheduling, retries, batching changes, a smaller model, or a different serving setup.

That does not make the local box useless. It makes capacity planning real.

If you are buying local AI hardware, this is the kind of test you should run before promising the business that one machine can support a team. Single-agent success does not automatically mean you have a local AI department in a box.


Why the failed four-way run matters

I am glad the four-way run failed.

A perfect concurrency chart would have looked cleaner, but it would have been less useful. The failure gives us a boundary. It says local agent capacity is not just a question of whether the model fits in memory or whether a server accepts parallel requests. The worker loop has to finish the actual job.

For local agents, throughput is not enough. Pass rate matters. Artifact completeness matters.

The question is not, “Can the server generate tokens for four requests?”

The question is, “Can four workers complete useful tasks at the same time?”

In this run, the answer was no.

That tells me the next engineering work is orchestration:

  • Queueing instead of unconstrained parallelism
  • Retries for missing artifacts
  • Per-task timeouts
  • Better progress checks
  • Model routing by task type
  • Smaller or faster local models for lower-stakes work
  • Cloud escalation when local pressure is high
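The first two items on that list, bounded queueing and retries for missing artifacts, can be sketched together. This is an illustration of the idea, not something tested in this benchmark. The function names are hypothetical, and the cap of two workers simply mirrors the concurrency level that held up in this run.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable


def run_with_retry(run_job: Callable[[], None], artifact: Path, max_attempts: int = 2) -> int:
    """Rerun the job until its artifact exists. Return attempts used, or 0 if it never appeared."""
    for attempt in range(1, max_attempts + 1):
        run_job()  # a real job would call the agent with its own per-task timeout
        if artifact.exists():
            return attempt
    return 0


def run_queue(jobs: list[tuple], max_workers: int = 2) -> list[int]:
    """Process jobs through a bounded pool instead of starting them all at once."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda job: run_with_retry(*job), jobs))


def demo() -> list[int]:
    """Simulate a job that only writes its artifact on the second attempt."""
    with tempfile.TemporaryDirectory() as tmp:
        artifact = Path(tmp) / "final_blog.md"
        calls = {"count": 0}

        def flaky_job() -> None:
            calls["count"] += 1
            if calls["count"] == 2:
                artifact.write_text("draft")

        return run_queue([(flaky_job, artifact, 3)])


if __name__ == "__main__":
    print(demo())  # [2]
```

The artifact check sits outside the agent on purpose: the retry is triggered by a missing file, not by the agent's own report of success.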

That is how this becomes an actual deployment pattern instead of a demo.


Practical recommendation

For this GB10 + Hermes setup, I would use local agents for:

  • Private codebase maintenance with tests and validators
  • Internal document synthesis where privacy matters more than polish
  • Draft generation that a human will edit
  • Structured operational writeups
  • Small web/content changes with clear acceptance checks
  • Workflows where the artifact stays inside the company

I would still route to a cloud model for:

  • High-stakes external writing
  • Fast turnaround synthesis
  • Broad ambiguous reasoning
  • Customer-facing recommendations with weak source data
  • Anything where the first draft needs to be close to final
  • Bursty multi-agent workloads beyond one or two concurrent local jobs

The right architecture is probably not “all local” or “all cloud.” It is policy-aware routing.

Keep private, bounded, reviewable work local. Escalate harder or more time-sensitive work to cloud. Use validators either way.
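As a sketch of what policy-aware routing could look like in code, here is a small routing function. The `Job` fields, the route labels, and the rule that private jobs queue locally instead of escalating are my assumptions. The default cap of two concurrent local jobs reflects the boundary seen in this run.

```python
from dataclasses import dataclass


@dataclass
class Job:
    contains_private_data: bool
    customer_facing: bool
    deadline_sensitive: bool
    has_validator: bool


def route(job: Job, local_jobs_running: int, max_local_concurrency: int = 2) -> str:
    """Pick a destination: 'local', 'local-queue', or 'cloud'."""
    local_has_capacity = local_jobs_running < max_local_concurrency
    if job.contains_private_data:
        # Assumption: private work waits for a local slot instead of leaving the building.
        return "local" if local_has_capacity else "local-queue"
    if job.customer_facing or job.deadline_sensitive or not job.has_validator:
        return "cloud"
    # Bounded, reviewable, non-sensitive work runs locally when there is room.
    return "local" if local_has_capacity else "cloud"
```

For example, `route(Job(True, False, False, True), local_jobs_running=2)` returns `"local-queue"`, while the same load with a non-private, validated job returns `"cloud"`.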


What buyers should copy

If you are evaluating local AI hardware, do not start with public benchmark scores. Start with ten tasks your company actually did last month.

Good candidates include:

  • A small broken test suite
  • A customer-support digest
  • A bad blog draft
  • A stale landing page
  • A policy rewrite
  • A travel memo
  • An incident writeup
  • A spreadsheet cleanup
  • A private-doc synthesis task
  • A code review with acceptance criteria

Then run the same tasks through your local stack and your cloud stack.

Track the boring numbers:

  • Pass/fail against objective checks
  • Wall-clock time
  • Changed files
  • Privacy mistakes
  • Factual mistakes
  • Number of retries
  • Reviewer cleanup effort
  • Whether the final artifact is actually useful
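One way to keep those numbers honest is to record them per job in a fixed structure and summarize per stack. The record fields below are a hypothetical layout based on the list above, not the schema used by this benchmark's harness.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class JobRecord:
    task: str
    stack: str                 # e.g. "local" or "cloud"
    passed: bool
    wall_clock_s: float
    retries: int = 0
    privacy_mistakes: int = 0
    factual_mistakes: int = 0


def summarize(records: list[JobRecord]) -> dict[str, dict[str, float]]:
    """Group records by stack and compute the per-stack totals and averages."""
    by_stack: dict[str, list[JobRecord]] = {}
    for record in records:
        by_stack.setdefault(record.stack, []).append(record)
    return {
        stack: {
            "pass_rate": sum(r.passed for r in rows) / len(rows),
            "avg_wall_clock_s": mean(r.wall_clock_s for r in rows),
            "retries": sum(r.retries for r in rows),
            "privacy_mistakes": sum(r.privacy_mistakes for r in rows),
            "factual_mistakes": sum(r.factual_mistakes for r in rows),
        }
        for stack, rows in by_stack.items()
    }
```

Reviewer cleanup effort and overall usefulness are left out of the structure deliberately; they are judgment calls and fit better in a separate manual score.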

Also test concurrency early. Do not wait until after you have sold the organization on a local AI rollout. Run one job, two jobs, and four jobs. Watch where latency rises. Watch where artifacts start going missing. Watch whether the system fails gracefully or silently.

A fast answer that needs a full rewrite is not fast. A slower answer with a clean diff, citations, and passing tests may be the better worker. A concurrent job that never writes the required file is not a worker at all. It is just load.


For hobbyists trying this at home

If you are an AI hobbyist trying to copy this pattern, I would not start by chasing the largest model you can barely load.

Start with one local agent and a boring task. Give it a folder, a clear mission, and a validator. Make the validator external to the model. Do not ask the model whether it succeeded. Make it write a file, run a script, and pass or fail.

Then add difficulty in this order:

  1. File inspection only
  2. One-file edit with tests
  3. Multi-file edit with tests
  4. Document synthesis with citations
  5. Private-data redaction
  6. Live web research
  7. Two concurrent agents
  8. Four concurrent agents

Measure wall-clock time, not just tokens/sec. Tokens/sec is useful, but agents spend time reading files, running commands, waiting for builds, recovering from mistakes, and revising after validators fail.

Also, use smaller tasks than you think you need. A 10-minute agent run may be acceptable for a company workflow, but it is painful for iteration. The fastest way to improve a local setup is to build tiny repeatable tasks where you can tell whether the change helped.

My home-lab rule after this run: if an agent cannot produce a required artifact reliably, do not scale it up. Fix the loop first.


Economics: local vs cloud depends on workload

I did not include a full cost model in this run, and I would not trust one without longer utilization data. But the shape is clear.

Local hardware has an upfront cost. Cloud models have a usage cost. The local box becomes more attractive when it runs private internal work all day, especially if that work would otherwise send sensitive context to a cloud model. Cloud becomes more attractive when work is bursty, ambiguous, deadline-sensitive, or quality-sensitive.

Concurrency changes the economics. If one box is comfortable with one worker and plausible with two, that is useful. But it is not the same business case as four or eight reliable workers. The four-way failure means I would be careful about ROI claims until the scheduler and retry story is stronger.

Power also belongs in the next version of this benchmark. I did not capture idle watts, single-agent watts, two-agent watts, four-agent watts, or watt-hours per completed task. AI hobbyists will care about that, and they should. A local setup that looks cheap on token cost can still be inefficient if jobs run long under load.
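The metric itself is simple once the measurements exist. The sketch below shows the calculation only. No power data was captured in this run, so any wattage passed in is a placeholder, and the function name is mine.

```python
def watt_hours_per_completed_task(avg_watts: float, wall_clock_s: float, completed: int) -> float:
    """Energy for the whole run, divided by the number of jobs that actually passed."""
    if completed <= 0:
        raise ValueError("no completed tasks to spread the energy over")
    total_watt_hours = avg_watts * wall_clock_s / 3600  # watts x hours
    return total_watt_hours / completed
```

Dividing by completed tasks instead of attempted tasks is the point. In a run like the four-way test, where only two of four jobs produced an artifact, the energy spent on the failed jobs still counts against the two that passed.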


Reproducibility

The benchmark fixtures are synthetic, but they are shaped like real internal work. That was deliberate. I wanted tasks that could be shared without exposing real customer data while still testing the workflows that matter: code repair, incident writing, private-doc synthesis, public web edits, editorial cleanup, and travel research.

To reproduce this cleanly, someone would need:

  • Fixture directories
  • Benchmark runner scripts
  • Validation scripts
  • Exact prompts
  • Hermes profiles
  • Local model server config
  • Result aggregation script

The harness already preserves transcripts, validation JSON, git diffs, changed-file lists, and run metadata under the run directories. Before publishing the repo, I would strip machine-specific paths, remove irrelevant local cache files, and package a README that explains how to run local and cloud profiles side by side.


Example task prompt

Here is one representative prompt from the office-work suite. This was the editorial task, which is useful because it tests instruction following, taste, and artifact creation without relying on live web access.

You are benchmarking a real-world editorial/content AI employee. Work in the current directory only.

Use humanizer, writing-plans, and requesting-code-review skills if available.

Mission:
1. Read README.md, docs/editorial_brief.md, artifacts/draft_blog.md, and tests/validate_editorial.py.
2. Rewrite the bad draft into artifacts/final_blog.md for Subterra Technologies readers.
3. Make the piece credible, specific, and useful. It should explicitly frame web tasks, editing tasks, and travel-research tasks as more realistic AI-employee work than isolated coding puzzles.
4. Avoid hype and official-leaderboard style claims.
5. Run `python3 tests/validate_editorial.py` and revise until it passes.
6. Final answer must include word-count estimate, what changed, and remaining editorial risks.

Do not use network for this task.

That prompt is not trying to be clever. It gives the agent a job, points it at files, names the output artifact, names the validator, and defines what not to do. That is the pattern I trust most for local agents right now.


What this does not prove

This benchmark does not prove that Qwen3.6 35B is generally better than other local models.

It does not prove that GB10 is the best local AI workstation.

It does not prove production safety.

It does not prove that local AI beats GPT-5.5.

It does not prove that one box can support a whole team of agents.

What it does show is narrower and more useful: with a real agent harness, clear task boundaries, validators, and reviewable artifacts, a local 35B-class model can complete a meaningful set of private office workflows.

That is enough to justify a pilot. It is not enough to skip evaluation.


Caveats

This is a practical workflow benchmark, not an official leaderboard result.

The fixtures are realistic, but they are still fixtures. The validators catch objective failures and obvious privacy mistakes. They do not replace human review.

Manual quality scoring is subjective. I kept it separate from objective pass/fail for that reason. Treat the 4.27 vs. 4.42 quality scores as reviewer notes, not universal model truth.

The flight data was planning data, not checkout-confirmed airfare. The agents were explicitly told not to book anything, spend money, or enter personal data.

The concurrency test used one hardware/software setup and one local model configuration. Different quantization, batching, server settings, context length, model choice, scheduler behavior, or retry policy could change the capacity curve.

I did not capture clean tokens/sec, prompt-eval, decode-rate, wattage, or per-request KV-cache telemetry for the final agent runs. That is the biggest technical gap for readers who want to tune the exact same stack. The current data is stronger as an end-to-end workflow study than as a low-level inference benchmark.

I also would not overfit on small timing differences. A 20-second gap in a local run can come from tool behavior, build steps, or source access. The bigger patterns are more meaningful: GPT-5.5 was consistently faster overall, local was good enough on bounded tasks, and four concurrent local agents crossed a reliability line in this setup.

The strongest claim here is not “local beats GPT-5.5.”

It does not.

The strongest claim is this: a local 35B-class agent on a GB10-class box can complete bounded private office workflows with tools, skills, validators, and reviewable artifacts. In this run, it passed 12 out of 12 sequential local jobs. It was slower than GPT-5.5 and slightly rougher, but it was good enough to pilot.


Bottom line

This is the first local-agent result that feels operationally interesting to me.

Not because the local model won. It did not.

Because it was useful anyway.

The local model stack handled private code repair, internal docs, postmortems, web edits, editorial drafts, and travel research well enough to pass objective checks across repeated runs. That is a real threshold. It means local AI agents are no longer just chat demos or offline toys. With the right harness, they can do bounded company work.

But the concurrency test keeps the conclusion grounded. One worker looked solid. Two worked with latency. Four broke down.

So my recommendation is cautious but positive: if you have sensitive internal workflows, local AI hardware is worth piloting now. Do not buy it expecting a frontier cloud replacement. Buy it if you have private, repeatable work where control matters, validators are available, and humans stay in the loop.

That is less flashy than “local AI beats the cloud.”

It is also more useful.

The next test I would run is a scheduler-aware version of the same benchmark: queue local jobs, add automatic retries for missing artifacts, compare one larger local model against a smaller faster local model, and route overflow to cloud. That is the real product shape: not one model in isolation, but a controlled local-first agent system that knows when to stay private and when to escalate.
