Quarterly research · Q3 2026 · Production labelled test sets
LLM Production Benchmark Q3 2026
Honest benchmark of Claude (Sonnet 4.6, Opus 4.7, Haiku 4.5), GPT-4 (4o, Turbo), and Gemini 2.5 (Pro, Flash) on our internal labelled test sets covering document processing, customer service, and compliance review workflows.
Headline findings (Q3 2026)
- Best quality / cost ratio: Claude Sonnet 4.6 with prompt caching for document processing and compliance review.
- Best multimodal + tool use: GPT-4o for workflows requiring vision + structured tool calling.
- Best cheap classification: Gemini 2.5 Flash for simple intent classification at $0.30/MTok.
- Best long-context retrieval: Claude Sonnet 4.6 (200k+ context handling without quality degradation).
- Caching impact: Anthropic prompt caching reduces inference cost 58% at 70% cache hit rate on production system prompts.
Methodology
We test each model on three internal labelled test sets representing production workflows:
- Document processing: 240 cases, mixed PDF/scanned/native documents, structured extraction + validation. Sourced from anonymised production engagements.
- Customer service: 180 cases, mixed intent classification + response drafting. Sourced from anonymised e-commerce and SaaS engagements.
- Compliance review: 120 cases, policy-clause comparison + flagging. Sourced from anonymised banking and legal engagements.
Metrics measured: accuracy (vs labelled expected output), F1 (harmonic mean of precision and recall), p95 latency (ms), cost per 1000 cases ($), hallucination rate (% of outputs with unsupported claims).
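As a rough illustration of how these metrics can be computed from per-case results, here is a minimal Python sketch. The function names, the binary treatment of F1, and the nearest-rank percentile method are assumptions made for the example and are not taken from our evaluation harness.

```python
def accuracy(expected, predicted):
    """Fraction of cases where the output equals the labelled expected output."""
    return sum(e == p for e, p in zip(expected, predicted)) / len(expected)


def f1_score(expected, predicted, positive=True):
    """Binary F1: harmonic mean of precision and recall for the `positive` label."""
    pairs = list(zip(expected, predicted))
    tp = sum(e == positive and p == positive for e, p in pairs)
    fp = sum(e != positive and p == positive for e, p in pairs)
    fn = sum(e == positive and p != positive for e, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def latency_p95(latencies_ms):
    """95th percentile by the nearest-rank method (an assumed convention)."""
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def hallucination_rate(has_unsupported_claim):
    """Percentage of outputs flagged as containing an unsupported claim."""
    return 100.0 * sum(has_unsupported_claim) / len(has_unsupported_claim)
```

For example, `f1_score([True, False, True, True], [True, False, False, True])` has precision 1.0 and recall 2/3, so the harmonic mean is 0.8.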
Results: document processing
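Quality-cost ratio is (accuracy × F1) / cost per 1000 cases. The list follows that ratio in descending order, except for Claude Haiku 4.5, which has the highest ratio on these numbers but the lowest accuracy and is listed last: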
- Claude Sonnet 4.6 (cached): accuracy 94.2%, F1 0.91, latency p95 1850ms, $0.42 / 1000 cases.
- Gemini 2.5 Pro: accuracy 91.8%, F1 0.88, latency p95 1620ms, $0.51 / 1000 cases.
- GPT-4o: accuracy 92.5%, F1 0.89, latency p95 1730ms, $0.78 / 1000 cases.
- Claude Opus 4.7: accuracy 96.1%, F1 0.94, latency p95 2440ms, $1.85 / 1000 cases.
- Claude Haiku 4.5: accuracy 87.4%, F1 0.83, latency p95 920ms, $0.09 / 1000 cases.
Verdict: Claude Sonnet 4.6 with caching is the production default for document processing in 2026. Opus is the "hard case" fallback when accuracy justifies roughly 4.4× the cost ($1.85 vs $0.42 per 1000 cases).
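The quality-cost ratio for the document processing results can be reproduced with a few lines of Python. The figures are copied from the list above; the variable and function names are illustrative only.

```python
# (model, accuracy, F1, cost in $ per 1000 cases), taken from the results list
RESULTS = [
    ("Claude Sonnet 4.6 (cached)", 0.942, 0.91, 0.42),
    ("Gemini 2.5 Pro", 0.918, 0.88, 0.51),
    ("GPT-4o", 0.925, 0.89, 0.78),
    ("Claude Opus 4.7", 0.961, 0.94, 1.85),
    ("Claude Haiku 4.5", 0.874, 0.83, 0.09),
]


def quality_cost_ratio(acc, f1, cost_per_1000):
    """(accuracy x F1) / cost per 1000 cases."""
    return acc * f1 / cost_per_1000


# Highest ratio first
ranked = sorted(RESULTS, key=lambda r: quality_cost_ratio(*r[1:]), reverse=True)

for name, acc, f1, cost in ranked:
    print(f"{name}: {quality_cost_ratio(acc, f1, cost):.2f}")
```

On these numbers the ratios come out at about 8.06 for Haiku 4.5, 2.04 for Sonnet 4.6 (cached), 1.58 for Gemini 2.5 Pro, 1.06 for GPT-4o, and 0.49 for Opus 4.7. The ratio on its own favours the cheapest model, so it is best read together with a minimum accuracy requirement for the workflow.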
Results: customer service (intent + drafting)
- Claude Sonnet 4.6 (cached): intent accuracy 96.1%, drafting CSAT 4.6/5, latency p95 1280ms, $0.18 / 1000 cases.
- GPT-4o: intent accuracy 95.8%, drafting CSAT 4.5/5, latency p95 1180ms, $0.38 / 1000 cases.
- Gemini 2.5 Flash: intent accuracy 92.4%, drafting CSAT 4.3/5, latency p95 720ms, $0.07 / 1000 cases.
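Verdict: Gemini 2.5 Flash for intent classification on routine support, Claude Sonnet 4.6 for the drafting layer. The combined router pattern reduces cost vs single-model Sonnet; going by the per-1000-case figures above ($0.07 vs $0.18), the saving is bounded at roughly 61%.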
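A minimal sketch of the two-stage pattern described in the verdict is shown below. The model calls are replaced by stub functions so the example runs offline; `handle_ticket`, `stub_classify`, `stub_draft`, and the intent labels are all hypothetical names, and a real deployment would swap the stubs for calls to the cheap classification model and the stronger drafting model.

```python
from typing import Callable, Dict


def handle_ticket(
    text: str,
    classify_intent: Callable[[str], str],
    draft_reply: Callable[[str, str], str],
) -> Dict[str, str]:
    """Stage 1: cheap model labels the intent. Stage 2: stronger model drafts."""
    intent = classify_intent(text)
    reply = draft_reply(text, intent)
    return {"intent": intent, "reply": reply}


# Stubs standing in for real model calls (assumptions for illustration only).
def stub_classify(text: str) -> str:
    return "order_status" if "order" in text.lower() else "other"


def stub_draft(text: str, intent: str) -> str:
    return f"[{intent}] Thanks for getting in touch. We are looking into it."


result = handle_ticket("Where is my order?", stub_classify, stub_draft)
print(result["intent"], "->", result["reply"])
```

Passing the two stages in as callables keeps the routing logic independent of any particular provider SDK, which makes it straightforward to rerun the same test set against different model pairings.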
Results: compliance review
- Claude Opus 4.7: accuracy 95.6%, false-positive 4.1%, latency p95 3120ms, $4.20 / 1000 cases.
- Claude Sonnet 4.6 (cached): accuracy 91.8%, false-positive 6.8%, latency p95 1940ms, $0.62 / 1000 cases.
- GPT-4 Turbo: accuracy 93.1%, false-positive 5.9%, latency p95 2180ms, $1.45 / 1000 cases.
Verdict: compliance review is where Opus earns its premium. Opus's 4.1% false-positive rate vs Sonnet's 6.8% means ~40% fewer false flags and correspondingly less reviewer time, which more than offsets Opus's cost premium for this workflow.
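The arithmetic behind the verdict can be laid out as a short Python sketch. It assumes that the false-positive rate is a share of all reviewed cases and that reviewer time scales linearly with the number of false flags; both are simplifying assumptions made for the example, and the variable names are illustrative.

```python
CASES = 1000

# Figures from the compliance review results above
OPUS = {"fp_rate": 0.041, "cost_per_1000": 4.20}
SONNET = {"fp_rate": 0.068, "cost_per_1000": 0.62}

# False flags a reviewer no longer has to dismiss per 1000 cases
fp_avoided = CASES * (SONNET["fp_rate"] - OPUS["fp_rate"])

# Relative reduction in false flags (the "~40%" in the verdict)
relative_reduction = 1 - OPUS["fp_rate"] / SONNET["fp_rate"]

# Extra model spend per 1000 cases when using Opus instead of Sonnet
cost_premium = OPUS["cost_per_1000"] - SONNET["cost_per_1000"]

# Reviewer cost per false flag above which Opus pays for itself
break_even_per_flag = cost_premium / fp_avoided

print(f"False flags avoided per 1000 cases: {fp_avoided:.0f}")
print(f"Relative reduction: {relative_reduction:.1%}")
print(f"Opus cost premium per 1000 cases: ${cost_premium:.2f}")
print(f"Break-even reviewer cost per false flag: ${break_even_per_flag:.2f}")
```

Under these assumptions Opus avoids about 27 false flags per 1000 cases for an extra $3.58, so it breaks even once dismissing a false flag costs more than about $0.13 of reviewer time.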
Caching impact (Anthropic)
Anthropic prompt caching had the largest single cost-reduction impact in our benchmarks. At a 70% cache hit rate (typical for production system prompts of 4-12k tokens), input-token cost works out as follows:
- Sonnet input cost reduction: 70% × $0.30 (cached) + 30% × $3.00 (uncached) = $1.11/MTok blended, vs $3.00 uncached. 63% savings.
- Haiku input cost reduction: 70% × $0.08 + 30% × $0.80 = $0.296/MTok blended (≈ $0.30), vs $0.80 uncached. 63% savings.
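The blended-cost formula used in both bullets can be written as a small Python helper. This is a sketch of the same arithmetic, with prices taken from the bullets above; like that arithmetic, it covers input tokens only and does not model any surcharge for writing to the cache. The function names are illustrative.

```python
def blended_input_cost(hit_rate, cached_price, uncached_price):
    """Expected input price per MTok given a cache hit rate between 0 and 1."""
    return hit_rate * cached_price + (1 - hit_rate) * uncached_price


def savings(blended_price, uncached_price):
    """Fractional saving relative to paying the uncached price for everything."""
    return 1 - blended_price / uncached_price


HIT_RATE = 0.70

sonnet_blended = blended_input_cost(HIT_RATE, 0.30, 3.00)
haiku_blended = blended_input_cost(HIT_RATE, 0.08, 0.80)

print(f"Sonnet: ${sonnet_blended:.2f}/MTok, {savings(sonnet_blended, 3.00):.0%} saved")
print(f"Haiku:  ${haiku_blended:.3f}/MTok, {savings(haiku_blended, 0.80):.0%} saved")
```

Because the cached price is one tenth of the uncached price for both models, the saving depends only on the hit rate: 0.9 × 70% = 63% in each case.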
What we use in production
Across our active engagements in Q3 2026:
- Default workflow model: Claude Sonnet 4.6 with caching
- Classification + routing: Claude Haiku 4.5 or Gemini 2.5 Flash
- Compliance review + hard cases: Claude Opus 4.7
- Vision + tool-use heavy: GPT-4o
We use the claude-multi-model-router for cost-optimised routing across these.
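The mapping above can be expressed as a simple lookup table. The Python sketch below is purely illustrative and does not reflect the interface of claude-multi-model-router; the workload keys and the `pick_model` function are hypothetical, and the model strings are the display names used in this report rather than API identifiers.

```python
# Workload type -> candidate models in order of preference (hypothetical keys)
ROUTES = {
    "default": ["Claude Sonnet 4.6 (cached)"],
    "classification": ["Claude Haiku 4.5", "Gemini 2.5 Flash"],
    "compliance_review": ["Claude Opus 4.7"],
    "hard_case": ["Claude Opus 4.7"],
    "vision_tool_use": ["GPT-4o"],
}


def pick_model(workload: str) -> str:
    """Return the first-choice model for a workload, falling back to the default."""
    candidates = ROUTES.get(workload, ROUTES["default"])
    return candidates[0]


print(pick_model("classification"))
print(pick_model("compliance_review"))
print(pick_model("something_unlisted"))
```

Unknown workload types fall through to the default model, which mirrors the "default workflow model" entry in the list.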
Next benchmark (Q4 2026)
Q4 2026 benchmark will add: emerging open-weight models (Llama 4, Mistral Large 3), specialised reasoning models, and multimodal-heavy workflows. Subscribe to our updates or check back in October 2026.