Honest benchmark of Claude (Sonnet 4.6, Opus 4.7, Haiku 4.5), GPT-4 (4o, Turbo), and Gemini 2.5 (Pro, Flash) on our internal labelled test sets covering document processing, customer service, and compliance review workflows.

Published 20 May 2026 · Last reviewed 7 June 2026 · CC BY 4.0, free to cite with attribution

Methodology

We test each model on three internal labelled test sets representing production workflows:

Document processing: 240 cases, mixed PDF/scanned/native documents, structured extraction + validation. Sourced from anonymised production engagements.
Customer service: 180 cases, mixed intent classification + response drafting. Sourced from anonymised e-commerce and SaaS engagements.
Compliance review: 120 cases, policy-clause comparison + flagging. Sourced from anonymised banking and legal engagements.

Metrics measured: accuracy (vs labelled expected output), F1 (harmonic mean of precision and recall), latency p95 (ms), cost per 1000 cases ($), hallucination rate (% of outputs with unsupported claims).
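As an illustration of how three of these metrics can be computed, here is a minimal Python sketch. It is not the harness behind the published numbers: exact-match scoring for accuracy and the nearest-rank method for p95 are assumptions, and the function names are hypothetical.

```python
def accuracy(predicted, expected):
    """Share of cases whose output exactly matches the labelled expected output."""
    if len(predicted) != len(expected) or not expected:
        raise ValueError("need two non-empty lists of equal length")
    return sum(p == e for p, e in zip(predicted, expected)) / len(expected)


def f1_score(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall, computed from raw counts."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


def p95(latencies_ms):
    """Nearest-rank 95th percentile of a list of latency samples (assumed method)."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]
```

Because F1 is a harmonic mean, it sits closer to the weaker of precision and recall, so a model cannot score well by maximising only one of them.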

Results: document processing

Sorted by quality-cost ratio, defined as (accuracy × F1) / cost per 1000 cases, highest first:

Claude Haiku 4.5: accuracy 87.4%, F1 0.83, latency p95 920ms, $0.09 / 1000 cases.
Claude Sonnet 4.6 (cached): accuracy 94.2%, F1 0.91, latency p95 1850ms, $0.42 / 1000 cases.
Gemini 2.5 Pro: accuracy 91.8%, F1 0.88, latency p95 1620ms, $0.51 / 1000 cases.
GPT-4o: accuracy 92.5%, F1 0.89, latency p95 1730ms, $0.78 / 1000 cases.
Claude Opus 4.7: accuracy 96.1%, F1 0.94, latency p95 2440ms, $1.85 / 1000 cases.

Verdict: Claude Sonnet 4.6 with caching is the production default for document processing in 2026. Opus is the "hard case" fallback when accuracy justifies roughly 4.4× the cost ($1.85 vs $0.42 per 1000 cases).
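The quality-cost ratio can be recomputed from the document-processing figures with a short Python sketch. Treating accuracy as a fraction rather than a percentage is an assumption (it rescales every ratio equally and does not change the ranking), and the variable names are illustrative.

```python
# (accuracy as a fraction, F1, cost in $ per 1000 cases), copied from the results above
DOC_PROCESSING = {
    "Claude Sonnet 4.6 (cached)": (0.942, 0.91, 0.42),
    "Gemini 2.5 Pro": (0.918, 0.88, 0.51),
    "GPT-4o": (0.925, 0.89, 0.78),
    "Claude Opus 4.7": (0.961, 0.94, 1.85),
    "Claude Haiku 4.5": (0.874, 0.83, 0.09),
}


def quality_cost_ratio(accuracy, f1, cost_per_1000):
    """(accuracy x F1) / cost per 1000 cases, as defined in this section."""
    return accuracy * f1 / cost_per_1000


# Highest ratio first
ranked = sorted(
    DOC_PROCESSING,
    key=lambda model: quality_cost_ratio(*DOC_PROCESSING[model]),
    reverse=True,
)

for model in ranked:
    print(f"{model}: {quality_cost_ratio(*DOC_PROCESSING[model]):.2f}")

# Cost multiple of Opus over cached Sonnet
opus_vs_sonnet = (
    DOC_PROCESSING["Claude Opus 4.7"][2]
    / DOC_PROCESSING["Claude Sonnet 4.6 (cached)"][2]
)
```

On these figures the ratio is dominated by cost: Haiku 4.5 scores highest despite having the lowest accuracy, and Opus 4.7 scores lowest despite having the highest. The ratio is therefore best read alongside absolute accuracy rather than on its own.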

Results: customer service (intent + drafting)

Claude Sonnet 4.6 (cached): intent accuracy 96.1%, drafting CSAT 4.6/5, latency p95 1280ms, $0.18 / 1000 cases.
GPT-4o: intent accuracy 95.8%, drafting CSAT 4.5/5, latency p95 1180ms, $0.38 / 1000 cases.
Gemini 2.5 Flash: intent accuracy 92.4%, drafting CSAT 4.3/5, latency p95 720ms, $0.07 / 1000 cases.

Verdict: use Gemini Flash for intent classification on routine support and Claude Sonnet for the drafting layer. The combined router pattern reduces cost by ~70% vs single-model Sonnet.
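The router pattern in the verdict can be sketched as a two-stage pipeline: a cheap model classifies the intent, and a stronger model drafts the reply. The Python below is a hypothetical illustration only. `handle_ticket`, `SupportReply`, and the two stand-in callables are invented for this example and make no real API calls; in practice each callable would wrap a request to the chosen provider.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SupportReply:
    intent: str
    draft: str


def handle_ticket(
    message: str,
    classify_intent: Callable[[str], str],
    draft_reply: Callable[[str, str], str],
) -> SupportReply:
    """Stage 1: cheap classifier picks the intent. Stage 2: stronger model drafts."""
    intent = classify_intent(message)
    draft = draft_reply(message, intent)
    return SupportReply(intent=intent, draft=draft)


# Stand-ins for the two models (hypothetical, no network calls)
def fake_classifier(message: str) -> str:
    return "refund_request" if "refund" in message.lower() else "general_enquiry"


def fake_drafter(message: str, intent: str) -> str:
    return f"[{intent}] Thanks for getting in touch. We are looking into this."


reply = handle_ticket("I would like a refund for my order", fake_classifier, fake_drafter)
```

Passing the models in as callables keeps the pipeline independent of any one provider, so the classifier or drafter can be swapped without touching the routing logic.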

Results: compliance review

Claude Opus 4.7: accuracy 95.6%, false-positive 4.1%, latency p95 3120ms, $4.20 / 1000 cases.
Claude Sonnet 4.6 (cached): accuracy 91.8%, false-positive 6.8%, latency p95 1940ms, $0.62 / 1000 cases.
GPT-4 Turbo: accuracy 93.1%, false-positive 5.9%, latency p95 2180ms, $1.45 / 1000 cases.

Verdict: compliance review is where Opus earns its premium. The 4.1% false-positive rate vs Sonnet's 6.8% is a relative reduction of about 40% ((6.8 − 4.1) / 6.8), which translates to ~40% less reviewer time spent on false flags and more than offsets Opus's cost premium for this workflow.
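The arithmetic behind this verdict can be laid out in a few lines of Python. Two assumptions are not stated in the results: the false-positive rate is taken as a share of all reviewed cases, and reviewer time is taken to scale linearly with the number of false flags.

```python
CASES = 1000

# Figures from the compliance review results above
OPUS = {"false_positive_rate": 0.041, "cost_per_1000": 4.20}
SONNET = {"false_positive_rate": 0.068, "cost_per_1000": 0.62}

# Relative drop in false positives when moving from Sonnet to Opus
relative_reduction = 1 - OPUS["false_positive_rate"] / SONNET["false_positive_rate"]

# Absolute number of false flags avoided per 1000 cases (assumes rate is per case)
fewer_false_flags = (
    SONNET["false_positive_rate"] - OPUS["false_positive_rate"]
) * CASES

# Extra model spend per 1000 cases
extra_cost = OPUS["cost_per_1000"] - SONNET["cost_per_1000"]

# Reviewer cost per false flag at which the Opus premium pays for itself
break_even_cost_per_flag = extra_cost / fewer_false_flags

print(f"relative reduction: {relative_reduction:.1%}")
print(f"false flags avoided per {CASES} cases: {fewer_false_flags:.0f}")
print(f"break-even reviewer cost per false flag: ${break_even_cost_per_flag:.2f}")
```

Under these assumptions Opus avoids about 27 false flags per 1000 cases for an extra $3.58, so it comes out ahead whenever clearing one false flag costs more than about $0.13 of reviewer time.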

Caching impact (Anthropic)

Anthropic prompt caching had the largest single cost-reduction impact in our benchmarks. With a 70% cache hit rate (typical for production system prompts of 4-12k tokens):

Sonnet input cost reduction: 70% × $0.30 (cached) + 30% × $3.00 (uncached) = $1.11/MTok blended, vs $3.00 uncached. 63% savings.
Haiku input cost reduction: 70% × $0.08 (cached) + 30% × $0.80 (uncached) = $0.296/MTok blended (≈ $0.30), vs $0.80 uncached. 63% savings.
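The blended-cost calculation generalises to any hit rate. The sketch below reproduces the two figures above; the function names are illustrative, and it deliberately ignores any surcharge for writing to the cache, so real savings may be somewhat lower.

```python
def blended_input_cost(hit_rate, cached_price, uncached_price):
    """Average input price per MTok for a given cache hit rate (cache writes ignored)."""
    return hit_rate * cached_price + (1 - hit_rate) * uncached_price


def savings(blended_price, uncached_price):
    """Fractional saving relative to paying the uncached price on every token."""
    return 1 - blended_price / uncached_price


# Prices in $/MTok and the 70% hit rate are taken from the text above
sonnet_blended = blended_input_cost(0.70, 0.30, 3.00)
haiku_blended = blended_input_cost(0.70, 0.08, 0.80)

print(f"Sonnet: ${sonnet_blended:.2f}/MTok, saving {savings(sonnet_blended, 3.00):.0%}")
print(f"Haiku:  ${haiku_blended:.3f}/MTok, saving {savings(haiku_blended, 0.80):.0%}")
```

Both models save the same fraction, 63%, because in each case the cached price is one tenth of the uncached price. With that ratio the saving is 0.9 × hit rate.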

What we use in production

Across our active engagements in 2026 Q3:

Default workflow model: Claude Sonnet 4.6 with caching
Classification + routing: Claude Haiku 4.5 or Gemini Flash
Compliance review + hard cases: Claude Opus 4.7
Vision + tool-use heavy: GPT-4o

We use the claude-multi-model-router for cost-optimised routing across these. For how this model mix sits inside a full production workflow, see our AI-native workflow automation guide.
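The model mix listed above amounts to a lookup from workload type to model. The Python below is a hypothetical illustration of that mapping. It does not show the interface of claude-multi-model-router, which is not described here; the workload keys are invented, and the values are the display names used in this article rather than API model identifiers.

```python
# Workload type -> candidate models, in the order listed in this section.
MODEL_MIX = {
    "default": ("Claude Sonnet 4.6 (cached)",),
    "classification_routing": ("Claude Haiku 4.5", "Gemini Flash"),
    "compliance_hard_cases": ("Claude Opus 4.7",),
    "vision_tool_use": ("GPT-4o",),
}


def pick_models(workload: str) -> tuple[str, ...]:
    """Return the candidate models for a workload, falling back to the default."""
    return MODEL_MIX.get(workload, MODEL_MIX["default"])


print(pick_models("compliance_hard_cases"))
print(pick_models("something_unlisted"))
```

Falling back to the default for unknown workload types mirrors the "default workflow model" role that Sonnet plays in the list.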

Next benchmark (Q4 2026)

The Q4 2026 benchmark will add emerging open-weight models (Llama 4, Mistral Large 3), specialised reasoning models, and multimodal-heavy workflows. Subscribe to our updates or check back in October 2026.

LLM Production Benchmark Q3 2026