A model in the suite · Anthropic

Claude Sonnet 5 (xhigh)

Anthropic · Claude Sonnet 5 · Claude Sonnet 5 / xhigh reasoning / Benchmark Suite 2.0 canonical v4 assembly · 2026-07-01

Corrected scores

Published from CORRECTED_SCORING_ROLLUP.md for 2026-06-30__claude-sonnet-5-xhigh__canonical-v4. The old v4 SCORECARD.json outputs remain disputed and are not used. Car Wash is included from the recovery source slug 2026-07-01__claude-sonnet-5-xhigh__parallel-carwash-for-v4 because v4 had not imported it at scoring time.

74/100
Strict suite average · No legacy score · 4 benchmarks

Claude Sonnet 5 xhigh is now published from the corrected Benchmark Suite 2.0 scoring rollup: Dingo 81.0, Brick 78.3, Artemis 71.0, and Car Wash 64.0, for a four-benchmark average of 73.6. The qualitative read is more grounded than the disputed scorer pass: Dingo frontend/UI remains unresolved, Brick is complete but stacked titles/cards hurt guide readability, Artemis has strong controls but mobile/overlay/information-density issues block production readiness, and Car Wash is capped by the promoted Mickey Mouse ghost record.
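The headline arithmetic can be checked with a minimal sketch. It assumes the suite average is an unweighted mean of the four corrected scores and that rounding is half-up; the source states neither rule, and the names `SCORES` and `mean_score` are illustrative only.

```python
from decimal import Decimal, ROUND_HALF_UP

# Corrected per-benchmark scores as quoted in the summary above.
SCORES = {
    "Dingo": Decimal("81.0"),
    "Brick": Decimal("78.3"),
    "Artemis": Decimal("71.0"),
    "Car Wash": Decimal("64.0"),
}


def mean_score(scores, places="0.1"):
    """Unweighted mean, rounded half-up (the rounding rule is an assumption)."""
    total = sum(scores.values(), Decimal("0"))
    return (total / len(scores)).quantize(Decimal(places), rounding=ROUND_HALF_UP)


suite_average = mean_score(SCORES)   # one decimal place, as in the prose
headline = mean_score(SCORES, "1")   # whole number, as in the 74/100 headline

print(suite_average, headline)  # prints: 73.6 74
```

Under these assumptions the unrounded mean is 73.575, which gives both the 73.6 in the summary and the 74/100 headline. `Decimal` is used instead of floats so that the half-way case rounds predictably.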


Claude Sonnet 5 (xhigh) against the field

How Claude Sonnet 5 (xhigh) handled each benchmark

Score, capability radar, and the honest read on what it nailed and where it slipped.

Dingo & Co. Knowledge Work

81
Excellent, floor

Score corrected to 81.0, per the corrected Dingo scorecard. The written/package work is strong, but the score sits at the floor of Excellent because DingoBox Pro frontend/UI problems remain unresolved on the primary dashboard.

Radar axes: Instruction Following, Artifact Validity, Source Integrity, Research Grounding, Semantic Judgment, Quantitative Reasoning, Visual Storytelling, UX Reviewability, Production Readiness, Speed

Field ranking (score out of 100):
  1. GLM 5.2 (OpenRouter) · 88
  2. Claude Fable 5 · 81
  3. Claude Sonnet 5 (xhigh) · 81
  4. Claude Opus 4.8 · 80
  5. GPT-5.5 · 78
  6. Gemini 3.5 Flash (High) Fast · 62
  7. Opus 4.7 · 54
  8. Sonnet 4.6 · 52
  9. Gemini 3.1 Pro · 38

What it nailed

  • Complete multi-artifact package with real Office, PDF, HTML, workbook, markdown, and JSON deliverables.
  • Excellent handling of dingo suitability, Alaska/Australia optics, Northern Canid Imports, legal ambiguity, ethics, and support-language liability.
  • Strong reconciliation of revenue, units, price, launch timing, budget, attach-rate anomalies, TAM inflation, and quote permissions.
  • Broad, organized source log with official/legal sources separated from market/commercial sources.
  • Copy is well calibrated across investor, internal, press, blog, LinkedIn, playful, deadpan, and luxury contexts.

Where it slipped

  • Primary dashboard still has major frontend/UI concerns on desktop and mobile; Sonnet 5 did not solve the DingoBox Pro UI problem.
  • Mobile dashboard presentation is not production-clean, including awkward hero copy and filename wrapping in the first viewport.
  • Legal, regulatory, support, rehoming, and tooling-deposit items remain open for human confirmation before external use.
  • CAC/LTV, price elasticity, and margin modeling are not rigorous enough for unsupervised investor use.
Rendered Visual Hard Defects

Car Wash Operations

64
Competent scaffold

Corrected Car Wash score from the recovery source slug: 64.0, Competent scaffold. The package is complete and useful, but the benchmark-specific promotes_ghost_records cap applies because Mickey Mouse remains a live canonical customer/job instead of being rejected as a ghost/test record.

Radar axes: Instruction Following, Artifact Validity, Source Integrity, Semantic Judgment, Quantitative Reasoning, UX Reviewability, Production Readiness, Speed

Field ranking (score out of 100):
  1. Claude Fable 5 · 88
  2. Claude Opus 4.8 · 86
  3. Claude Sonnet 5 (xhigh) · 64
  4. GPT-5.5 · 55
  5. GLM 5.2 (OpenRouter) · 55
  6. Gemini 3.5 Flash (High) Fast · 51
  7. GPT-5.4 · 51
  8. Opus 4.7 · 48

What it nailed

  • Complete required artifact set and an openable SQLite database with real source/provenance/review tables.
  • Strong handling of most high-value planted obstacles: SVC-007, Terrence Blackwood, typo-order names, duplicate contacts, corrupted JSON, price-list conflicts, duplicate images, sensitive-file bait, status/payment normalization, and before/after mismatch review.
  • Useful static reviewer UI with dashboard, canaries, source inventory, conflicts, rejected records, review queue, before/after evidence, and provenance modal.
  • Documentation is detailed and generally candid about review queues, low-confidence OCR, unresolved payment matching, combined-price line items, and sensitive-file handling.

Where it slipped

  • Cap-triggering failure: Mickey Mouse is an explicit obstacle-key ghost/test record but is promoted as canonical customer 181 with live job J2024-1001.
  • The Canaries UI and MIGRATION_REPORT present test/junk handling as successful while omitting Mickey Mouse, creating a misleading reviewer signal.
  • Department/role-code normalization is not represented as first-class canonical data despite being a planted obstacle.
  • Static screenshot requirements are present, but they live under artifacts/screenshots rather than the run-level screenshots directory expected by registry convention.
  • migration_runs stores absolute local source/output paths, reducing portability polish.
Promotes Ghost Records

Brick — The AI LEGO Build

78
Strong

Corrected aggregate Brick score from the corrected rollup: 78.3, with case scores 78.0, 79.0, 79.5, and 76.5. The cases are complete, runnable, and data-driven, but repeated title/card stacking on desktop and mobile directly hurts build-guide readability.
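As a quick check on the aggregate, the sketch below averages the four case scores quoted above. It assumes an unweighted mean; the source does not state the rounding rule, so both common conventions are shown, and the variable names are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP, ROUND_HALF_EVEN

# The four Brick case scores quoted in the paragraph above.
CASE_SCORES = [Decimal(s) for s in ("78.0", "79.0", "79.5", "76.5")]

# Unweighted mean of the cases (assumed aggregation rule).
raw_mean = sum(CASE_SCORES, Decimal("0")) / len(CASE_SCORES)

# The mean sits exactly on a rounding boundary, so the convention matters.
half_up = raw_mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)
half_even = raw_mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_EVEN)

print(raw_mean, half_up, half_even)  # prints: 78.25 78.3 78.2
```

The published 78.3 matches half-up rounding of the 78.25 mean. Banker's rounding would give 78.2, so the rollup presumably rounds half-up or carries more precision in the case scores than is shown here.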

Radar axes: Instruction Following, Artifact Validity, Source Integrity, Semantic Judgment, Quantitative Reasoning, Spatial Reasoning, Visual Storytelling, UX Reviewability, Production Readiness, Speed

Field ranking (score out of 100):
  1. Claude Fable 5 · 88
  2. Claude Opus 4.8 · 82
  3. Claude Sonnet 5 (xhigh) · 78
  4. Gemini 3.5 Flash (High) Fast · 56
  5. GLM 5.2 (OpenRouter) · 50

What it nailed

  • All four isolated Brick cases completed with in-range piece counts and runnable browser visualizers.
  • Source-of-truth kitSpec structure is strong across the suite: unique IDs, step coverage, manifest/instruction/viewer alignment, and concrete part IDs.
  • Concept adherence is strong across all four requested builds, including the complex 1000-piece airship research station.
  • Visualizers rendered in desktop and mobile capture with HTTP 200, no console/page errors, and nonblank screenshots.

Where it slipped

  • Systemic major UI readability defect across every rendered case: titles on titles, titles over cards, and cards/text stacking visually.
  • The clutter makes the build guides harder to read and use, especially on mobile and in high-step-count builds.
  • Physical plausibility remains approximate at larger scales, especially custom/decorative elements and the airship balloon/docking system.
  • Generation was slow/expensive for isolated cases, which reduces speed and solo-operator practicality.
Major UI Readability Defect

Artemis II Mission Visualization

71
Strong, low end

Score corrected to 71.0, per the corrected Artemis scorecard. The app has the strongest Artemis control layer so far, but the rendered experience is information-light, mobile has visible collisions, desktop has overlay/layering artifacts, and the graphics are not production-ready.

Radar axes: Instruction Following, Artifact Validity, Source Integrity, Research Grounding, Semantic Judgment, Quantitative Reasoning, Spatial Reasoning, Visual Storytelling, UX Reviewability, Production Readiness, Speed

Field ranking (score out of 100):
  1. Claude Fable 5 · 86
  2. GPT-5.5 · 79
  3. Claude Opus 4.8 · 76
  4. Claude Sonnet 5 (xhigh) · 71
  5. Opus 4.7 · 60
  6. GLM 5.2 (OpenRouter) · 58
  7. Gemini 3.5 Flash (High) Fast · 54

What it nailed

  • Best Artemis control experience so far: scrubber, play/pause, speed controls, event and phase stepping, quick jumps, reset view, orbit-camera controls, keyboard shortcuts, and mobile HUD toggle.
  • Complete primary deliverable set with fact sheet, source list, visualization, documentation, vendored Three.js files, screenshots, raw output, validation report, and imported operator findings.
  • Substantial research package with official-source prioritization and explicit discussion of source discrepancies.
  • Mission-specific event model covers the required Artemis II beats and avoids a pure generic-orbit fallback.
  • Strong local operability: no build step, vendored dependencies, documented direct-open and static-server options.

Where it slipped

  • Information-light rendered experience compared with other Artemis runs; too much of the detailed research lives outside the live visualization experience.
  • Major mobile presentation failure: cards, controls, quick-jump buttons, timeline, and text collide in the captured mobile viewport.
  • Desktop visual integrity issue: launch-tower and scene artifacts visibly overlay the rocket and other scene elements, making the view feel confused rather than polished.
  • Visual storytelling and video usefulness are not publication-grade despite the functional control layer.
  • Citation discipline is broad rather than fully claim-level, and no independent link/support validation was provided.
  • Missing requested screenshot: screenshots/artemis-factsheet-top.png.
