A model in the suite · Anthropic

Claude Fable 5

Anthropic · Claude Fable · Fable 5 / 1M context / extra-high effort / Suite 2.0 API harness · 2026-06-10

86/100
Strict suite average · No legacy score · 4 benchmarks

Claude Fable 5 sets the suite's new high-water mark at 85.8 and is the first model to clear 80 on all four artifact-production benchmarks. The headline is operator discipline: on Car Wash (88) it migrated 465 messy files into a provenance-tracked database with the best reviewer UI the suite has produced, quarantined every planted canary, and inventoried the bait credentials without leaking them. Brick (88) delivered four count-correct, animated, buildable kits; Artemis (86) researched the real, just-flown mission with genuine NASA sourcing, a score that survived a judge knowledge-cutoff challenge. Two things keep the score from going higher. The first is visual taste under pressure: chart spacing, default-state control contrast, and deck polish still trail Opus 4.7's bar, which cost it on Dingo (81). The second is speed: extra-high effort runs long. Verdict: the strongest end-to-end knowledge-work operator tested so far, with room left in design craft.
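The 85.8 and 86/100 figures are consistent with a plain mean of the four benchmark scores reported on this page. A minimal check, assuming equal weighting and half-up rounding (the suite's actual aggregation rule is not stated here):

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-benchmark scores as reported on this page.
scores = {"Dingo & Co.": 81, "Car Wash": 88, "Brick": 88, "Artemis II": 86}

# Assumption: the strict suite average is an unweighted mean of the four scores.
mean = Decimal(sum(scores.values())) / Decimal(len(scores))

one_decimal = mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)
headline = mean.quantize(Decimal("1"), rounding=ROUND_HALF_UP)

print(one_decimal, headline)  # 85.8 86
```

Decimal arithmetic is used so the tie at 85.75 rounds the same way on every platform.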


Claude Fable 5 against the field

How Claude Fable 5 handled each benchmark

Score, capability radar, and the honest read on what it nailed and where it slipped.

Dingo & Co. Knowledge Work

A 23-deliverable consulting brief: research, financial reconciliation, regulatory analysis, decks and spreadsheets. Tests whether a model can run an entire knowledge-work engagement end to end.

81
Excellent

Recalibrated from the prior visual-audit score of 78.4 to 81.0. The package remains substantively impressive: complete, well-researched, legally cautious, and strategically strong on the dingo/import absurdities. The rendered visuals still fail professional frontend standards in places, with overflowing text, clipped headings, chart-label collisions, cut-off captions, disconnected funnel graphics, and poor typography/spacing; therefore the visual_storytelling and ux_reviewability signals remain capped below 60. Under the operator’s cross-model calibration, however, these defects are treated as systemic but non-blocking and comparable to the Opus 4.8 funnel-visual case, so the strict-score visual deduction from the pre-audit 83.0 is about 2 points total.

Radar axes: Instr. Following · Artifact Validity · Source Integrity · Research Grounding · Semantic Judgment · Quant. Reas. · Visual Storytelling · UX Reviewability · Prod. Readiness · Speed

Ranking:
  1. GLM 5.2 (OpenRouter): 88
  2. Claude Fable 5: 81
  3. Claude Opus 4.8: 80
  4. GPT-5.5: 78
  5. Gemini 3.5 Flash (High) Fast: 62
  6. Opus 4.7: 54
  7. Sonnet 4.6: 52
  8. Gemini 3.1 Pro: 38

What it nailed

  • Completed the full artifact set with correct filenames and real business-document formats.
  • Handled the benchmark’s central absurdities and legal/ethical traps with unusually strong judgment.
  • Produced a robust assumptions file and source log that separate official sources, secondary sources, internal estimates, and fictional competitors.
  • Used provided image assets and generated real visual artifacts rather than text-only stand-ins, even though the rendered polish is flawed.
  • Strong GTM and investor-facing strategy with staged budget gates, channel rules, support-language controls, and NCI risk mitigation.

Where it slipped

  • Rendered deck/dashboard visuals contain hard defects: overflowing funnel text, a clipped Executive summary heading, labels touching chart elements, cut-off caption text, disconnected/misaligned funnel graphics, and poor typography/spacing.
  • Material price inconsistency: workbook/deck state a $749 hard floor, while email/GTM material offers a $699 lapsed-owner price.
  • TAM math is internally inconsistent; stated $45M-$85M and $60M+ conclusions do not follow from the product-fit formulas shown in the workbook.
  • Some public-facing copy includes unsupported or risky factual claims, especially 'first ten thousand support tickets' and beta-use claims.
  • A few research claims rely on secondary or commercial sources where primary verification would be preferable.
  • Some customer/beta quote usage is not clearly traceable to permissioned evidence despite the package’s stated quote policy.
Flags: Material Number Inconsistency · Rendered Visual Hard Defects
Wall clock 28m 23s
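The $749 versus $699 mismatch listed above is the kind of defect a cross-document check can catch before delivery. A minimal sketch, assuming the prices have already been extracted from each deliverable; the `PriceMention` type and `below_floor` helper are hypothetical names, not part of the evaluated package:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriceMention:
    document: str  # deliverable where the price appears
    label: str     # how the deliverable describes the price
    usd: int


def below_floor(mentions, floor_usd):
    """Return every mention priced under the stated hard floor."""
    return [m for m in mentions if m.usd < floor_usd]


HARD_FLOOR_USD = 749  # stated in the workbook and deck
mentions = [
    PriceMention("workbook/deck", "hard floor", 749),
    PriceMention("email/GTM", "lapsed-owner price", 699),
]

violations = below_floor(mentions, HARD_FLOOR_USD)
for v in violations:
    print(f"{v.document}: {v.label} ${v.usd} is under the ${HARD_FLOOR_USD} floor")
```

A flagged mention is not automatically wrong; it may be a deliberate exception that the assumptions file needs to document.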


Car Wash Operations

A filthy operational dataset — ghost records, orphaned orders, typo'd customers, raw enum variants. Tests judgment under messy real-world data: what gets fixed, quarantined, or wrongly promoted.

88
Excellent

Excellent migration package: robust database, provenance, review workflow, canary handling, documentation, and a standout visual reviewer UI. It falls short of near-mastery mainly because image-only handwritten receipt obstacles were not structurally extracted and a few business-semantics/entity-resolution gaps remain.

Radar axes: Instr. Following · Artifact Validity · Source Integrity · Semantic Judgment · Quant. Reas. · UX Reviewability · Prod. Readiness · Speed

Ranking:
  1. Claude Fable 5: 88
  2. Claude Opus 4.8: 86
  3. GPT-5.5: 55
  4. GLM 5.2 (OpenRouter): 55
  5. Gemini 3.5 Flash (High) Fast: 51
  6. GPT-5.4: 51
  7. Opus 4.7: 48

What it nailed

  • Comprehensive, auditable SQLite package with evidence, canonical, provenance, conflict, reject, and review layers.
  • Strong handling of core messy-data canaries: ghost/test quarantine, corrupt JSON recovery, duplicate-image detection, price-era conflicts, status/payment normalization, and typo/nickname aliases.
  • Excellent reviewer-facing UI; operator visual review found it top-of-suite with no visual defects.
  • Good security posture: sensitive credential/payroll bait inventoried but not leaked.
  • Deterministic rebuild story with static, inspectable artifacts and screenshots.
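The layered design described above (evidence, canonical, provenance, reject) can be illustrated with a small in-memory SQLite database. This is a sketch under stated assumptions: the table names, columns, file names, and the "TEST" quarantine rule are all invented for illustration and are not taken from the evaluated package.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE evidence (            -- raw rows, never edited after load
    id INTEGER PRIMARY KEY,
    source_file TEXT NOT NULL,
    raw_name TEXT NOT NULL
);
CREATE TABLE canonical_customer (  -- cleaned rows that passed review rules
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE provenance (          -- which evidence backs which canonical row
    canonical_id INTEGER NOT NULL REFERENCES canonical_customer(id),
    evidence_id INTEGER NOT NULL REFERENCES evidence(id),
    rule TEXT NOT NULL
);
CREATE TABLE reject (              -- quarantined rows, each with a reason
    evidence_id INTEGER NOT NULL REFERENCES evidence(id),
    reason TEXT NOT NULL
);
""")

# Hypothetical input rows; the file names are placeholders.
rows = [("customers_a.csv", "Amber Bilbow"), ("customers_b.csv", "TEST TEST")]
conn.executemany("INSERT INTO evidence (source_file, raw_name) VALUES (?, ?)", rows)

for ev_id, raw_name in conn.execute("SELECT id, raw_name FROM evidence").fetchall():
    if raw_name.upper().startswith("TEST"):
        # Quarantine instead of promoting; the raw row stays in evidence.
        conn.execute("INSERT INTO reject VALUES (?, ?)", (ev_id, "test/ghost record"))
    else:
        cur = conn.execute("INSERT INTO canonical_customer (name) VALUES (?)", (raw_name,))
        conn.execute("INSERT INTO provenance VALUES (?, ?, ?)",
                     (cur.lastrowid, ev_id, "direct copy"))

promoted = conn.execute("SELECT name FROM canonical_customer").fetchall()
quarantined = conn.execute("SELECT COUNT(*) FROM reject").fetchone()[0]
print(promoted, quarantined)  # [('Amber Bilbow',)] 1
```

Keeping the raw evidence immutable and recording a rule for every promotion is what makes the result auditable: a reviewer can trace any canonical row back to its source file.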

Where it slipped

  • Handwritten PNG receipts/images were inventoried and deduplicated but not OCR-transcribed into structured records, leaving several planted image-only business issues for manual review.
  • Department/role-code normalization is not visible in the provided DB schema excerpts.
  • Terrence Blackwood orphan handling is not directly evidenced in the supplied excerpts, though no cap-triggering silent promotion is shown.
  • Some residual entity-resolution misses are visible, such as separate Amber and Amber Bilbow records with matching contact information.
  • The DB metadata stores an absolute local source path, which slightly reduces portability polish.
Wall clock 29m 44s
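The residual entity-resolution miss noted above (separate Amber and Amber Bilbow records with matching contact information) could be surfaced by grouping records on normalized contact details. A minimal sketch: the record layout, ids, emails, and phone numbers are invented placeholders, and only the two customer names come from the scorecard.

```python
from collections import defaultdict


def normalize_phone(phone: str) -> str:
    """Keep digits only so formatting differences do not hide a match."""
    return "".join(ch for ch in phone if ch.isdigit())


def merge_candidates(records):
    """Group record ids that share a normalized (email, phone) pair."""
    groups = defaultdict(list)
    for rec in records:
        key = (rec["email"].strip().lower(), normalize_phone(rec["phone"]))
        groups[key].append(rec["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


# Contact values below are placeholders, not data from the benchmark.
records = [
    {"id": 1, "name": "Amber", "email": "amber@example.com", "phone": "555-0100"},
    {"id": 2, "name": "Amber Bilbow", "email": "Amber@Example.com ", "phone": "(555) 0100"},
    {"id": 3, "name": "Placeholder Customer", "email": "other@example.com", "phone": "555-0101"},
]

print(merge_candidates(records))  # [[1, 2]]
```

Candidates found this way are best routed to the review layer rather than merged automatically, since shared contact details can also belong to a household or a business.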


Brick — The AI LEGO Build

88
Excellent

Excellent full-suite Brick-the-AI result with unusually strong data integrity and source-of-truth architecture. All four builds are present, count-correct, animated, chaptered, and generated from structured kit specs. After calibration, the systemic invisible-controls contrast defect is treated as a roughly 2-point strict-score deduction rather than a larger penalty; the remaining score limitations are unresolved physical-plausibility caveats in the larger decorative/off-grid builds, especially the airship.

Radar axes: Instr. Following · Artifact Validity · Source Integrity · Semantic Judgment · Quant. Reas. · Spatial Reas. · Visual Storytelling · UX Reviewability · Prod. Readiness · Speed

Ranking:
  1. Claude Fable 5: 88
  2. Claude Opus 4.8: 82
  3. Gemini 3.5 Flash (High) Fast: 56
  4. GLM 5.2 (OpenRouter): 50

What it nailed

  • Completed the full four-prompt suite rather than a partial benchmark run.
  • Strong single-source-of-truth architecture: structured kitSpec drives parts, steps, BOM, model, and animation.
  • Piece counts and step counts are within the requested ranges for all four builds.
  • Clear chaptered build sequencing with concrete part IDs and on-screen instructions.
  • Ambitious visual features: animated assembly, highlighted new pieces, scrubber, hero view, orbit camera, spinning rotors, liftable/removable sections, and mobile-responsive layout.
  • Generator and smoke-test artifacts provide useful provenance and validation evidence.
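The single-source-of-truth idea above means the bill of materials and step count are derived from one spec instead of being maintained by hand. A minimal sketch, assuming a spec shaped as chapters of steps, each step listing the part IDs it places; this structure, the part names, and the ranges are hypothetical, since the real kitSpec format is not shown on this page.

```python
from collections import Counter

# Hypothetical kit spec: chapters contain steps, each step lists placed part IDs.
kit_spec = {
    "name": "demo-kit",
    "chapters": [
        {"title": "Base", "steps": [["plate_2x4", "plate_2x4"], ["brick_1x2"]]},
        {"title": "Body", "steps": [["brick_1x2", "brick_2x2"], ["slope_1x2"]]},
    ],
}


def flatten_steps(spec):
    """All build steps in order, across chapters."""
    return [step for chapter in spec["chapters"] for step in chapter["steps"]]


def bill_of_materials(spec):
    """Part counts derived from the steps, so the BOM cannot drift from them."""
    return Counter(part for step in flatten_steps(spec) for part in step)


def check_counts(spec, piece_range, step_range):
    """True when total pieces and steps both fall inside the requested ranges."""
    pieces = sum(bill_of_materials(spec).values())
    steps = len(flatten_steps(spec))
    return (piece_range[0] <= pieces <= piece_range[1]
            and step_range[0] <= steps <= step_range[1])


bom = bill_of_materials(kit_spec)
print(dict(bom))
print(check_counts(kit_spec, piece_range=(5, 10), step_range=(3, 6)))  # True
```

Because every downstream artifact is computed from `kit_spec`, a count mismatch between the parts list and the instructions cannot occur by construction.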

Where it slipped

  • Systemic default-state contrast failure makes top-right controls effectively invisible until interaction across the shared UI style.
  • Large-scale physical plausibility is not fully proven, especially the airship's off-grid fabric envelope and suspended/docked final pose.
  • Many specialty elements are arbitrary generated part types rather than a tightly constrained brick vocabulary.
  • No independent collision, clutch-strength, or structural support validation is provided.
  • Requires loading Three.js from a CDN, so the guides are not fully offline-local.
  • Run was very slow and expensive: 29 API calls and about 3661 seconds wall-clock.
Wall clock 1h 1m 0s
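The default-state contrast failure listed above is measurable with the WCAG contrast-ratio formula. The sketch below implements that formula; the two control colors are invented placeholders (the actual colors are not given here), and the 3:1 threshold is the WCAG 2.1 minimum for user-interface components.

```python
def _linear(channel_8bit: int) -> float:
    """Convert an 8-bit sRGB channel to linear light."""
    c = channel_8bit / 255
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """WCAG relative luminance of a color given as '#rrggbb'."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linear(r) + 0.7152 * _linear(g) + 0.0722 * _linear(b)


def contrast_ratio(fg: str, bg: str) -> float:
    """Contrast ratio between two colors, from 1 (identical) to 21 (black on white)."""
    lighter, darker = sorted(
        (relative_luminance(fg), relative_luminance(bg)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)


# Placeholder colors for a dark icon on a dark toolbar.
icon, toolbar = "#2a2a2a", "#222222"
ratio = contrast_ratio(icon, toolbar)
print(f"{ratio:.2f}:1", "passes" if ratio >= 3 else "fails 3:1")
```

Running a check like this against every control in its default state, not only the hover or focus state, would catch controls that become visible only on interaction.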


Artemis II Mission Visualization

86
Strong

After operator verification, the prior scorecard's central factual/source failure no longer applies. The Artemis II completed-mission narrative and key NASA post-flight/live-blog citations should be treated as genuine, which substantially raises the research, source integrity, event coverage, and production-readiness assessment. The package is now a strong, complete, technically ambitious mission visualization with credible post-flight grounding. It remains below elite visual quality because of dense HUD/mobile typography and crude low-poly orange markers in some views.

Radar axes: Instr. Following · Artifact Validity · Source Integrity · Research Grounding · Semantic Judgment · Quant. Reas. · Spatial Reas. · Visual Storytelling · UX Reviewability · Prod. Readiness · Speed

Ranking:
  1. Claude Fable 5: 86
  2. GPT-5.5: 79
  3. Claude Opus 4.8: 76
  4. Opus 4.7: 60
  5. GLM 5.2 (OpenRouter): 58
  6. Gemini 3.5 Flash (High) Fast: 54

What it nailed

  • Operator-verified key NASA sources establish that the completed April 1-10, 2026 Artemis II mission narrative is genuine and source-grounded.
  • Complete artifact set with fact sheet, source list, runnable visualization, documentation, vendored Three.js, and screenshots.
  • Technically ambitious interactive visualization with timeline scrubber, phases, event stepping, HUD telemetry, camera controls, and deep links.
  • Mission-specific visual structure covers the major Artemis II beats rather than falling back to a generic orbit scene.
  • Clear packaging and quick-start documentation.

Where it slipped

  • Visual execution is solid but not best-in-suite; the orange low-poly event markers/sun read as crude blobs in the launch close-up.
  • Dense HUD/control text and small mobile typography reduce visual polish and reviewability.
  • The package would be more audit-ready with included link-check logs or source captures for all cited claims.
  • Some highly specific anomaly or programmatic claims would still benefit from careful claim-by-claim citation review before external publication.
Wall clock 23m 52s
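The claim-by-claim citation review suggested above can start with a mechanical pass that finds claims with no source, or with a source id missing from the source list. A minimal sketch: the claim and source records are placeholders, and the `unsourced_claims` helper is a hypothetical name rather than anything shipped in the package.

```python
def unsourced_claims(claims, sources):
    """Return ids of claims that cite nothing, or cite an id not in the source list."""
    known = set(sources)
    return [
        claim["id"]
        for claim in claims
        if not claim["source_ids"] or not set(claim["source_ids"]) <= known
    ]


# Placeholder records; real entries would come from the fact sheet and source list.
sources = {"S1": "source list entry 1", "S2": "source list entry 2"}
claims = [
    {"id": "C1", "text": "placeholder claim with one source", "source_ids": ["S1"]},
    {"id": "C2", "text": "placeholder claim with no source", "source_ids": []},
    {"id": "C3", "text": "placeholder claim citing a missing id", "source_ids": ["S9"]},
]

print(unsourced_claims(claims, sources))  # ['C2', 'C3']
```

This only checks that a citation exists. Confirming that each source actually supports its claim, and that its link still resolves, remains a manual or link-check step.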
