A model in the suite · OpenRouter

GLM 5.2 (OpenRouter)

OpenRouter · Z.ai GLM · z-ai/glm-5.2 / xhigh reasoning / Chat Completions / no-require-parameters · 2026-06-19

63/100
Strict suite average · No legacy score · 4 benchmarks

GLM 5.2 via OpenRouter is the suite's most interesting budget run: one excellent knowledge-work result, three unreliable scaffold-level operator runs, and a total measured plus scored cost of $5.17. It is useful for breadth, scaffolds, and scouting, but not as an unattended final-pass model; OpenRouter added provider-envelope noise, while the missed canaries still count as model-quality signal.
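
The 63/100 headline is consistent with an unweighted mean of the four benchmark scores reported on this page (88, 55, 50, 58), rounded to the nearest integer. The page does not state its aggregation formula, so the sketch below is an assumption, and `suite_average` is a hypothetical name.

```python
# Assumption: the strict suite average is an unweighted mean, rounded to an integer.
scores = {
    "Dingo & Co.": 88,
    "Car Wash": 55,
    "Brick": 50,
    "Artemis II": 58,
}

def suite_average(benchmark_scores):
    """Return the rounded unweighted mean of the benchmark scores."""
    values = list(benchmark_scores.values())
    return round(sum(values) / len(values))

print(suite_average(scores))  # 251 / 4 = 62.75, which rounds to 63
```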

How GLM 5.2 (OpenRouter) handled each benchmark

Score, capability radar, and the honest read on what it nailed and where it slipped.

Dingo & Co. Knowledge Work

A 23-deliverable consulting brief: research, financial reconciliation, regulatory analysis, decks and spreadsheets. Tests whether a model can run an entire knowledge-work engagement end to end.

88
Excellent

An excellent Dingo & Co. run: complete, artifact-rich, visually usable, and unusually strong on the benchmark’s legal/ethical/absurdity traps. It falls short of near-mastery mainly on research citation precision, spreadsheet formula reliability, and final production polish rather than on core reasoning or completion.

Radar axes: Instr. Following · Artifact Validity · Source Integrity · Research Grounding · Semantic Judgment · Quant. Reas. · Visual Storytelling · UX Reviewability · Prod. Readiness · Speed
1. GLM 5.2 (OpenRouter): 88
2. Claude Fable 5: 81
3. Claude Opus 4.8: 80
4. GPT-5.5: 78
5. Gemini 3.5 Flash (High) Fast: 62
6. Opus 4.7: 54
7. Sonnet 4.6: 52
8. Gemini 3.1 Pro: 38

What it nailed

  • Completed the full 23-file package with real Office, PDF, HTML, markdown, and JSON artifacts.
  • Handled the benchmark’s central absurdities with unusually strong judgment: dingo behavior, Alaska/Australia optics, import-created demand, legal uncertainty, ethics, and support liability.
  • Maintained a consistent source-of-truth posture for revenue, launch timing, pricing, budget, runway, imports, TAM, and customer-feedback exclusions.
  • Used provided source imagery in the deck, sales one-pager, dashboard, personas, and copy references rather than inventing inconsistent visuals.
  • Separated real competitors from fictional/scenario competitors and did not treat inquiry counts or curiosity traffic as TAM.
  • Produced a credible board/investor strategy with staged gates and non-evasive answers to uncomfortable questions.

Where it slipped

  • Regulatory research is directionally strong but citation precision is uneven; several official sources are generic agency URLs rather than direct rule pages.
  • Some market-sizing claims are derived or loosely supported and would need stronger verification before investor/public use.
  • The pricing workbook’s Cost_Sensitivity formulas appear to contain column-reference errors, reducing spreadsheet reliability.
  • The public-facing press release still mentions Northern Canid Imports/acquisition support, which may be legally and reputationally sensitive despite the no-CTA posture.
  • The package is visually solid but not elite editorial design; operator review noted the one-pager footer is close to the page edge.
  • The dashboard depends on Chart.js via CDN, so it is not fully self-contained/offline despite being a standalone HTML artifact.
Wall clock 24m 20s
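
The CDN dependency noted above is the kind of issue a reviewer could catch mechanically. The following is a minimal sketch, assuming the dashboard is a single HTML file; `external_scripts` and the sample markup are hypothetical, and the CDN URL is a placeholder rather than the tag the run actually emitted.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class ExternalScriptFinder(HTMLParser):
    """Collect <script src=...> URLs that point at a remote host."""

    def __init__(self):
        super().__init__()
        self.external = []

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        src = dict(attrs).get("src")
        # A network location (including protocol-relative //host) means a remote fetch.
        if src and urlparse(src).netloc:
            self.external.append(src)

def external_scripts(html_text):
    """Return every remote script URL referenced by the HTML text."""
    finder = ExternalScriptFinder()
    finder.feed(html_text)
    return finder.external

# Hypothetical dashboard markup, for illustration only.
sample = (
    '<html><head>'
    '<script src="https://cdn.example.com/chart.js"></script>'
    '<script src="app.js"></script>'
    '</head></html>'
)
print(external_scripts(sample))
```

An empty result would suggest the artifact can load offline as far as scripts are concerned; stylesheets, fonts, and images would need a similar pass.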


Car Wash Operations

A filthy operational dataset — ghost records, orphaned orders, typo'd customers, raw enum variants. Tests judgment under messy real-world data: what gets fixed, quarantined, or wrongly promoted.

55
Interesting but Unreliable

A substantial and inspectable audit package with a working SQLite database, useful documentation, provenance scaffolding, and a solid reviewer UI. However, it is not reliable as a migration result because multiple planted primary canaries failed: one ghost/test record is promoted as active, typo-order names are not merged, the DeShawn SVC-007 conflict and department-code mapping are not demonstrated, and price-era semantics are wrong. The result is useful as a scaffold for human review, but capped at 55 for missed core canaries.

Radar axes: Instr. Following · Artifact Validity · Source Integrity · Semantic Judgment · Quant. Reas. · UX Reviewability · Prod. Readiness · Speed
1. Claude Fable 5: 88
2. Claude Opus 4.8: 86
3. GPT-5.5: 55
4. GLM 5.2 (OpenRouter): 55
5. Gemini 3.5 Flash (High) Fast: 51
6. GPT-5.4: 51
7. Opus 4.7: 48

What it nailed

  • Produced all required artifacts: SQLite DB, migration script, documentation, report, design doc, static UI, and frontend JSON.
  • Strong source inventory foundation with hashes, source-file statuses, and sensitive-file skipping.
  • Working normalized database with populated customers, jobs, services, payments, conflicts, flags, rejected items, source records, and duplicate image groups.
  • Good reviewer UI that passed operator visual QA and mobile usability checks.
  • Correctly handled several important issues: Terrence Blackwood as orphan, corrupted JSON as partial, Mickey/Test Customer rejection, and byte-identical duplicate images.
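
Byte-identical duplicate detection, as credited in the list above, is commonly done by grouping files on a content hash. This is a minimal sketch of that general technique, not the run's actual migration script; SHA-256 is an assumed choice, and the file names and contents are hypothetical.

```python
import hashlib
from collections import defaultdict

def duplicate_groups(files):
    """Group file names whose contents are byte-for-byte identical.

    `files` maps a name to its raw bytes; only groups with 2+ members are returned.
    """
    by_digest = defaultdict(list)
    for name, data in files.items():
        by_digest[hashlib.sha256(data).hexdigest()].append(name)
    return [sorted(names) for names in by_digest.values() if len(names) > 1]

# Hypothetical file names and contents, for illustration only.
files = {
    "receipt_a.jpg": b"\xff\xd8same-bytes",
    "receipt_b.jpg": b"\xff\xd8same-bytes",
    "receipt_c.jpg": b"\xff\xd8other-bytes",
}
print(duplicate_groups(files))
```

Hashing only catches exact copies; re-encoded or resized images would need perceptual matching, which is outside this sketch.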

Where it slipped

  • Primary canary failures cap the score: `Asdf Asdf` promoted as active, typo-order names not merged, SVC-007/The Works conflict not clearly detected, and department code normalization absent.
  • Handwritten receipt images and PDF invoices are mostly metadata-only/manual-review items, not extracted business evidence.
  • Price-era handling is materially wrong for many services; the report and DB show several old prices equal to current prices.
  • Entity-resolution audit tables are empty despite duplicate-merge claims, and multiple partial/typo customer entities remain.
  • Provenance is useful but often file-level rather than exact canonical-record-to-source-record tracing.
Misses Three Or More Primary Canaries · Missed Primary Canaries · Promotes Ghost Records
Wall clock 22m 43s
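
The cap described in this section can be read as a simple clamp: once enough primary canaries are missed, the score cannot exceed a ceiling. The sketch below assumes a threshold of three (taken from the "Misses Three Or More Primary Canaries" flag) and a ceiling of 55 (taken from the summary). The scoring mechanics are not published on this page, so `apply_canary_cap` and the raw score of 70 are hypothetical.

```python
def apply_canary_cap(raw_score, missed_primary_canaries, cap=55, threshold=3):
    """Clamp the score to `cap` once `threshold` or more primary canaries are missed."""
    if missed_primary_canaries >= threshold:
        return min(raw_score, cap)
    return raw_score

# Hypothetical raw score; the page reports only the capped result of 55.
# Four misses matches the four primary canary failures listed above.
print(apply_canary_cap(70, missed_primary_canaries=4))  # 55
print(apply_canary_cap(70, missed_primary_canaries=1))  # 70
```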


Brick — The AI LEGO Build

50
Interesting but Unreliable

The repaired artifacts are an impressive source-of-truth-driven interactive scaffold across all four brick-kit prompts, with exact piece counts, concrete part IDs, manifests, step lists, and functional Three.js assembly controls. However, the run is capped because it was not hands-free runnable: the provider run failed and operator repairs were needed to complete the generation/viewer path. The final visuals are usable and legible, but the physical kit design remains only moderately plausible and becomes procedural and coarse at 500–1000 pieces.

Radar axes: Instr. Following · Artifact Validity · Source Integrity · Semantic Judgment · Quant. Reas. · Spatial Reas. · Visual Storytelling · UX Reviewability · Prod. Readiness · Speed
1. Claude Fable 5: 88
2. Claude Opus 4.8: 82
3. Gemini 3.5 Flash (High) Fast: 56
4. GLM 5.2 (OpenRouter): 50

What it nailed

  • All four requested build levels are represented by final index.html pages after intervention.
  • The artifact design uses a credible single-source-of-truth structure, with kitSpec data driving manifests, steps, and animation.
  • Piece counts are exact in the displayed manifests across 100, 250, 500, and 1000-piece targets.
  • The viewer includes play/pause, previous/next, speed, scrubber, show-complete, orbit, guide, BOM, HUD text, and highlighted additions.
  • Concept silhouettes are recognizable and include many requested thematic features.
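
The exact-piece-count claim above is easy to verify when a single spec drives the manifest. The following Python sketch shows the shape of such a check; it is not the run's JavaScript, and the `kit_spec` layout, field names, and part IDs are all hypothetical.

```python
def manifest_total(kit_spec):
    """Sum part quantities across the manifest derived from the kit spec."""
    return sum(part["qty"] for part in kit_spec["parts"])

def count_matches_target(kit_spec):
    """Return True when the manifest total equals the kit's target piece count."""
    return manifest_total(kit_spec) == kit_spec["target_pieces"]

# Hypothetical kit spec; part IDs and quantities are placeholders.
kit_spec = {
    "target_pieces": 100,
    "parts": [
        {"id": "part-a", "qty": 60},
        {"id": "part-b", "qty": 30},
        {"id": "part-c", "qty": 10},
    ],
}
print(count_matches_target(kit_spec))  # True
```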

Where it slipped

  • The run was operator-assisted after transport/provider failure and cannot be treated as an autonomous completion.
  • The originally generated viewer and generation path could not run locally without repair.
  • Physical buildability is questionable because the models rely on generic primitives, custom part abstractions, and procedural decorative massing.
  • The largest build has an overloaded final detail batch and is not a fine-grained real assembly guide.
  • Visuals are legible but blocky/procedural rather than client-grade designer brick builds.
  • Mobile usability is functional but the HUD takes up much of the small viewport.
Non Runnable Visualizer · Non Runnable Core Artifact
Wall clock 26m 20s


Artemis II Mission Visualization

58
Interesting but Unreliable

A technically impressive and visually usable Artemis II interactive package, but not a trustworthy publication artifact. It completes the requested files and delivers a working 3D timeline with strong mission-beat coverage, yet the factual layer is deeply unreliable: completed-mission and post-flight claims are unsupported by exact sources, many citations are generic, and the visualization contains a material ICPS/TLI inconsistency. The result is an interesting scaffold rather than a source-grounded mission visualization ready for public use.

Radar axes: Instr. Following · Artifact Validity · Source Integrity · Research Grounding · Semantic Judgment · Quant. Reas. · Spatial Reas. · Visual Storytelling · UX Reviewability · Prod. Readiness · Speed
1. Claude Fable 5: 86
2. GPT-5.5: 79
3. Claude Opus 4.8: 76
4. Opus 4.7: 60
5. GLM 5.2 (OpenRouter): 58
6. Gemini 3.5 Flash (High) Fast: 54

What it nailed

  • Complete artifact set with clear documentation and run instructions.
  • Runnable, nonblank Three.js visualization with meaningful controls and phase-driven scene changes.
  • Strong mission-beat coverage across launch, staging, orbit checkout, lunar flyby, return, re-entry, splashdown, and recovery.
  • Visual QA found the desktop and mobile presentations usable, with no hard visual blocking defects.

Where it slipped

  • Severe citation discipline failures: many specific claims are tied to generic or non-specific URLs.
  • Public-facing fact sheet and visualization overclaim completed mission/post-flight events without auditable source support.
  • Material internal inconsistency around ICPS separation and TLI undermines mission-architecture accuracy.
  • Production readiness is limited by factual unreliability despite a strong interactive shell.
Fabricated Or Broken Sources · Publication Fact Errors
Wall clock 12m 54s
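
The generic-citation problem flagged in this section can be screened for with a rough heuristic: a URL that stops at the site root cannot support a specific claim. This is a minimal sketch under that assumption; `is_generic_url` is a hypothetical helper and the URLs are placeholders, not sources from the run.

```python
from urllib.parse import urlparse

def is_generic_url(url):
    """Treat a URL with no path or query beyond the site root as a generic citation."""
    parsed = urlparse(url)
    return parsed.path.strip("/") == "" and not parsed.query

# Placeholder URLs, for illustration only.
citations = [
    "https://agency.example.gov/",
    "https://agency.example.gov/missions/flight-plan.pdf",
]
for url in citations:
    print(url, "generic" if is_generic_url(url) else "specific")
```

A root-only URL is a signal for human review rather than proof of a bad citation, and a deep link can still fail to support the claim attached to it.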
