X-40

Benchmarks

X-40™ is benchmarked against methods, not brands. The key production incident proxy is Wrong+Accepted: incorrect outputs that were accepted and shipped. The complementary automation yield metric is Accepted+Correct: outputs that were both accepted and correct.
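These two metrics can be sketched as simple rates over benchmark runs. The record fields and helper names below are illustrative assumptions, not the X-40™ schema:

```python
from dataclasses import dataclass

# Hypothetical run record; field names are illustrative, not the X-40 schema.
@dataclass
class Run:
    accepted: bool   # governance accepted the output for shipping
    correct: bool    # output matched the pack's ground truth

def wrong_accepted_rate(runs: list) -> float:
    """Incident proxy: fraction of runs that were accepted AND incorrect."""
    return sum(r.accepted and not r.correct for r in runs) / len(runs)

def accepted_correct_rate(runs: list) -> float:
    """Automation yield: fraction of runs that were accepted AND correct."""
    return sum(r.accepted and r.correct for r in runs) / len(runs)

runs = [Run(True, True), Run(True, False), Run(False, True), Run(True, True)]
print(wrong_accepted_rate(runs))    # 0.25
print(accepted_correct_rate(runs))  # 0.5
```

Note the two metrics are not complements of one another: rejected runs count against neither, which is why both must be reported together.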

v2.4 — Market benchmark (GPT-4.1)
Covers the deterministic, facts, unknowns, attack, and math packs. Includes competitor-style baselines (judge and self-consistency) for comparison, and reports both Wrong+Accepted and Accepted+Correct.
v2.6 — GPT-5.2 coverage benchmark
Stress coverage across 5 seeds and 10,600 runs under the published protocol. Worst-case Wrong+Accepted was driven to 0 for the dual_strict and dual_auto_verify modes.
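"Worst-case" across seeds is the maximum per-seed Wrong+Accepted rate, so a single bad seed makes the headline figure nonzero. The seed labels and rates below are hypothetical, not the published v2.6 results:

```python
# Hypothetical per-seed Wrong+Accepted rates (illustrative values only).
per_seed = {
    "seed_0": 0.0,
    "seed_1": 0.002,  # one accepted-and-wrong output in this seed
    "seed_2": 0.0,
    "seed_3": 0.0,
    "seed_4": 0.0,
}

# Worst-case = maximum over seeds: reporting 0 means NO seed produced
# any accepted-and-wrong output, a stronger claim than a 0 average.
worst_case = max(per_seed.values())
print(worst_case)  # 0.002
```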
Download the reproducible artifacts
Benchmarks validate X-40™ under the published protocol and workload packs. Real-world performance depends on prompt classes, provider behavior, and client policy configuration.
Why this matters in production
Teams do not fail because they are "slightly inaccurate." They fail when incorrect outputs are accepted and shipped. X-40™ is built to reduce that incident channel while keeping automation viable.
Dual-evidence governance
X-40™ can combine behavioral trace signals (telemetry) with an independent QEIv15™ evidence channel (structural anchors via ResearchCore) to reduce single-signal failure modes.
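A minimal sketch of a dual-evidence accept gate, assuming two independent boolean verdicts. The function name echoes the dual_strict mode above, but the signature and signal names are illustrative, not the X-40™ or QEIv15™ API:

```python
# Illustrative dual-evidence gate: an output ships only when BOTH the
# behavioral telemetry channel and the independent structural-evidence
# channel agree, so no single-signal failure can ship an output alone.
def dual_strict(telemetry_ok: bool, evidence_ok: bool) -> bool:
    return telemetry_ok and evidence_ok

print(dual_strict(True, True))   # True  -> accept
print(dual_strict(True, False))  # False -> reject (evidence disagrees)
print(dual_strict(False, True))  # False -> reject (telemetry disagrees)
```

The design choice is conservatism: disagreement between channels always resolves to rejection, trading some automation yield for a lower Wrong+Accepted rate.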
Privacy-first runtime posture
X-40™ supports privacy modes that minimize or avoid retention of user content while preserving auditable governance outputs: indices, reasons, and hashes.
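The retained governance outputs named above (indices, reasons, hashes) can be sketched as an audit record that keeps a digest of the user content instead of the content itself. Field names and the reason string are hypothetical:

```python
import hashlib
import json

# Sketch of a privacy-first audit record: the raw user content is hashed
# and discarded; only the run index, decision reason, and digest remain.
# The digest still lets an auditor verify a disputed transcript later by
# re-hashing it, without the record itself retaining any content.
def audit_record(index: int, reason: str, content: str) -> dict:
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return {"index": index, "reason": reason, "sha256": digest}

rec = audit_record(42, "accepted:dual_strict", "user prompt text")
print(json.dumps(rec))  # the raw content never appears in the record
```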