X-40

Benchmarks

X-40™ is benchmarked against methods (not brands). The production-incident proxy is Wrong+Accepted: the rate of incorrect outputs that were accepted and shipped.
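As a minimal sketch of how a Wrong+Accepted rate could be tallied: the record fields (`correct`, `accepted`) and the sample run below are hypothetical illustrations, not part of the published protocol.

```python
from dataclasses import dataclass

@dataclass
class Outcome:
    correct: bool   # was the output actually correct?
    accepted: bool  # was the output accepted and shipped?

def wrong_accepted_rate(outcomes):
    """Fraction of outputs that were both wrong and accepted."""
    if not outcomes:
        return 0.0
    bad = sum(1 for o in outcomes if o.accepted and not o.correct)
    return bad / len(outcomes)

# Hypothetical run: 3 correct+accepted, 1 wrong+accepted, 1 wrong+rejected.
run = [Outcome(True, True), Outcome(True, True), Outcome(True, True),
       Outcome(False, True), Outcome(False, False)]
print(wrong_accepted_rate(run))  # 0.2
```

Note that only wrong outputs that slip through acceptance count toward the metric; wrong outputs that were caught and rejected do not.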

v2.4 — Market benchmark (GPT-4.1)
Covers deterministic, facts, unknowns, attack, and math packs. Includes competitor-style baselines (self-consistency and judge-style approaches) for comparison.
v2.6 — GPT-5.2 coverage benchmark
5 seeds × 1,000 math prompts per seed, plus packs for unknowns and attack behavior. Published artifacts are listed below.
GPT-5.2 v2.6 artifacts:
Summary (MD): latest.md
Notes: Benchmarks validate behavior under the published protocol and do not claim universal guarantees for all prompts/models. Real-world performance depends on workload profiles, baseline calibration, and policy configuration.