X-40

Benchmarks

X-40™ is benchmarked against methods (not brands). The production-incident proxy is Wrong+Accepted: the rate of incorrect outputs that were accepted and shipped.
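As a minimal sketch of how a Wrong+Accepted rate could be tallied: the record fields (`correct`, `accepted`) and the sample run below are hypothetical illustrations, not part of the published protocol.

```python
from dataclasses import dataclass

@dataclass
class Outcome:
    correct: bool   # was the output actually correct?
    accepted: bool  # was the output accepted and shipped?

def wrong_accepted_rate(outcomes):
    """Fraction of outputs that were both wrong and accepted."""
    if not outcomes:
        return 0.0
    bad = sum(1 for o in outcomes if o.accepted and not o.correct)
    return bad / len(outcomes)

# Hypothetical run: 3 correct+accepted, 1 wrong+accepted, 1 wrong+rejected.
run = [Outcome(True, True), Outcome(True, True), Outcome(True, True),
       Outcome(False, True), Outcome(False, False)]
print(wrong_accepted_rate(run))  # 0.2
```

Note that only wrong outputs that slip through acceptance count toward the metric; wrong outputs that were caught and rejected do not.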

v2.4 — Market benchmark (GPT-4.1)
Covers deterministic, facts, unknowns, attack, and math packs. Includes competitor-style baselines (self-consistency and judge-style approaches) for comparison.
v2.6 — GPT-5.2 coverage benchmark
5 seeds × 1,000 math prompts per seed, plus packs for unknowns and attack behavior. Published artifacts are listed below.
GPT-5.2 v2.6 artifacts:
Summary (MD): latest.md
Notes: Benchmarks validate behavior under the published protocol and do not claim universal guarantees for all prompts/models. Real-world performance depends on workload profiles, baseline calibration, and policy configuration.