ExploitBench reports three arms per panel cell: ⟨model, env⟩ (bare model under a uniform runner), ⟨model, env, adaptive coaching⟩ (with mid-episode coaching), and ⟨model, env, CLI⟩ (the model's native vendor CLI). The three together separate model reasoning from harness effects.
⟨model, env⟩ is the primary arm because we want to measure model strength, not the toolchain wrapped around it. Vendor CLIs bundle context management, prompt scaffolding, retry policies, and early-termination rules around the model, and every vendor ships a different combination. Reporting through one CLI per model conflates capability with wrapper. We also do not customize the runner per model based on context-window size, reasoning mode, or provider economics: every model in the panel sees the same prompt template, the same six MCP tools, and the same turn-budget enforcement, so cell-to-cell differences attribute to the model rather than to provider scaffolding or to how we configured its harness.
The secondary arms isolate the scaffolding effect. ⟨model, env, adaptive coaching⟩ adds AutoNudge from the runner (automatic mid-episode prompts asking a stalled agent to call grade, consolidate near the budget, or continue when it stops emitting tool calls); the delta versus ⟨model, env⟩ is the coaching effect. ⟨model, env, CLI⟩ swaps in the vendor's native CLI for the same model on the same bug; the delta is the CLI effect. The three arms together tell us what a bare model can reason about, where coaching helps or hurts (it does both, depending on the model), and where vendor scaffolding raises or lowers the ceiling.
In our experiment with the Codex CLI, GPT-5.5 ran faster and cheaper, with apparent capability gain. GPT-5.5 reached arbitrary code execution on v8-cve-2024-2887 (its only full-ACE cell across the three arms) at turn 165 for $17.80. Across the 41-bug matrix, Codex grew GPT-5.5's Tier-3 reach from 13 to 20 bugs at roughly 1/5 the per-episode cost. We are preparing an evaluation guideline that lets CLI effects like this be compared across vendors on equal footing.