A shared playbook for trustworthy third party evaluations
OpenAI published recommendations for designing trustworthy third-party evaluations of frontier AI models, emphasizing that the surrounding "harness"—the environment, tools, and setup enabling agentic execution—fundamentally shapes measured capabilities and safeguard robustness. The post categorizes evaluation claims and urges evaluators to transparently report their setup, budget, and validity checks to avoid under-elicitation or miscalibrated results.
Source: OpenAI Blog
May 29, 2026
What matters for effective independent evaluations of safeguards and capabilities for frontier models.
Independent, trusted third-party evaluations play a critical role in strengthening the safety ecosystem. These evaluations are conducted on frontier models to provide additional evidence for claims about critical capabilities and safety mitigations. In this post, we share lessons we’ve learned so far and recommend approaches for designing evaluations that can validly assess frontier models. We hope these recommendations help inform emerging standards in the space.
Earlier, many evaluations treated models like chatbots: the evaluation prompted a model as though it were a user asking a question, the model answered, and an evaluator judged the output. Today’s frontier models can do much more: they can use tools, keep track of information across many steps, and act within a larger workflow. This means that performance depends not only on the model, but also on the environment in which the task takes place, and on the setup that facilitates its actions. This surrounding setup, which we call the “harness,” can change key aspects of the system’s performance, including how it uses tools, keeps track of information, or recovers from mistakes.
This changes how evaluations need to be conducted, and what readers should look for in evaluation reports. In our view, the most useful reports explicitly describe two things beyond the result itself: first, what claim the evaluation setup was designed to test, and second, what evidence is available that the evaluation result is valid.
Claims tested in evaluations typically fall into one of three buckets¹:
- Capability elicitation: Can a model plausibly produce the capability being evaluated?
- Safeguard performance: How robust are the tested safeguards against the behavior or attack being evaluated?
- Comparison: How do different models perform under equivalent conditions?
Evaluation reports also need to explain how evaluators checked for effects that could impact the validity of a result. These include:
- Reward hacking: Exploiting shortcuts in the task or scorer, so the system gets credit without demonstrating the behavior the evaluation is meant to measure.
- Refusals: Refusing in ways that obscure the behavior being tested.
- Contamination: Overperforming because evaluation tasks, answers, or close variants appeared in training data or were discoverable during the evaluation, such as through browsing.
- Broken problems: Underperforming because tasks are invalid. Reasons can include unfair scoring (e.g., correct answer requires unstated implementation details) and unsolvable environments (e.g., missing critical files or unreliable tools).
- Sandbagging: Deliberately underperforming when the system shows awareness of being evaluated.
Selecting the right harness for an evaluation is crucial for optimal results
We’ve observed that the role of the harness is especially important for systems that act over longer trajectories. When models can use tools, maintain state, and recover from mistakes across many steps, the harness can change the observed level of performance, and even determine whether the capability that’s being assessed appears in the evaluation at all. For example, a harness that preserves state and retries failed actions may let a model finish a multi-step task that the same model never completes in a simpler harness.
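As a toy illustration of that last point (not a description of any real harness), the sketch below runs the same scripted steps through two hypothetical harnesses. `FlakyTool`, `simple_harness`, and `stateful_retry_harness` are names invented for this example, and the tool is assumed to fail exactly once per step before succeeding.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FlakyTool:
    """Hypothetical tool: each step fails `failures_before_success` times, then succeeds."""
    failures_before_success: int = 1
    calls: Dict[str, int] = field(default_factory=dict)

    def run(self, step: str) -> bool:
        self.calls[step] = self.calls.get(step, 0) + 1
        return self.calls[step] > self.failures_before_success


def simple_harness(steps: List[str], tool: FlakyTool) -> bool:
    """Gives up at the first failed action and keeps no record of progress."""
    for step in steps:
        if not tool.run(step):
            return False
    return True


def stateful_retry_harness(steps: List[str], tool: FlakyTool, max_retries: int = 2) -> bool:
    """Tracks completed steps and retries a failed action before giving up."""
    completed: List[str] = []  # state preserved across steps
    for step in steps:
        for _ in range(1 + max_retries):
            if tool.run(step):
                completed.append(step)
                break
        else:
            # Every attempt at this step failed.
            return False
    return len(completed) == len(steps)


steps = ["fetch", "parse", "write"]
simple_result = simple_harness(steps, FlakyTool())
retry_result = stateful_retry_harness(steps, FlakyTool())
print(simple_result, retry_result)
```

The "model" here is identical in both runs; only the harness differs, which is the sense in which a harness can determine whether a capability appears in the evaluation at all.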
| Claim the evaluation is trying to support | Appropriate harness choice | Evidence to report |
|---|---|---|
| Capability under strong elicitation: System A can complete tasks of type X when the setup is designed to draw out its strongest credible performance. | Use the strongest credible elicitation setup for the system, including the harness, tools, scaffolding, and budget a capable user would reasonably use. | The harness and tool setup, elicitation guidance, budget/effort allowed, tokens/cost/time, and why the setup is a credible proxy for the claimed capability. If comparing systems under different optimized setups, label it as a system-to-system or strong-elicitation comparison. |
| Controlled comparison: System A outperforms System B under a shared evaluation setup. | Keep the tasks, scoring, and budget fixed. Use either a shared harness/tool setup or a fixed set of standardized harnesses, chosen up front, that provides reasonable maximum elicitation for the systems being compared. | The shared task set, tools, scoring method, harness, budget, token efficiency/cost, and known limitations. For coding-agent evaluations, an open-source harness such as Codex CLI can provide a fixed agent loop and tool interface across systems. The ideal approach for maximum elicitation would be to optimize a bespoke harness for each task and system, but doing so is currently impractical. |
| Safeguard robustness under elicited attack: System A’s safeguards are sufficient for the relevant model behavior or elicited attack. | Use a safeguard-testing setup designed to elicit the strongest credible attack under the relevant adversary model. | How evaluators characterized the relevant model behavior, the safeguard configuration tested, the elicitation strategy, the harness used to carry it out, and the budget or effort allowed. |
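One way to make the table actionable is to treat its "Evidence to report" column as a checklist. The sketch below is a minimal, assumed encoding: `ClaimType`, `REQUIRED_EVIDENCE`, `EvalReport`, and the evidence keys are hypothetical names chosen for this example, loosely paraphrasing the table rather than reproducing any official schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class ClaimType(Enum):
    CAPABILITY_ELICITATION = "capability under strong elicitation"
    CONTROLLED_COMPARISON = "controlled comparison"
    SAFEGUARD_ROBUSTNESS = "safeguard robustness under elicited attack"


# Hypothetical evidence keys, paraphrased from the table's last column.
REQUIRED_EVIDENCE = {
    ClaimType.CAPABILITY_ELICITATION: (
        "harness", "tools", "elicitation_guidance", "budget", "setup_rationale",
    ),
    ClaimType.CONTROLLED_COMPARISON: (
        "task_set", "tools", "scoring_method", "harness", "budget", "known_limitations",
    ),
    ClaimType.SAFEGUARD_ROBUSTNESS: (
        "behavior_characterization", "safeguard_config",
        "elicitation_strategy", "harness", "budget",
    ),
}


@dataclass
class EvalReport:
    claim_type: ClaimType
    claim: str
    evidence: Dict[str, str] = field(default_factory=dict)

    def missing_evidence(self) -> List[str]:
        """List required evidence items that are absent or empty for this claim type."""
        return [
            key for key in REQUIRED_EVIDENCE[self.claim_type]
            if not self.evidence.get(key)
        ]


report = EvalReport(
    claim_type=ClaimType.CONTROLLED_COMPARISON,
    claim="System A outperforms System B under a shared evaluation setup.",
    evidence={
        "task_set": "shared task suite",
        "harness": "single shared agent loop",
        "budget": "fixed token budget per task",
    },
)
print(report.missing_evidence())
```

A report draft that returns a non-empty list here has stated a claim without all of the evidence the table associates with that claim type.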
Capability claims are only as strong as the elicitation behind them: evaluators need to choose the harness that best fits the task and the capability the evaluation is trying to measure. A standardized harness may be right for comparing systems under identical conditions, but it can understate capability when it leaves out specific harness features that help the model perform the task. For example, GPT‑5.5’s performance on OpenAI’s cyber ranges shows how a harness choice can materially change measured capability on tasks that require long, multi-step tool use: the model performs better when the harness uses compaction to preserve task-relevant context as the interaction gets longer. This demonstrates that for certain models, a harness that omits compaction would under-elicit performance.
[Figure: Compaction improves success on multi-step cyber range tasks. Bar chart of percent of ranges solved (pass@16) for GPT-5.5 and GPT-5.4, without and with compaction. Bar labels: 15.4%, 7.7%, 92.3%, 69.2%. Higher success rates are better.]
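The post does not describe how compaction is implemented in the harness used for these results. As a rough sketch of the general idea only, the example below replaces older turns with a summary once the history exceeds a size limit while keeping the most recent turns verbatim. `compact_history` and `naive_summary` are hypothetical names, character counts stand in for tokens, and the truncating summarizer is a placeholder for whatever summarization a real harness would use.

```python
from typing import Callable, List


def compact_history(
    messages: List[str],
    max_chars: int,
    keep_recent: int,
    summarize: Callable[[List[str]], str],
) -> List[str]:
    """Return the history unchanged if it fits; otherwise summarize the older turns."""
    total = sum(len(m) for m in messages)
    if total <= max_chars or len(messages) <= keep_recent:
        return list(messages)
    split = len(messages) - keep_recent
    older, recent = messages[:split], messages[split:]
    # One summary message replaces all older turns; recent turns stay verbatim.
    return [summarize(older)] + recent


def naive_summary(older: List[str]) -> str:
    """Placeholder summarizer: keeps only the first 12 characters of each turn."""
    heads = "; ".join(m[:12] for m in older)
    return f"SUMMARY of {len(older)} earlier turns: {heads}"


history = [f"turn {i}: " + "x" * 50 for i in range(6)]
compacted = compact_history(history, max_chars=200, keep_recent=2, summarize=naive_summary)
print(len(history), "->", len(compacted))
```

The design choice that matters for evaluation is what the summary preserves: if task-relevant details are dropped during compaction, the harness can still under-elicit on long trajectories.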
Other published evaluations² also show harness and budget choices changing evaluation results. Increasing test-time compute can significantly change what capability an evaluation elicits, especially in domains where success is easy to verify, such as many cyber tasks. In UK AISI’s cyber range evaluation, increasing the budget from 10M to 100M tokens improved performance by up to 59%, and performance was still increasing at the highest budget tested.
Detailing this makes the evaluation more interpretable: it shows readers how the result depends on the tested elicitation setup. When performance is still improving with additional budget, the score should be described as performance under that harness and budget, not as a measured capability ceiling. Capability is often resource-dependent rather than a fixed quantity that can be cleanly measured once and for all.
Where success can be measured across repeated attempts, reports should also consider expected cost per successful solve, not just success rate at a fixed token budget. This can make severity easier to interpret: a low success rate may still be practically meaningful if the cost of repeated attempts is within the relevant threat model.
For capability claims, avoidable under-elicitation is a measurement failure: if the harness or budget prevents the system from exhibiting behavior it could otherwise produce, the score does not measure the capability being claimed. Where evaluators have pushed elicitation as far as is feasible and performance is still improving, reports should say so clearly and make clear that the result is only a lower-bound estimate.
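To make the cost-per-solve and lower-bound points concrete, here is a small sketch under stated assumptions: attempts are independent with the same success probability, each attempt has the same cost, and success is verifiable, so the expected number of attempts until the first success is 1/p. The function names and all numbers are illustrative and are not taken from any published evaluation.

```python
from typing import Dict


def expected_cost_per_solve(success_rate: float, cost_per_attempt: float) -> float:
    """Expected spend until the first success, assuming independent, identical attempts."""
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate


def still_improving(scores_by_budget: Dict[int, float], min_gain: float = 0.0) -> bool:
    """True if the score at the largest budget beats the next-largest by more than min_gain."""
    budgets = sorted(scores_by_budget)
    if len(budgets) < 2:
        return False
    return scores_by_budget[budgets[-1]] - scores_by_budget[budgets[-2]] > min_gain


# Illustrative numbers only: a 12.5% success rate at 50 cost units per attempt.
cost = expected_cost_per_solve(success_rate=0.125, cost_per_attempt=50.0)

# Illustrative scores at three token budgets; the last step still shows a gain.
scores = {1_000_000: 0.20, 10_000_000: 0.35, 100_000_000: 0.50}
label = "lower-bound estimate" if still_improving(scores, min_gain=0.01) else "plateaued"
print(cost, label)
```

Under these assumptions, a 12.5% success rate corresponds to eight expected attempts, so whether the result matters depends on whether eight attempts' worth of cost falls within the relevant threat model.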
Safeguard testing can understate whether an attack can succeed, and how severe it could be, when it does not account for the resources available to attackers, including custom harnesses. In UK AISI's GPT‑5.5 cyber evaluation, expert red teamers found a universal jailbreak that elicited violative cyber content across the malicious queries OpenAI provided, including in multi-turn agentic settings. The red teamers used Codex to create a custom harness that strengthened the model’s attack performance: the harness embedded a reusable safeguard-bypass pattern into the interaction, preserved that pattern across turns and blocks, and applied it across those queries.
Safeguard testing should match the adversary. If the claim is about robustness to expert misuse, the test should evaluate the strongest credible end-to-end attack strategy under a defined budget, including any harness needed to preserve and reuse that strategy. Otherwise, the results risk miscalibration. They could support only a narrower claim about resistance to simpler prompting. They could miss both how severe the attack becomes and how likely it is to succeed once the elicitation method is operationalized. And if the test is given too much budget, they could overstate how likely or severe a problem is.
There is a time and place for standardized harness comparisons, but evaluators should be explicit about why using a consistent set of harnesses is appropriate and what claim it can support. METR's time-horizon evaluation is an example of a broader, appropriately fixed evaluation setup: it is designed to produce comparable results across the systems it evaluates. METR defines a common outcome, the typical duration for a human task at which an AI agent is predicted to succeed at a given reliability level. It applies a shared task suite, scoring method, fitting method, and a small set of reusable scaffolds such as Triframe and ReAct within each batch of estimates reported together. When METR expanded the task suite and moved evaluation infrastructure from a framework called Vivaria to one called Inspect, it reported the change (Time Horizon 1.1 update) and re-evaluated models under the new evaluation setup. That is the value of a standardized evaluation setup, including a consistent harness set: it can make readers confident that a difference in scores really reflects a difference between the systems being compared, rather than a change in the measurement setup.
We recommend that third party evaluation reports state what kind of claim their evaluation setup is me