The developers of Terminal-Bench, a benchmark suite for evaluating the performance of autonomous AI agents on real-world terminal-based tasks, have released version 2.0 alongside Harbor, a new framework for testing, improving, and optimizing AI agents in container environments.
The dual release aims to address long-standing issues in testing and optimizing AI agents, particularly those designed to operate autonomously in realistic developer environments.
With a more difficult and rigorously reviewed set of tasks, Terminal-Bench 2.0 replaces version 1.0 as the standard for assessing the capabilities of frontier models.
Harbor, the associated runtime framework, enables developers and researchers to scale evaluations across thousands of cloud containers and integrates with both open source and proprietary agents and training pipelines.
“Harbor is the package we would have liked when developing Terminal-Bench,” wrote co-creator Alex Shaw on X. “It is aimed at developers and researchers of agents, models and benchmarks who want to evaluate and improve agents and models.”
Terminal-Bench 1.0 was quickly adopted after its release in May 2025, becoming a standard benchmark for evaluating AI-powered agents operating in developer-style terminal environments. These agents interact with systems via the command line, mimicking how developers work beneath the graphical user interface.
However, its wide scope came with inconsistencies. Several tasks were identified by the community as poorly specified, or unstable due to external service changes.
Version 2.0 addresses these issues directly. The updated suite includes 89 tasks, each of which has undergone several hours of manual and LLM-assisted validation. The focus is on making tasks solvable, realistic and clearly specified, raising the difficulty threshold while improving reliability and reproducibility.
A notable example is the download-youtube task, which was removed or refactored in 2.0 due to its reliance on unstable third-party APIs.
“Astute Terminal Bench fans may notice that SOTA performance is comparable to TB1.0, although we argue that TB2.0 is more difficult,” Shaw noted on X. “We believe this is because the task quality is significantly higher in the new benchmark.”
In parallel with the benchmark update, the team launched Harbor, a new framework for running and evaluating agents in cloud-deployed containers.
Harbor supports large-scale rollout infrastructure, with compatibility for major providers such as Daytona and Modal.
Harbor is designed to generalize across agent architectures and supports:
Evaluating any container-installable agent
Scalable pipelines for supervised fine-tuning (SFT) and reinforcement learning (RL)
Custom benchmark creation and deployment
Full integration with Terminal-Bench 2.0
When creating the new benchmark, Harbor was used internally to run tens of thousands of rollouts. It is now publicly available via harborframework.com, with documentation for testing and submitting agents to the public leaderboard.
Initial results from the Terminal-Bench 2.0 leaderboard show that OpenAI’s Codex CLI (Command Line Interface), a GPT-5-based variant, is at the top with a success rate of 49.6% – the highest among all agents tested to date.
Close behind are other GPT-5 variants and Claude Sonnet 4.5-based agents.
Top 5 agent results (Terminal-Bench 2.0):
Codex CLI (GPT-5) – 49.6%
Codex CLI (GPT-5 Codex) – 44.3%
OpenHands (GPT-5) – 43.8%
Terminus 2 (GPT-5 Codex) – 43.4%
Terminus 2 (Claude Sonnet 4.5) – 42.8%
The close clustering of top models indicates active competition between the platforms, with no single agent solving more than half of the tasks.
To test or submit an agent, users install Harbor and run the benchmark using simple CLI commands. Five benchmark runs are required to submit to the leaderboard, and the results can be emailed to the developers along with the job logs for validation.
harbor run -d terminal-bench@2.0 -m "
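The command above is truncated as published. A fuller invocation might look like the following sketch; note that the install command and the model placeholder are assumptions for illustration, not confirmed by the article — only the `harbor run`, `-d`, and `-m` pieces appear in the original snippet.

```shell
# Assumed install step for the Harbor framework (package name not confirmed).
pip install harbor

# Run the Terminal-Bench 2.0 dataset against a model of your choice;
# "<model-name>" is a hypothetical placeholder for the truncated value.
harbor run -d terminal-bench@2.0 -m "<model-name>"
```

Per the article, five such benchmark runs are needed before submitting results to the public leaderboard.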
Terminal-Bench 2.0 is already being integrated into research workflows focused on agentic reasoning, code generation and tool usage. According to co-creator Mike Merrill, a postdoctoral fellow at Stanford University, a detailed preprint covering the verification process and design methodology behind the benchmark is currently in the works.
The joint release of Terminal-Bench 2.0 and Harbor marks a step toward a more consistent and scalable agent evaluation infrastructure. As LLM agents become more common in developer and operational environments, the need for controlled, reproducible testing has grown.
These tools provide a potential foundation for a unified assessment stack and support model improvement, environment simulation, and benchmark standardization across the AI ecosystem.