🤓 Terminal-Bench-RL: Training Long-Horizon Terminal Agents with Reinforcement Learning
TL;DR:
- I successfully built stable RL training infrastructure that scales to 32x H100 GPUs across 4 bare metal nodes for training long-horizon terminal-based coding agents.
- In doing so, I developed Terminal-Agent-Qwen3-32b, which became the highest-scoring Qwen3 agent on terminal-bench, WITHOUT any training! (currently under submission)
- Unfortunately I am too GPU poor to train a SOTA coding agent 😅 (estimated £30k-£50k in compute required), but if anyone has the GPUs, this project should get you there!
This project builds upon the rLLM framework developed by UC Berkeley Sky Lab, extending it with custom environments and infrastructure specifically designed for terminal-based agent training.
💻💰 Training on $1M worth of compute
This image shows my training code running at full throttle on 32x H100s, distributed across a cluster of 4 bare metal nodes, training Qwen3-32B. Thank you Hyperbolic for such a streamlined experience! This was fun!
Due to the extreme cost of this level of compute, I was not able to run it for long! So I made sure it worked, and also ran the code on less extravagant hardware setups.
Other training runs
I also ran Qwen3-32B training for longer on a 2x bare metal node cluster with 16x H100s:
Also 1 VM instance with 8x H100s:
My longest training run used 2x A100s on a single VM instance, where I trained Qwen3-8B for over 60 steps:
Note: I did not expect the 8B model to begin learning the complex behaviours required to solve the tasks in the dataset. However, it was great to run the training through the dataset and confirm the code is stable.
🏆 Placing a spot on the Terminal Bench Leaderboard
Terminal Bench is a brilliant benchmark created by Stanford and the Laude Institute to quantify agents' ability to complete complex tasks in the terminal.
Through prompt engineering and custom tool design, my Qwen3-32B agent outperformed Stanford's Terminus agent with the Qwen3-235B MoE, as well as Terminus with DeepSeek R1 and OpenAI's Codex agent with GPT-4.1, becoming the highest-scoring Qwen3 agent on the leaderboard.
The results.json for the eval run can be found here.
I am sure that with the compute budget for training, my agent would climb the leaderboard significantly.
Agent details
My motivation behind this entire project was to place on the Terminal Bench leaderboard by using RL to train a sophisticated LLM agent. To do so, I developed the tools (inspired by Claude Code) that a capable AI agent would use to complete complex terminal/coding tasks, along with a system message that encourages the agent to use those tools and approach each task in a specific way.
These tools can be found here and include:
- 📝 Todo Management: Planning and tracking task progress
- 📁 File Operations: Read, write, and edit files
- 🔍 Search Tools: Grep, glob, and ls for file exploration
- ⚡ Bash Execution: Run terminal commands with output capture
- 🗒️ Scratchpad: Space for note-taking
- 👍 Task Completion: Signal when the agent believes the task is complete
Note: Technically the agent could have access to only the bash tool and would still have the same capabilities as with all the tools above, saving development time and maintenance. However, providing clear APIs for specific tools enables the agent to understand and leverage them much more effectively.
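For a sense of the shape one of these tool handlers might take, here is a minimal sketch of a bash-execution tool with output capture and a timeout. The class and method names (`BashTool`, `run`) are my own illustration, not necessarily what the repo uses:

```python
import subprocess

class BashTool:
    """Hypothetical bash-execution tool: runs a command and captures its output."""

    def run(self, cmd: str, timeout_secs: int = 30) -> str:
        try:
            result = subprocess.run(
                cmd, shell=True, capture_output=True, text=True, timeout=timeout_secs
            )
            # Return both streams so the agent sees errors as well as normal output
            return (
                f"exit_code={result.returncode}\n"
                f"stdout:\n{result.stdout}\n"
                f"stderr:\n{result.stderr}"
            )
        except subprocess.TimeoutExpired:
            return f"Command timed out after {timeout_secs}s"
```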
🏗️ Action-Based Architecture
The agent communicates through a structured XML/YAML format that ensures reliable parsing and execution:

```xml
<todo>
operations:
  - action: add
    content: "Find and analyze all Python test files"
  - action: add
    content: "Run pytest and fix any failing tests"
view_all: true
</todo>
```

```xml
<bash>
cmd: 'find . -name "*.py" -path "*/test*" | head -10'
timeout_secs: 30
</bash>
```
This architecture provides:
- Type Safety: Each action (bash, file, search, todo) has a dedicated handler with validation
- Error Recovery: Malformed YAML triggers helpful error messages guiding the agent to correct syntax
- Sequential Execution: Actions are processed one at a time with mandatory stop-and-wait behavior
- Consistent Feedback: Every action returns structured results that the agent can use to adjust its plan (a parsing sketch follows this list)
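As a rough sketch of how such an action block could be parsed and dispatched, assuming PyYAML is installed (the tag names come from the examples above; the function name and return shape are my own assumptions):

```python
import re
import yaml

ACTION_TAGS = ("bash", "file", "search", "todo")

def parse_action(response: str):
    """Extract the first <tag>...</tag> block and parse its YAML body."""
    for tag in ACTION_TAGS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", response, re.DOTALL)
        if match:
            try:
                payload = yaml.safe_load(match.group(1))
            except yaml.YAMLError as exc:
                # Malformed YAML: return a helpful error the agent can act on
                return tag, {"error": f"Invalid YAML in <{tag}> block: {exc}"}
            return tag, payload
    return None, {"error": "No recognised action tag found"}

tag, payload = parse_action('<bash>\ncmd: "ls -la"\ntimeout_secs: 30\n</bash>')
print(tag, payload)  # bash {'cmd': 'ls -la', 'timeout_secs': 30}
```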
As well as developing these tools, I also wrote out a system prompt which encourages best practices such as:
- Structured Task Execution: Clear problem approach phases (Planning → Exploration → Execution → Verification)
- Multi-Turn Interaction: Action-environment cycle with proper stop-and-wait behavior
- Mandatory Todo Management: Required initial planning and continuous task tracking
- Read-Only Exploration: Gather information before making any changes
With this system message and tool combination plus a capable LLM (I chose Qwen3-32B), I placed 19th on the Terminal Bench leaderboard (currently under submission) with a score of 13.75%. This outperformed:
- Terminus agent with Qwen3-235B by Stanford
- Terminus agent with Deepseek-R1 by Stanford
- Codex agent with GPT-4.1 by OpenAI
- Codex agent with codex-mini by OpenAI
The agent can be seen here.
I would be extremely excited to see where Qwen3-32B would be on the leaderboard if I could afford to pay for the compute cost of a proper RL run!
Training details
As mentioned above, the compute cost of a full training run on a 32B LLM for long-horizon terminal/coding tasks is beyond my budget. However, the training code and dataset are ready to go and have been tested to train stably on hardware setups from 2x A100s all the way to 32x H100s.
⚖️ Reward Design
To provide meaningful supervision during RL, rewards were computed by combining two complementary signals (a weighting sketch follows the two subsections below):
✅ Answer Verification (65% weight)
- Each training datapoint included Python unit tests to verify task completion
- Tests were assigned individual weights to provide granular partial credit
- Test execution ran in the isolated Docker container in which the agent completed its work
- Weighted scoring: passed tests contributed their weight to the final test score
🤖 LLM-as-a-Judge (35% weight)
- Used Claude-4-Sonnet as an external judge to evaluate agent behavior
- Evaluated four primary components:
- Action Output Success (35%): Valid XML actions, successful parsing, error recovery
- Todo Usage & Planning (25%): Initial planning, task tracking, continuous updates
- Phase Adherence (25%): Following the 5-phase workflow (Planning → Exploration → Refinement → Execution → Verification)
- Tool Usage Effectiveness (15%): Appropriate tool selection, purposeful actions
- Applied quality modifiers for error recovery, discovery quality, and efficiency
- Penalized overthinking without action, gaming behaviors, and phase violations
- Scored on HOW the agent worked, not WHETHER the task was completed
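As a minimal sketch of how these two signals could be combined (the 65%/35% weights and partial-credit scheme come from the section above; the function names and inputs are my own illustration):

```python
def weighted_test_score(test_results: dict[str, bool], test_weights: dict[str, float]) -> float:
    """Partial credit: each passed test contributes its weight to the test score."""
    total = sum(test_weights.values())
    earned = sum(w for name, w in test_weights.items() if test_results.get(name, False))
    return earned / total if total > 0 else 0.0

def combined_reward(test_results, test_weights, judge_score: float) -> float:
    """Blend answer verification (65%) and LLM-judge evaluation (35%) into one reward."""
    return 0.65 * weighted_test_score(test_results, test_weights) + 0.35 * judge_score

reward = combined_reward(
    {"test_hook_script_executable": True, "test_deployment_works_correctly": False},
    {"test_hook_script_executable": 0.35, "test_deployment_works_correctly": 0.65},
    judge_score=0.8,
)
print(round(reward, 3))  # 0.65 * 0.35 + 0.35 * 0.8 = 0.508
```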
🧪 Judge Evaluation System
To ensure the LLM judge provided accurate and consistent scoring during RL training, I developed a simple evaluation system:
- Created test cases showing different agent trajectories
- Tested multiple LLM models as judges including: Kimi K2, Qwen-3-Coder, Claude Sonnet 4, Claude Haiku 3.5, to compare scoring accuracy.
- Found Claude Sonnet 4 provided the most consistent and accurate scoring, correctly identifying issues like lack of exploration and overthinking
- Unfortunately, Sonnet 4 is extremely expensive, so it is not very affordable for a 32-rollout, 1,650-step run! But it was the only model that could reliably tell a good trajectory from a bad one.
- Many other models (including Haiku 3.5) gave inflated scores to problematic agent behaviors, with some scoring 0.85-0.95 for agents that skipped critical phases
To analyze judge model performance:
```bash
# Run evaluation on a specific model
uv run python evaluation/llm_as_a_judge_evals/judge_eval.py --model openrouter/openai/gpt-4.1 --attempts 3

# Generate performance report showing best models
uv run python evaluation/llm_as_a_judge_evals/report.py
```
Top 5 Judge Models Performance:
| Rank | Model | Pass Rate | Avg Score |
|---|---|---|---|
| 1 | Claude Sonnet 4 | 46.67% | 0.26 |
| 2 | Claude 3.5 Haiku | 46.67% | 0.70 |
| 3 | Qwen3 Coder | 26.67% | 0.76 |
| 4 | Devstral Medium | 23.33% | 0.50 |
| 5 | Kimi K2 | 23.33% | 0.53 |
Claude Sonnet 4 ranks #1 despite having the same pass rate as Haiku because its significantly lower average score (0.26 vs 0.70) indicates stricter, more accurate judging (on the eval dataset). Lower scores mean the model better identifies problematic agent behaviors that other judges miss.
Other models tested include: GPT-4.1, Gemma-3-27B-IT, Qwen3-32B, and Qwen3-235B-A22B.
🔄 Dynamic LLM Judge Switching
To handle overloaded models, token limits, or changing performance requirements during long training runs, the infrastructure supports hot-swapping between different LLM judge backends:
- Runtime switching without interrupting training process
- Switch between Claude Code CLI and LiteLLM backends as needed
- Useful when hitting API token limits or budget constraints
- See `switch_judge_backend.py` and the switching documentation
Example workflow:
```bash
# Start with Claude Code CLI
python training_scripts/launch_training.py prod_32b_8_gpus

# Need to change? Switch to LiteLLM API (can also provide env vars to populate into training)
python training_scripts/switch_judge_backend.py litellm anthropic/claude-3-opus-20240229

# Later, switch back
python training_scripts/switch_judge_backend.py ccode
```
🏗️ rLLM Integration Architecture
This project extends rLLM's `BaseAgent` and `BaseEnv` interfaces to create a complete RL training loop:

Terminal Agent (`TerminalBenchAgent`)
- Extends rLLM's `BaseAgent` to manage multi-turn conversations between the environment and LLM
- Maintains conversation history with system prompt, user instructions, and agent responses
- Tracks complete trajectories with observations, actions, and rewards for GRPO training

Docker Environment (`DockerIsolatedEnv`)
- Extends rLLM's `BaseEnv` to provide isolated Docker containers for each training rollout
- Each rollout spawns a fresh container from the task's Dockerfile specification
- Executes agent actions through Docker, returning real terminal output as observations
- Computes rewards via software tests (65%) and LLM judge evaluation (35%)
- Ensures complete isolation between parallel rollouts for diverse solution exploration

The training loop follows rLLM's standard flow: reset → observation → LLM inference → action → environment step → reward → repeat. For more details, see `docs/rllm_specific/understanding_of_agents_and_envs.md`.
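As a rough, simplified sketch of what one pass through that loop might look like (the `env`/`agent`/`llm` method names below only loosely mirror the described interfaces and should be treated as assumptions):

```python
def run_rollout(env, agent, llm, max_turns: int = 50) -> float:
    """One trajectory: reset the env, alternate LLM actions and env steps, return the final reward."""
    observation = env.reset()                # fresh Docker container + task prompt
    reward = 0.0
    for _ in range(max_turns):
        response = llm.generate(agent.build_messages(observation))  # LLM inference
        action = agent.parse_action(response)                       # XML/YAML action block
        observation, reward, done, info = env.step(action)          # execute inside the container
        agent.record(observation, action, reward)                   # trajectory data for GRPO
        if done:                                                    # task signalled complete
            break
    env.close()                              # destroy the container
    return reward
```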
🔄 Training & Rollout Details
This project leveraged Group Relative Policy Optimization (GRPO), which encourages the model to learn from relative advantages within a group of sampled responses, making it particularly well-suited for structured reasoning tasks.
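For intuition, the group-relative advantage at the heart of GRPO normalises each rollout's reward against the statistics of its own group of samples. A tiny sketch of that idea (my own simplification, not the repo's implementation):

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: normalise each reward against its rollout group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # avoid division by zero when all rewards are equal
    return [(r - mean) / std for r in rewards]

# e.g. 16 rollouts of the same prompt are each scored, then compared within the group
print(group_relative_advantages([0.2, 0.5, 0.9, 0.5]))
```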
🔁 Rollout Strategy
- 16 samples per training prompt (configurable), each generated with a temperature of 1.2 to encourage diversity while maintaining coherence
- Complete trajectory isolation via Docker containers for each rollout
⚙️ Training Configuration Presets
The training infrastructure supports multiple hardware configurations through simple preset selection:
```bash
# Quick test run on 2x A100s
python training_scripts/launch_training.py test_8b_2_gpus

# Production run on 8x H100s
python training_scripts/launch_training.py prod_32b_8_gpus

# Scale to 32x H100s across 4 nodes
python training_scripts/launch_training.py prod_32b_4x8_h100
```
Available presets scale from development to production:
- `test_8b_2_gpus`: Quick validation with Qwen3-8B on 2x 80GB GPUs
- `runway_32b_4_gpus`: Standard training with Qwen3-32B on 4x GPUs
- `prod_32b_8_gpus`: Production setup on a single 8x GPU node
- `prod_32b_2x8_h100`: Multi-node training on 16x H100s (2 nodes)
- `prod_32b_4x8_h100`: Full scale on 32x H100s (4 nodes)
📊 Key Hyperparameters (Production Config)
- Algorithm: GRPO with rejection sampling
- Learning Rate: 1e-6 with gradient clipping (max norm = 0.1)
- Batch Configuration: Adaptive based on GPU count
- Sequence Length: 32,768 tokens max
- Max Single Response Length: 4,000 tokens per response
- Training Duration: 10 epochs through dataset
- Parallelization: Automatic tensor/sequence parallel sizing
- Precision: bfloat16 for efficiency
- Monitoring: WandB integration + detailed trajectory logging
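Collected in one place, these settings could look roughly like the following hypothetical config dict (the real presets live in the training scripts and may use different key names):

```python
PROD_CONFIG = {
    "algorithm": "grpo",                 # GRPO with rejection sampling
    "learning_rate": 1e-6,
    "grad_clip_max_norm": 0.1,
    "max_sequence_length": 32_768,       # full conversation budget
    "max_response_length": 4_000,        # per single LLM response
    "epochs": 10,
    "rollouts_per_prompt": 16,
    "sampling_temperature": 1.2,
    "precision": "bfloat16",
    "logging": ["wandb", "trajectory_files"],
}
```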
The infrastructure automatically handles:
- Model distribution across GPUs with optimal tensor/sequence parallelism
- Memory optimization based on hardware (0.7-0.85 GPU utilization)
- Docker container lifecycle management for isolated rollouts
- Checkpoint saving and optional HuggingFace upload
🗂️ Dataset Details
This repo includes 331 training tasks, ranging from easy to extremely hard.
🤓🤖 I developed a comprehensive multi-agent synthetic data pipeline powered by Claude Code + Opus-4 to generate and (importantly) validate each datapoint. The repo for this framework can be found here!
📊 Dataset Structure
Each training datapoint in `dataset/latest_verified.csv` contains:

```python
{
    "task_id": "git-deployment-workflow-setup",        # Unique task identifier
    "difficulty": "hard",                               # easy|medium|hard|extremely_hard
    "category": "system-administration",                # Task category
    "prompt": "I need help setting up a simple CI/CD system...",  # The actual task instruction
    "dockerfile": "FROM ghcr.io/laude-institute/t-bench/ubuntu-24-04:latest\n...",  # Docker environment setup
    "test_functions": "def test_hook_script_executable():\n    ...",  # Pytest verification code
    "test_weights": {                                    # Weight for each test (for partial credit)
        "test_hook_script_executable": 0.35,
        "test_nginx_service_running": 0.15,
        "test_deployment_works_correctly": 0.50
    },
    "additional_files": {                                # Optional files to include in container
        "backup_config.json": "{\n  \"schedules\": [...",
        "collision_detector.py": "#!/usr/bin/env python3\n..."
    }
}
```
🐳 Training Environment Creation
During training, each task generates multiple parallel rollouts (trajectories), with each rollout executed in complete isolation (a container-lifecycle sketch follows the steps below):

1. Parallel Rollout Generation:
   - N_ROLLOUTS: configurable per training preset (e.g., 4 for test runs, 16 for production)
   - Each rollout runs simultaneously in its own Docker container
   - Complete independence between rollouts allows diverse solution exploration
2. Per-Rollout Environment Setup:
   - A new Docker container is created using the task's `dockerfile`
   - The container starts with a clean filesystem based on the Dockerfile
   - Any `additional_files` are written to the container before the agent begins
   - The agent receives the task `prompt` as its initial instruction
3. Agent Execution: The agent interacts with its isolated environment using tools (bash, file operations, etc.)
4. Verification: After completion, `test_functions` are executed to compute the test score
5. Cleanup: The container is destroyed after trajectory completion
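As a rough sketch of that per-rollout container lifecycle using the Docker CLI via `subprocess` (the actual environment code in `docker_env.py` may use the Docker SDK and different naming; the `/app/` destination path is also an assumption):

```python
import subprocess
import tempfile
import uuid
from pathlib import Path

def start_isolated_rollout(dockerfile: str, additional_files: dict[str, str]) -> str:
    """Build the task image, start a clean container, write extra files, return the container id."""
    tag = f"rollout-{uuid.uuid4().hex[:8]}"
    with tempfile.TemporaryDirectory() as build_dir:
        Path(build_dir, "Dockerfile").write_text(dockerfile)
        subprocess.run(["docker", "build", "-t", tag, build_dir], check=True)

    container_id = subprocess.run(
        ["docker", "run", "-d", tag, "sleep", "infinity"],
        check=True, capture_output=True, text=True,
    ).stdout.strip()

    # Write any additional_files into the container before the agent starts
    for name, content in additional_files.items():
        subprocess.run(
            ["docker", "exec", "-i", container_id, "sh", "-c", f"cat > /app/{name}"],
            input=content, text=True, check=True,
        )
    return container_id

def destroy_rollout(container_id: str) -> None:
    """Destroy the container once the trajectory is complete."""
    subprocess.run(["docker", "rm", "-f", container_id], check=True)
```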
🧹 Docker Resource Management
Due to the high volume of Docker containers being created (up to 24 running in parallel during training), the infrastructure includes automatic resource cleanup (a minimal daemon sketch follows this list):
- Automatic cleanup daemon periodically removes stopped containers and unused networks
- Runs every 2 minutes to prevent resource exhaustion
- Immediate cleanup on training startup to ensure clean environment
- See `docker_cleanup.py` and `docker_env.py` for cleanup implementation details
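A minimal version of such a cleanup daemon might look like this. It is my own sketch built on `docker container prune` / `docker network prune` and a background timer; the real `docker_cleanup.py` may work differently:

```python
import subprocess
import threading

CLEANUP_INTERVAL_SECS = 120  # runs every 2 minutes

def cleanup_docker_resources() -> None:
    """Remove stopped containers and unused networks to prevent resource exhaustion."""
    subprocess.run(["docker", "container", "prune", "-f"], check=False)
    subprocess.run(["docker", "network", "prune", "-f"], check=False)

def start_cleanup_daemon() -> None:
    """Run one immediate cleanup, then repeat periodically in a background thread."""
    cleanup_docker_resources()
    timer = threading.Timer(CLEANUP_INTERVAL_SECS, start_cleanup_daemon)
    timer.daemon = True
    timer.start()

start_cleanup_daemon()
```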
📂 Dataset Preparation Pipeline
The training data flows through a multi-stage preparation pipeline before being used by rLLM:

1. CSV Dataset (`dataset/latest_verified.csv`): Contains task definitions with prompts, Dockerfiles, test functions, and weights
2. Terminal Bench Tasks (via `convert_dataset_to_tasks.py`):
   - Converts each CSV row into a Terminal Bench task directory structure, in order to leverage the Terminal Bench Docker harness and unit-test runner/parser for reward calculation during the RL run
   - Runs in parallel to speed up conversion of tasks
3. Parquet Format (via `tasks_to_parquet_converter.py`):
   - Creates `data/terminal_bench/*.parquet` files for rLLM's data loader
   - Each row contains an `extra_info` dict with: task_name, task_path, instruction, test_weights, dockerfile_contents, etc.
   - This `extra_info` is passed to `DockerIsolatedEnv.from_dict()` during training to create environments
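A condensed sketch of what the CSV-to-parquet stage could look like, assuming pandas with a parquet engine (pyarrow) and that `extra_info` is serialised into a single column; the real converters are the scripts named above and may differ:

```python
import json
import pandas as pd

def csv_to_parquet(csv_path: str, parquet_path: str) -> None:
    """Pack each task's metadata into an extra_info payload for rLLM's data loader."""
    df = pd.read_csv(csv_path)
    rows = []
    for _, task in df.iterrows():
        rows.append({
            "extra_info": json.dumps({
                "task_name": task["task_id"],
                "instruction": task["prompt"],
                "test_weights": task["test_weights"],
                "dockerfile_contents": task["dockerfile"],
            })
        })
    pd.DataFrame(rows).to_parquet(parquet_path)

csv_to_parquet("dataset/latest_verified.csv", "data/terminal_bench/train.parquet")
```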
🚀 Getting Started
Development Setup
Clone the repository and install dependencies:
```bash
git clone --recurse-submodules https://github.com/Danau5tin/terminal-bench-rl.git
cd terminal-bench-rl
uv sync
```
That's it! UV will handle all dependencies automatically.
Note: This project includes a forked version of the terminal-bench repository with the Python version requirement reduced from 3.13 to 3.12 for compatibility.
Terminal Bench Evaluation Reproduction
After setup, you can reproduce my Terminal Bench evaluation [result](./tbench_eval_run_results.json):
```bash
# Set environment variables, for example:
export LITE_LLM_API_KEY="your_huggingface_token"
export LITE_LLM_API_BASE="https://router.huggingface.co/v1"
export LITELLM_MODEL="openai/Qwen/Qwen3-32B:nebius"

# Run the evaluation
./evaluation/terminal_bench_eval/run_eval.sh
```
This will run the agent with the same configuration that achieved 13.75% on the leaderboard.
Training Deployment
Single Node Training
For detailed single-node training setup:
Multi-Node Training
For distributed training across multiple nodes:
🔮 Future Improvements
Given more time and resources, several enhancements would further improve training effectiveness:
🚀 Full Training Run
At this point, I feel as if I have created a great cake recipe, put everything in place to make it, and yet I can't afford to pay for the oven!!! 😂
- With sufficient compute budget, I'd run full training and then evaluate the trained model. I am confident that it would outperform the untrained Qwen3-32B.
🤓 Curriculum learning
- I'd also implement curriculum learning to progressively increase task difficulty, starting with easy and medium tasks and a high weight on the judge reward to encourage behaviours such as using the todo list, etc.
- Then, once these easier tasks are being completed with the correct behaviour, I would remove the judge completely and move to 100% software verification of the tasks, allowing the model to shed the strict constraints it learnt at first and explore optimised routes to success from a principled base (see the sketch below).
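One way to express that schedule in code, purely as my own illustration of the idea (the specific weights and switch point are made up):

```python
def reward_weights(step: int, switch_step: int = 500) -> tuple[float, float]:
    """Curriculum on the reward mix: judge-heavy early, pure software verification later."""
    if step < switch_step:
        return 0.5, 0.5     # early: high judge weight to shape behaviour (todo use, phases, ...)
    return 1.0, 0.0         # later: tests only, freeing the model to explore optimised routes

test_w, judge_w = reward_weights(step=100)
reward = test_w * 0.6 + judge_w * 0.9   # hypothetical test score 0.6, judge score 0.9
```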
📊 Dataset Expansion
- Generate more datapoints: Currently I have only ~331 tasks due to time constraints
- A larger dataset (1000+ tasks) would provide more diverse scenarios and tech stacks for robust training
- I'd also give each datapoint careful validation, which takes time but ensures quality
🎯 Smart Data Filtering
- Pre-filter trivial datapoints: Before training, I'd run the untrained model on all tasks
- I'd remove datapoints where the model achieves zero or perfect scores (0.0 or 1.0 reward)
- This would save GPU time by focusing training on tasks where the model can actually learn (a filtering sketch follows)
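A minimal sketch of that pre-filter, assuming a hypothetical `rollout_reward(task)` helper that runs the untrained model once on a task and returns its reward:

```python
def filter_trivial_tasks(tasks, rollout_reward) -> list:
    """Keep only tasks the untrained model neither fails completely nor solves perfectly."""
    kept = []
    for task in tasks:
        reward = rollout_reward(task)      # single rollout with the untrained policy
        if 0.0 < reward < 1.0:
            kept.append(task)              # there is learning signal here
    return kept
```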
Acknowledgements
- I want to thank everyone who contributed to Terminal Bench, it is a great benchmark and just what I've been looking for!
- A big thank you to the minds behind rLLM too! I did try this with other training frameworks, and they contained critical bugs which are not solved today, so it was a breath of fresh air to use a framework that performed so well!
- A thank you to the Claude Code team for inspiring the tool use and agent behavioural approach used in this project!
- A big thank you to the research team at Anthropic, for creating such great models. Opus-4 assisted me heavily in this project, especially when debugging problems on deployed clusters of distributed GPUs.
This project was a lot of fun! Thanks for reading! Dan