ITBench

📢 Announcements

Latest Updates

[January 21, 2026] IBM Research has published the Enterprise Agents and Benchmarks collection on Hugging Face, featuring ITBench alongside other enterprise AI agent ecosystems and benchmarks. View the collection.
[December 19, 2025] UC Berkeley's MAST team published a blog post analyzing ITBench SRE agent traces using MAST (Multi-Agent System Failure Taxonomy), revealing structured failure signatures that explain how and why agents fail. Read more.
[December 2, 2025] ITBench is now available on Kaggle! IBM has partnered with Kaggle to launch new AI leaderboards for enterprise tasks, including ITBench. Read more.
[November 30, 2025] A big shoutout to @phylisscity, @preespp, @tylrnguyen, @VincentCCandela, and @RMalone8 from Boston University for their contributions to ITBench, and to @Red-GV for mentoring them!
[September 18, 2025] Our paper STRATUS: A Multi-agent System for Autonomous Reliability Engineering of Modern Clouds was accepted at NeurIPS 2025. Read the paper.
[July 17, 2025] ITBench was presented as an oral at ICML 2025 (Oral 6A: Applications in Agents and Coding) in Vancouver. View the talk.
[June 13, 2025] Identified 25+ additional scenarios to be developed over the summer.
[May 2, 2025] 🚀 ITBench now provides fully-managed scenario environments for everyone! Our platform handles the complete workflow, from scenario deployment to agent evaluation and leaderboard updates. See our GitHub repository for guidelines and get started today.
[February 28, 2025] 🏆 Limited Access Beta: Invite-only access to the ITBench hosted scenario environments. ITBench handles scenario deployment, agent evaluation, and leaderboard updates. To request access, e-mail us here.
[February 7, 2025] 🎉 Initial release! Includes research paper, self-hosted environment setup tooling, sample scenarios, and baseline agents.

Overview

ITBench measures the performance of AI agents on a wide variety of complex, real-world-inspired IT automation tasks, targeting three key use cases:

Use Case	Focus Area
SRE (Site Reliability Engineering)	Availability and resiliency
CISO (Compliance & Security Operations)	Compliance and security enforcement
FinOps (Financial Operations)	Cost efficiencies and ROI optimization

Key Features

Real-world representation of IT environments and incident scenarios
Open, extensible framework with comprehensive IT coverage
Push-button workflows and interpretable metrics
Kubernetes-based scenario environments

What's Included

ITBench enables researchers and developers to replicate real-world incidents in Kubernetes environments and develop AI agents to address them.

We provide:

Push-button deployment tooling for environment setup (open-source)
Framework for recreating realistic IT scenarios using the deployment tooling:
- 6 SRE scenarios and 21 mechanisms (open-source)
- 4 categories of CISO scenarios (open-source)
- 1 FinOps scenario (open-source)
Two reference AI agents:
- SRE (Site Reliability Engineering) Agent (open-source)
- CISO (Chief Information Security Officer) Agent (open-source)
Fully-managed leaderboard for agent evaluation and comparison

Roadmap

Timeline	Key Deliverables
July 2025	• Refactor into a scenario specification generator and runner, so that most (if not all) mechanisms can be reused across diverse applications and microservices • Implementation of 10 of the additional scenarios identified
August 2025	• SRE-Agent-Lite: Lightweight agent to assist non-systems personnel with environment debugging • Snapshot & Replay: Data capture and replay capabilities • Implementation of 15 of the additional scenarios to be developed over the summer
Fall 2025	BYOA (Bring Your Own Application): Support for custom application integration

Leaderboard

The ITBench Leaderboard tracks agent performance across SRE, FinOps, and CISO scenarios. We provide fully managed scenario environments while researchers/developers run their agents on their own systems and submit their outputs for evaluation.

Domain	Leaderboard
SRE	View SRE Leaderboard
CISO	View CISO Leaderboard

Get Started: Visit docs/leaderboard.md for access and evaluation guidelines.

Scenarios

ITBench incorporates a collection of problems that we call scenarios. Each scenario is deployed in an operational environment where specific problems occur.

Examples of Scenarios

SRE: Resolve "High error rate on service checkout" in a Kubernetes environment
CISO: Assess compliance posture for "new control rule detected for RHEL 9"
FinOps: Identify and resolve cost overruns and anomalies

Find all scenarios: Scenarios repository
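
As an illustration of how scenarios could be organized in an agent harness, the sketch below models the three examples above as plain records grouped by domain. The `Scenario` class, its field names, and the `by_domain` helper are hypothetical and are not part of the ITBench tooling; the actual scenario definitions live in the Scenarios repository.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """Hypothetical record for one benchmark scenario (not ITBench's schema)."""
    domain: str  # "SRE", "CISO", or "FinOps"
    task: str    # natural-language description of the problem


# The three example scenarios listed in this README.
EXAMPLES = [
    Scenario("SRE", 'Resolve "High error rate on service checkout" in a Kubernetes environment'),
    Scenario("CISO", 'Assess compliance posture for "new control rule detected for RHEL 9"'),
    Scenario("FinOps", "Identify and resolve cost overruns and anomalies"),
]


def by_domain(scenarios):
    """Group scenarios by domain, preserving input order within each group."""
    groups = defaultdict(list)
    for scenario in scenarios:
        groups[scenario.domain].append(scenario)
    return dict(groups)


if __name__ == "__main__":
    for domain, items in by_domain(EXAMPLES).items():
        print(domain, len(items))
```

Keeping the record immutable (`frozen=True`) makes it safe to share one scenario list across several agent runs.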

Agents

ITBench ships with two open-source baseline agents, both built using the CrewAI framework.

Agent Features

Configurable LLMs: Support for watsonx, Azure, or vLLM
Natural language tools: Interactions with the environment for information gathering
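
A minimal sketch of what provider-configurable LLM selection might look like. The environment variable names (`LLM_PROVIDER`, `LLM_MODEL`), the `LLMConfig` class, and the `load_llm_config` function are assumptions made for illustration only; consult each agent's repository for its actual configuration.

```python
import os
from dataclasses import dataclass

# Providers named in this README; the lowercase identifiers are an assumption.
SUPPORTED_PROVIDERS = ("watsonx", "azure", "vllm")


@dataclass(frozen=True)
class LLMConfig:
    """Hypothetical settings object for choosing an LLM backend."""
    provider: str
    model: str


def load_llm_config(env=None):
    """Read provider and model from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    provider = env.get("LLM_PROVIDER", "").strip().lower()
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(
            f"LLM_PROVIDER must be one of {SUPPORTED_PROVIDERS}, got {provider!r}"
        )
    model = env.get("LLM_MODEL", "").strip()
    if not model:
        raise ValueError("LLM_MODEL must be set")
    return LLMConfig(provider=provider, model=model)
```

Validating the provider up front surfaces a misconfiguration before an agent run starts rather than partway through a scenario.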

Available Agents

Agent	Repository
SRE Agent	itbench-sre-agent
CISO Agent	itbench-ciso-caa-agent

How to Cite

@misc{jha2025itbench,
      title={ITBench: Evaluating AI Agents across Diverse Real-World IT Automation Tasks},
      author={Jha, Saurabh and Arora, Rohan and Watanabe, Yuji and others},
      year={2025},
      url={https://github.com/IBM/itbench-sample-scenarios/blob/main/it_bench_arxiv.pdf}
}

Join the Discussion

Have questions or need help getting started with ITBench?

Create a GitHub issue for bug reports or feature requests
Join our Discord community for real-time discussions
For formal inquiries, see the Contacts section below

Contacts

General inquiries: agent-bench-automation@ibm.com
Saurabh Jha: saurabh.jha@ibm.com
Yuji Watanabe: muew@jp.ibm.com
