Terminal Bench Integration into rLLM #202
Closed
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This PR integrates Terminal Bench’s Terminus 1 agent into rLLM for reproducible terminal task evaluation. Requirements: Python ≥ 3.12,
pip install terminal-bench
, andOPENAI_API_KEY
set. Run the example with:Edited files (key changes)
rllm/rllm/environments/terminal/terminal_terminus.py
TerminalTerminusEnv
that bridges rLLM and TB.Terminal
andDockerComposeManager
, and creates aTmuxSession
for the agent.ParserFactory
.logging_dir
is provided; cleans up containers on close.rllm/rllm/agents/terminal_terminus_agent.py
TerminalTerminusAgent
: thin, state‑holding agent that maintains alternating user/assistant messages and mirrors raw model responses intoAction
objects.rllm/examples/terminal/prepare_terminal_data.py
task_path
,task_id
,instruction
.rllm/examples/terminal/run_terminal.py
TerminalTerminusAgent
,TerminalTerminusEnv
, andTerminalLiteLLMEngine
throughAgentWorkflowEngine
andTerminalWorkflow
.rllm/rllm/workflows/terminal_workflow.py
rllm/workflows/workflow.py
rllm/rllm/integrations/terminal_terminus_1.py
RLLMTerminus
: thin wrapper subclass over TB’sTerminus
exposing private functions and variables.TerminalLiteLLMEngine
: minimal rollout engine delegating completions to TB’sChat
+LiteLLM
and Terminus’_handle_llm_interaction
wrapper.ModelOutput
with assistant JSON text and token counts.