@JasonWei05 JasonWei05 commented Aug 28, 2025

This PR integrates Terminal Bench’s Terminus 1 agent into rLLM for reproducible terminal task evaluation. Requirements: Python ≥ 3.12, the terminal-bench package (pip install terminal-bench), and the OPENAI_API_KEY environment variable set. Run the example with:

python examples/terminal/run_terminal.py
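
For anyone trying this locally, here is a minimal preflight sketch that checks the three requirements listed above before launching the example. It is not part of this PR; the function name check_prerequisites and its parameters are hypothetical, and the module name terminal_bench is taken from the import shown in the diff.

```python
import importlib.util
import os
import sys

MIN_PYTHON = (3, 12)  # minimum version stated in the PR description


def check_prerequisites(env=None, version=None, has_terminal_bench=None):
    """Return a list of problems; an empty list means the example can be run.

    Hypothetical helper: all three arguments are optional overrides so the
    check can be exercised without touching the real interpreter state.
    """
    env = os.environ if env is None else env
    version = sys.version_info[:2] if version is None else tuple(version[:2])
    if has_terminal_bench is None:
        # find_spec returns None when a top-level module is not installed.
        has_terminal_bench = importlib.util.find_spec("terminal_bench") is not None

    problems = []
    if version < MIN_PYTHON:
        problems.append(f"Python {version[0]}.{version[1]} is older than 3.12")
    if not has_terminal_bench:
        problems.append("terminal-bench is not installed (pip install terminal-bench)")
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    return problems
```

Calling check_prerequisites() with no arguments inspects the current interpreter and environment; printing the returned list before running the example gives a quick read on what is missing.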

Edited files (key changes)

  • rllm/rllm/environments/terminal/terminal_terminus.py

    • Adds TerminalTerminusEnv that bridges rLLM and TB.
    • Spins up Docker containers via TB Terminal and DockerComposeManager, and creates a TmuxSession for the agent.
    • Builds the initial prompt from TB’s Terminus prompt template; executes command batches; triggers unit tests on completion or episode cap; parses results via TB ParserFactory.
    • Records asciinema markers when logging_dir is provided; cleans up containers on close.
  • rllm/rllm/agents/terminal_terminus_agent.py

    • TerminalTerminusAgent: thin, state‑holding agent that maintains alternating user/assistant messages and mirrors raw model responses into Action objects.
    • Terminal‑specific logic is kept out of the agent and handled by the environment.
  • rllm/examples/terminal/prepare_terminal_data.py

    • Converts TB datasets into minimal rLLM task dicts: task_path, task_id, instruction.
    • Defers parser/timeouts/shell settings to TB’s task config (loaded by the env).
  • rllm/examples/terminal/run_terminal.py

    • Runnable example that wires TerminalTerminusAgent, TerminalTerminusEnv, and TerminalLiteLLMEngine through AgentWorkflowEngine and TerminalWorkflow.
    • Loads tasks from TB registry and prints final accuracy.
  • rllm/rllm/workflows/terminal_workflow.py

    • Workflow that runs multi‑step interactions with Terminus; evaluates tests on completion or error and surfaces a consistent termination.
  • rllm/workflows/workflow.py

    • Minor edit to pass uid to the environment's reset function.
  • rllm/rllm/integrations/terminal_terminus_1.py

    • RLLMTerminus: thin wrapper subclass over TB’s Terminus exposing private functions and variables.
    • TerminalLiteLLMEngine: minimal rollout engine delegating completions to TB’s Chat + LiteLLM and Terminus’ _handle_llm_interaction wrapper.
    • Returns ModelOutput with assistant JSON text and token counts.
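
To illustrate the "thin, state-holding agent" idea from the terminal_terminus_agent.py entry above, here is a rough, self-contained sketch. It is not the code in this PR: Action is a hypothetical stand-in for rLLM's Action type, ThinTerminalAgent and its method names are made up for illustration, and the strict alternation check is an assumption about how one might enforce the user/assistant ordering.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    """Hypothetical stand-in for rLLM's Action: carries the raw model text."""
    action: str


@dataclass
class ThinTerminalAgent:
    """Illustrative agent that only tracks chat history (assumed design)."""
    messages: list[dict[str, str]] = field(default_factory=list)

    def reset(self) -> None:
        # Start a fresh episode with an empty history.
        self.messages.clear()

    def _append(self, role: str, content: str) -> None:
        # Assumption: turns must strictly alternate between user and assistant.
        if self.messages and self.messages[-1]["role"] == role:
            raise ValueError(f"two consecutive {role!r} messages")
        self.messages.append({"role": role, "content": content})

    def update_from_env(self, observation: str) -> None:
        # Environment output (initial prompt or terminal output) is a user turn.
        self._append("user", observation)

    def update_from_model(self, response: str) -> Action:
        # Mirror the raw model response into an Action without parsing it;
        # interpreting the commands is left to the environment.
        self._append("assistant", response)
        return Action(action=response)
```

Keeping the agent free of parsing means the environment stays the single place that knows about command batches and test execution, which matches the split described in the list above.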

from terminal_bench.llms.chat import Chat


class TerminalLiteLLMEngine(RolloutEngine):
Contributor

Since this engine exists solely for TerminalBench, can you move it to /integrations/terminal_terminus_1.py?

Author

Done!

@JasonWei05 JasonWei05 closed this Sep 1, 2025