Python implementation of Recursive Language Models for processing arbitrarily long contexts.
RLM is an inference strategy that lets language models recursively call themselves to handle unlimited input context length. Instead of feeding long text directly to the LLM (causing "context rot"), RLM stores context as a variable in a Python REPL environment.
```
Traditional LLM:  Context -> LLM -> Answer        (quality degrades with length)

RLM:              LLM  <->  REPL Environment
                             |-- context       (variable)
                             |-- llm_query()   (recursive calls)
                             |-- exec()        (run code)
                             v
                         FINAL(answer)
```
- Context as Environment: Long text stored in REPL variable, not in LLM prompt
- Code as Tool: LLM writes Python to explore the context (`context[:1000]`, `re.findall()`, etc.); see the sketch below
- Recursive Decomposition: Complex problems split via `llm_query()` sub-calls
- Self-Correction: LLM sees execution errors and fixes its own code
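For instance, a single repl block the model might write could look like the following. This is purely illustrative: only the `context` variable and the `llm_query()` helper come from the environment described above; the filtering logic is invented for the example.

```python
# Inspect the stored context without ever pasting it into the prompt.
print(len(context))          # how large is the input?
print(context[:500])         # peek at the beginning

# Ordinary Python as a universal tool: filter with the standard library.
engineering = [line for line in context.splitlines() if "Engineering" in line]
print(len(engineering), engineering[:3])

# Recursive decomposition: hand a small, focused slice to a fresh LLM call.
note = llm_query("Summarize the salary range in these records:\n" + "\n".join(engineering[:50]))
print(note)
```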
Installation:

```bash
# Clone the repository
git clone https://github.com/Ray0907/rlm.git
cd rlm

# Install with uv
uv sync

# Set up API key
cp .env.example .env
# Edit .env with your API key (OPENAI_API_KEY or ANTHROPIC_API_KEY)
```

Quick start:

```python
import rlm

result = rlm.run(
    query="What is the total salary of all Engineering employees?",
    context="""
    Employee Records:
    - Alice: Engineering, $75,000
    - Bob: Marketing, $65,000
    - Charlie: Engineering, $85,000
    """,
    model="anthropic/claude-sonnet-4-5-20250929",  # or "gpt-4o-mini"
)

print(result.answer)      # $160,000
print(result.iterations)  # Number of execute-observe cycles
```

How it works:

- Query sent to LLM with system prompt explaining REPL environment
- LLM writes code in ```repl blocks to explore context
- Code executed, output returned to LLM
- LLM iterates until it finds the answer
- `FINAL(answer)` signals completion (see the loop sketch below)
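A minimal sketch of this execute-observe loop, assuming LiteLLM as the model backend. The function names, regexes, and system prompt below are illustrative, not the package's internal implementation:

```python
import contextlib, io, re
from litellm import completion  # LiteLLM backend, as used by the package

REPL_BLOCK = re.compile(r"```repl\n(.*?)```", re.DOTALL)
FINAL = re.compile(r"FINAL\((.*)\)", re.DOTALL)

SYSTEM = (
    "You have a Python REPL with a variable `context` holding the full input and a "
    "helper `llm_query(prompt)` for recursive sub-calls. Write code inside ```repl "
    "blocks; I will execute it and show you the output. When finished, reply FINAL(answer)."
)

def llm(messages, model):
    # One chat completion; the real package also tracks tokens, retries, etc.
    return completion(model=model, messages=messages).choices[0].message.content

def rlm_loop(query, context, model="gpt-4o-mini", max_iterations=20):
    # The long context lives only in the REPL namespace, never in the prompt itself.
    namespace = {
        "context": context,
        "llm_query": lambda prompt: llm([{"role": "user", "content": prompt}], model),
    }
    messages = [{"role": "system", "content": SYSTEM},
                {"role": "user", "content": query}]
    for _ in range(max_iterations):
        reply = llm(messages, model)                 # model proposes code or a final answer
        done = FINAL.search(reply)
        if done:
            return done.group(1)                     # FINAL(answer) signals completion
        out = io.StringIO()
        for code in REPL_BLOCK.findall(reply):
            try:
                with contextlib.redirect_stdout(out):
                    exec(code, namespace)            # run the model's code against `context`
            except Exception as exc:                 # errors go back to the model (self-correction)
                out.write(f"Error: {exc}\n")
        messages += [{"role": "assistant", "content": reply},
                     {"role": "user", "content": out.getvalue() or "(no output)"}]
    return "No FINAL answer within the iteration budget"
```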
| Aspect | Traditional Agent | RLM |
|---|---|---|
| Problem Decomposition | Human predefines workflow | Model decides how to decompose |
| Tool Usage | Fixed tool set | Code as universal tool |
| Control Flow | Human-designed (ReAct) | Model-driven iteration |
| Flexibility | Limited to designed tools | Unlimited (any valid Python) |
Any model supported by LiteLLM:
```bash
# Anthropic (Recommended)
LITELLM_MODEL=anthropic/claude-sonnet-4-5-20250929 uv run python examples/simple_qa.py

# OpenAI
LITELLM_MODEL=gpt-4o-mini uv run python examples/simple_qa.py

# Local (Ollama)
LITELLM_MODEL=ollama/llama2 uv run python examples/simple_qa.py
```

Examples:

```bash
# Simple Q&A
uv run python examples/simple_qa.py
# Long document processing
uv run python examples/long_document.py
```

Comparing Haiku 4.5 + RLM vs Opus 4.5 Direct on an employee salary aggregation task:
| Records | Context | Expected | RLM Answer | Direct Answer | RLM Cost | Direct Cost | RLM Time | Direct Time | Winner |
|---|---|---|---|---|---|---|---|---|---|
| 100 | 24,390 chars | $1,105,000 | $1,105,000 | $145,000 | $0.0437 | $0.0410 | 43.9s | 3.3s | RLM |
| 200 | 48,835 chars | $2,822,000 | $2,822,000 | $145,000 | $0.0469 | $0.0791 | 53.8s | 3.8s | RLM |
Key Findings:
- RLM with a weak model (Haiku 4.5) achieves correct answers, while the strong model (Opus 4.5) called directly fails at these context lengths
- Cost is comparable or lower with RLM
- Trade-off: RLM takes longer due to multiple iterations
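A rough sketch of how such a head-to-head comparison can be reproduced. The record generator, prompts, and model identifiers below are illustrative assumptions, not the script behind the table above:

```python
import random
import time

import rlm
from litellm import completion

def make_records(n: int) -> str:
    # Synthetic employee records; the real benchmark data may differ.
    depts = ["Engineering", "Marketing", "Sales", "HR"]
    lines = [
        f"- Employee {i}: {random.choice(depts)}, ${random.randrange(60, 160) * 1000:,}"
        for i in range(n)
    ]
    return "Employee Records:\n" + "\n".join(lines)

query = "What is the total salary of all Engineering employees?"
context = make_records(200)

# RLM: a weaker model iterating over the context through the REPL.
t0 = time.time()
result = rlm.run(
    query=query,
    context=context,
    model="anthropic/claude-haiku-4-5",   # illustrative LiteLLM model id for Haiku 4.5
)
print("RLM:   ", result.answer, f"({time.time() - t0:.1f}s, {result.iterations} iterations)")

# Direct: a stronger model with the whole context stuffed into one prompt.
t0 = time.time()
direct = completion(
    model="anthropic/claude-opus-4-5",    # illustrative LiteLLM model id for Opus 4.5
    messages=[{"role": "user", "content": context + "\n\n" + query}],
)
print("Direct:", direct.choices[0].message.content, f"({time.time() - t0:.1f}s)")
```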
```python
def run(
    query: str,              # Question to answer
    context: str | list,     # Data to process (stored in REPL, not sent to LLM)
    model: str = "gpt-4o-mini",
    max_iterations: int = 20,
    verbose: bool = False,
) -> RLMResult
```

```python
@dataclass
class RLMResult:
    answer: str                  # Final answer
    iterations: int              # Number of iterations
    total_input_tokens: int      # Total input tokens used
    total_output_tokens: int     # Total output tokens used
    code_executions: int         # Number of code blocks executed
    sub_calls: int               # Number of llm_query() calls
    history: list[dict]          # Full execution history
```
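A short usage sketch of these fields (the input file name is hypothetical; parameter and field names follow the signature and dataclass above):

```python
import rlm

result = rlm.run(
    query="Which department has the highest average salary?",
    context=open("employees.txt").read(),   # hypothetical file holding the long records
    model="gpt-4o-mini",
    max_iterations=10,
    verbose=True,
)

print(result.answer)
print(f"{result.iterations} iterations, "
      f"{result.code_executions} code blocks executed, "
      f"{result.sub_calls} recursive llm_query() calls")
print(f"tokens: {result.total_input_tokens} in / {result.total_output_tokens} out")
```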
References:

- Paper: Recursive Language Models - arXiv:2512.24601
- Blog: RLM: Scalable LLM Inference - Author's implementation notes
License: MIT