Tentalis

ADHR meta-RL framework built on top of OpenRLHF/OpenClaw-RL. Agents learn continuously from manager feedback — like performance reviews that actually improve performance.

Architecture

Two-Layer Architecture (control plane / data plane split):

Orchestration Layer (NATS): Agent coordination, task routing, manager feedback, meta-RL signals
Training Layer (OpenRLHF): Production GRPO/DAPO training via Ray + vLLM + DeepSpeed
Control Plane: OpenClaw (identity, memory, channels, UI)
Inference: InferenceClient protocol — Ollama (dev) or OpenAI-compatible (vLLM/Semantic Router)
CLI: tentalis init|train|serve|status|experiment|benchmark (Typer + Rich)
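
The InferenceClient protocol is what lets workers and scorers stay independent of the inference backend. A minimal sketch of how such a protocol and factory could look is shown here. The `generate` method name, the `FakeInferenceClient` class, and the `make_client` helper are hypothetical illustrations, not the actual signatures in src/inference/client.py.

```python
import asyncio
from typing import Protocol, runtime_checkable


@runtime_checkable
class InferenceClient(Protocol):
    """Hypothetical shape of the protocol; the real method names may differ."""

    async def generate(self, prompt: str) -> str: ...


class FakeInferenceClient:
    """Stand-in backend that satisfies the protocol structurally (no network)."""

    def __init__(self, backend: str) -> None:
        self.backend = backend

    async def generate(self, prompt: str) -> str:
        return f"[{self.backend}] {prompt}"


def make_client(backend: str = "ollama") -> InferenceClient:
    """Pick a client by backend name, mirroring the INFERENCE_BACKEND values."""
    if backend not in ("ollama", "openai"):
        raise ValueError(f"unknown inference backend: {backend!r}")
    return FakeInferenceClient(backend)


client = make_client("ollama")
reply = asyncio.run(client.generate("2 + 2 = ?"))
```

Because the protocol is structural, any class with a matching `generate` coroutine can be swapped in, which is also what makes mocked clients convenient in tests.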

Language

Python only. Requires Python >= 3.10. All components — RL training, agent logic, NATS clients, PRM evaluator — are Python.

Current Dependencies (pyproject.toml)

nats-py — NATS client for event bus
pydantic — event type serialization (v2)
ollama — LLM inference via Ollama (async client)
typer — CLI framework
rich — CLI output formatting

Dev: pytest, pytest-asyncio, pytest-aiohttp, ruff

Optional extras:

pip install -e ".[training]" — torch, transformers, peft
pip install -e ".[inference]" — openai, httpx (for OpenAI-compatible servers / vLLM)
pip install -e ".[bridge]" — aiohttp (Bridge HTTP API for OpenClaw integration)
pip install -e ".[vllm]" — vLLM (GPU inference server)
pip install -e ".[openrlhf]" — OpenRLHF (production GRPO training)
pip install -e ".[intercept]" — fastapi, uvicorn, httpx (Intercept Proxy)
pip install -e ".[skills]" — sentence-transformers (SkillRL embedding retrieval)
pip install -e ".[tinker]" — Tinker SDK (cloud-managed RL training)
pip install -e ".[alignment]" — streamlit (alignment experiment dashboard)

Directory Structure

src/
├── __main__.py    # Demo entry point (python -m src) — uses InferenceClient factory
├── cli.py         # CLI entry point (tentalis init|train|serve|status|experiment|benchmark)
├── config.py      # Frozen dataclass with env var defaults (NATS, inference, training)
├── manager/
│   └── manager.py # Manager agent (assign tasks, wait for results, publish feedback)
├── workers/
│   ├── base.py              # BaseWorker ABC (subscribe, handle, process, model update hook, environment_type)
│   ├── echo_worker.py       # EchoWorker — echoes prompt back with PRM steps (testing)
│   ├── llm_worker.py        # LLMWorker — LLM inference via InferenceClient with step parsing
│   ├── terminal_worker.py   # TerminalWorker — Docker-based bash execution
│   ├── swe_worker.py        # SWEWorker — GitHub issue → plan/implement/test pipeline
│   └── gui_worker.py        # GUIWorker — screenshot + action pairs for GUI automation
├── events/
│   ├── types.py   # Pydantic v2 event models (TaskEvent, ResultEvent, FeedbackEvent, etc.)
│   ├── topics.py  # Topic constants and helpers
│   └── bus.py     # EventBus wrapping nats-py (connect, publish, subscribe, drain)
├── inference/
│   ├── client.py            # InferenceClient protocol + OllamaInferenceClient + OpenAIInferenceClient
│   ├── vllm_lora.py         # VLLMLoRAManager — dynamic LoRA hot-swap via vLLM admin endpoints
│   └── adapter_registry.py  # PerWorkerAdapterRegistry — per-worker LoRA adapter management
├── rewards/
│   ├── scorer.py            # StepScorer protocol + LLMJudgeScorer (LLM-as-judge PRM)
│   ├── prompts.py           # STEP_JUDGE_PROMPT template for step-level evaluation
│   ├── prm_evaluator.py     # PRMEvaluator — subscribes to results, scores steps, publishes rollouts
│   ├── combined_scorer.py   # CombinedScorer — multi-scorer composition with per-environment weights
│   ├── halugate_scorer.py   # HaluGateScorer — hallucination detection via claim extraction + NLI
│   ├── trained_prm.py       # RewardHead + TrainedPRM + TrainedPRMScorer (frozen LLM + learned head)
│   └── prm_trainer.py       # PRMTrainer — trains RewardHead on TrajectoryStore data
├── training/
│   ├── bridge.py              # RolloutBuffer + NATSTrainingBridge (batch rollouts for RL trainer)
│   ├── grpo.py                # GRPO math: advantages, clipped_surrogate, asymmetric_clip, combined_loss, kl_penalty, multi_loss
│   ├── trainer.py             # Trainer protocol, TrainStepResult, MockTrainer, GRPOTrainer (LoRA), DAPOTrainer
│   ├── combined_trainer.py    # CombinedTrainer — merged RL + OPD distillation loss
│   ├── meta_trainer.py        # ManagerMetaTrainer — outer-loop RL for manager feedback quality
│   ├── loop.py                # TrainingLoop orchestrator (bridge → trainer → ModelUpdateEvent, combined support)
│   ├── scheduler.py           # TrainingScheduler — time-window gated training (buffers outside hours)
│   ├── tinker_backend.py      # TinkerBackend — Trainer protocol adapter for Tinker cloud training
│   ├── openrlhf_launcher.py   # OpenRLHFLauncher — subprocess launcher (legacy, kept for direct CLI usage)
│   ├── openrlhf_backend.py    # OpenRLHFBackend — Trainer protocol adapter for Ray+vLLM+DeepSpeed training
│   ├── trajectory_store.py    # TrajectoryStore — SQLite-backed scored trajectory persistence
│   ├── dapo.py                # DAPO utilities — dynamic_sample_filter, entropy_bonus, dapo_loss
│   ├── cispo.py               # CISPO contrastive loss — margin-based + InfoNCE + pair building
│   └── cispo_trainer.py       # CISPOTrainer — GRPO + contrastive trajectory loss
├── intercept/
│   ├── __main__.py          # Entrypoint: python -m src.intercept
│   ├── proxy.py             # InterceptProxy — session-stateful FastAPI proxy with skill injection
│   └── session_manager.py   # SessionManager — tracks active sessions with conversation history
├── opd/
│   ├── hint_extractor.py    # HintExtractor — feedback → OPD hint + teacher logprobs
│   └── rollout_builder.py   # CombinedRolloutBuilder — joins RL + OPD by task_id with timeout
├── bridge/
│   ├── __main__.py    # Entrypoint: python -m src.bridge
│   ├── service.py     # BridgeService — connects HTTP API to NATS event bus
│   └── http_api.py    # HTTP endpoints for OpenClaw agents (assign, result, feedback, status, health)
├── alignment/
│   ├── __init__.py
│   ├── scenarios.py         # AlignmentScenario dataclass + 4 scenario sets (~40 scenarios)
│   ├── behavioral_eval.py   # BehavioralEvaluator protocol, PatternBased + LLMJudge evaluators, harness
│   ├── hackable_scorer.py   # HackableScorer (StepScorer) + RewardHackingDetector
│   ├── misaligned_worker.py # MisalignedWorker (BaseWorker) — keyword stuffing/confidence/shortcut
│   ├── collusion_detector.py # CollusionDetector — Pearson correlation + Jaccard similarity
│   ├── audit_logger.py      # AuditLogger — subscribe_raw to all topics, write JSONL
│   ├── runner.py            # ExperimentRunner — orchestrates all 6 experiments
│   └── dashboard/
│       ├── __init__.py
│       └── app.py           # Streamlit dashboard for results + audit + constitution editor
├── benchmarks/
│   ├── __init__.py
│   ├── datasets.py          # BenchmarkDataset — JSONL loader for GSM8K, MATH, HumanEval
│   ├── evaluator.py         # BenchmarkEvaluator — answer extraction + correctness checking
│   └── runner.py            # BenchmarkRunner — orchestrates evaluation + writes results
├── skills/
│   ├── __init__.py
│   ├── store.py             # SkillStore — SQLite-backed CRUD with embedding persistence
│   ├── retriever.py         # SkillRetriever — embedding-based semantic search (SentenceTransformer)
│   └── evolver.py           # SkillEvolver — subscribes to feedback, extracts skills via LLM
├── setup_wizard.py          # Interactive Rich setup wizard for first-time config
└── services/
    ├── __main__.py    # Entrypoint: python -m src.services.training
    └── training.py    # Standalone training service (PRM Evaluator + Training Loop)
config/
└── openclaw/
    ├── AGENTS.md              # Agent registry (manager-01 + worker-01)
    ├── manager/
    │   ├── SOUL.md            # Manager behavior: decompose, assign, evaluate, score
    │   └── IDENTITY.md        # Manager identity (name, role, bio)
    ├── worker/
    │   ├── SOUL.md            # Worker behavior: step-by-step solving with <step>/<answer>
    │   └── IDENTITY.md        # Worker identity
    └── skills/
        ├── assign-task/SKILL.md     # exec: curl POST bridge:8100/tasks/assign
        ├── submit-result/SKILL.md   # exec: curl POST bridge:8100/tasks/result
        └── submit-feedback/SKILL.md # exec: curl POST bridge:8100/feedback
tests/
├── bridge/
│   ├── test_http_api.py # Bridge HTTP endpoint unit tests (mocked NATS)
│   └── test_service.py  # Bridge integration test (requires NATS)
├── events/
│   ├── test_types.py  # Serialization roundtrip tests (standalone)
│   └── test_bus.py    # EventBus pub/sub tests (requires NATS)
├── training/
│   ├── test_bridge.py        # RolloutBuffer unit tests
│   ├── test_grpo.py          # GRPO advantage math + torch loss/KL tests
│   ├── test_trainer.py       # MockTrainer + GRPOTrainer protocol/integration tests
│   ├── test_combined_trainer.py  # CombinedTrainer RL+OPD tests
│   ├── test_meta_trainer.py      # ManagerMetaTrainer tests
│   ├── test_tinker_backend.py    # TinkerBackend protocol conformance + mocked SDK tests
│   └── test_scheduler.py         # TrainingScheduler time-window + buffering tests
├── inference/
│   ├── test_client.py         # InferenceClient protocol + adapter tests (mocked)
│   ├── test_vllm_lora.py      # VLLMLoRAManager tests (mocked httpx)
│   └── test_adapter_registry.py  # PerWorkerAdapterRegistry tests
├── rewards/
│   ├── test_scorer.py         # LLMJudgeScorer tests (mocked InferenceClient)
│   ├── test_prm_evaluator.py  # PRMEvaluator tests (mocked scorer)
│   └── test_combined_scorer.py  # CombinedScorer per-environment weight tests
├── opd/
│   ├── test_hint_extractor.py   # HintExtractor tests (mocked client + bus)
│   └── test_rollout_builder.py  # CombinedRolloutBuilder join + timeout tests
├── intercept/
│   └── test_proxy.py          # Intercept proxy tests (mocked backend, requires fastapi)
├── skills/
│   ├── test_store.py          # SkillStore CRUD + SQLite tests
│   ├── test_retriever.py      # SkillRetriever cosine similarity + embedding retrieval tests
│   └── test_evolver.py        # SkillEvolver feedback → skill extraction tests
├── alignment/
│   ├── test_scenarios.py        # Scenario validation tests (fields, uniqueness, counts)
│   ├── test_behavioral_eval.py  # PatternBasedEvaluator, LLMJudgeEvaluator, harness tests
│   ├── test_hackable_scorer.py  # HackableScorer + RewardHackingDetector tests
│   ├── test_misaligned_worker.py # MisalignedWorker strategy tests
│   ├── test_collusion_detector.py # Pearson/Jaccard/CollusionDetector tests
│   ├── test_audit_logger.py     # AuditLogger JSONL + event detection tests
│   └── test_runner.py           # ExperimentRunner integration tests (mock mode)
├── workers/
│   ├── test_model_reload.py   # Worker model update subscription + reload tests
│   └── test_multi_env.py      # Terminal/SWE/GUI worker tests + target_worker_id filtering
└── test_integration.py # Full manager→worker→PRM→rollout loop (requires NATS)
docker-compose.yml     # 6 services: NATS, Ollama, OpenClaw, Bridge, Intercept Proxy, Training
Dockerfile             # Python 3.10 base for training service
Dockerfile.bridge      # Python 3.10 base for bridge service
Dockerfile.intercept   # Python 3.10 base for intercept proxy
scripts/
└── demo.sh            # One-command Docker Compose demo startup
docs/
└── architecture/  # Diagrams and ADRs
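
The comment on src/training/grpo.py lists group-relative advantages and an asymmetric clip. A minimal pure-Python sketch of these two ideas, as they are commonly defined for GRPO and DAPO, might look like the following. The function names, the lower clip bound of 0.2, and the epsilon are assumptions; 0.28 matches the TRAINING_CLIP_EPSILON_HIGH default. The repository's torch implementation may differ.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def asymmetric_clipped_surrogate(
    ratio: float,
    advantage: float,
    eps_low: float = 0.2,   # assumed lower clip bound
    eps_high: float = 0.28,  # TRAINING_CLIP_EPSILON_HIGH default
) -> float:
    """PPO-style clipped objective with separate lower and upper clip bounds."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Taking the minimum keeps the pessimistic (lower) estimate of the objective.
    return min(ratio * advantage, clipped * advantage)


advantages = group_advantages([1.0, 0.0, 1.0, 0.0])
```

The wider upper bound lets low-probability tokens with positive advantage grow more before clipping takes effect, while the lower bound still limits how fast probabilities shrink.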

Code Style

PEP 8
Type hints on all function signatures
Docstrings on public APIs only (not internal helpers)
No unnecessary abstractions — keep it simple

Testing

Framework: pytest + pytest-asyncio (asyncio_mode = "auto")
tests/ mirrors src/ structure (e.g., tests/events/ tests src/events/)
Standalone (no NATS/Ollama): tests/events/test_types.py, tests/rewards/, tests/training/, tests/inference/, tests/workers/, tests/bridge/test_http_api.py, tests/opd/, tests/intercept/, tests/skills/, tests/alignment/
Requires NATS: tests/events/test_bus.py, tests/test_integration.py, tests/bridge/test_service.py
Requires optional deps: tests/intercept/ (fastapi), tests/bridge/test_http_api.py (aiohttp)
Mock strategy: scorer/evaluator tests mock InferenceClient; bridge tests mock EventBus; OPD tests mock bus+client; integration tests use EchoWorker + mock scorer
Run standalone: pytest tests/ -v --ignore=tests/bridge --ignore=tests/intercept (331 collected, ~295 pass, ~36 skip without torch/NATS)
Run standalone (skip slow torch tests): pytest tests/training/ -v -k "not slow"
Run all: pytest tests/ -v (with NATS running + optional deps)
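
The mock strategy above can be illustrated with a small sketch. It assumes a hypothetical `score_step` coroutine and a hypothetical `generate` method on the client; neither is taken from the repository. With asyncio_mode = "auto" the test could be written as a plain `async def` test, and `asyncio.run` is used here only so the snippet runs outside pytest.

```python
import asyncio
from unittest.mock import AsyncMock


async def score_step(client, step: str) -> float:
    """Hypothetical judge call: ask the LLM for a score and clamp it to [0, 1]."""
    raw = await client.generate(f"Rate this reasoning step from 0 to 1: {step}")
    return max(0.0, min(1.0, float(raw.strip())))


def test_score_step_uses_mocked_client() -> None:
    client = AsyncMock()  # stands in for an InferenceClient, no Ollama needed
    client.generate.return_value = " 0.8 \n"
    score = asyncio.run(score_step(client, "x = 2"))
    assert score == 0.8
    client.generate.assert_awaited_once()


test_score_step_uses_mocked_client()
```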

Configuration (env vars)

Variable	Default	Description
INFERENCE_BACKEND	ollama	`"ollama"` or `"openai"` (for vLLM/Semantic Router)
INFERENCE_BASE_URL	(auto)	Base URL for inference server
INFERENCE_API_KEY	(empty)	API key for inference server
VLLM_LORA_NAME	default	LoRA adapter name for vLLM
TRAINER_BACKEND	standalone	`"standalone"` (GRPOTrainer) or `"openrlhf"` (OpenRLHFBackend)
BRIDGE_PORT	8100	Bridge HTTP API port
OPENCLAW_GATEWAY_URL	ws://localhost:18789	OpenClaw gateway WebSocket URL
INTERCEPT_ENABLED	false	Enable intercept proxy
INTERCEPT_PORT	8200	Intercept proxy port
INTERCEPT_BACKEND_URL	http://localhost:11434	Backend URL for intercept proxy
OPD_MODE	lightweight	`"lightweight"` (LLM hints) or `"openclaw"` (vLLM logprobs)
OPD_TEACHER_MODEL	qwen2.5:1.5b	Teacher model for OPD hint extraction
OPD_JOIN_TIMEOUT	30.0	Timeout (seconds) for joining RL + OPD rollouts
OPD_WEIGHT	0.3	Weight for OPD loss in combined training
RL_WEIGHT	0.7	Weight for RL loss in combined training
TRAINING_CLIP_EPSILON_HIGH	0.28	Asymmetric high clip bound
META_RL_ENABLED	false	Enable manager meta-RL training
META_RL_WINDOW_SIZE	200	Sliding window for meta-RL score tracking
META_RL_MIN_FEEDBACK	200	Min feedback events before meta-training
PRM_NUM_VOTES	3	Number of parallel LLM judge evaluations (majority voting)
SKILLS_ENABLED	false	Enable SkillRL skill injection
SKILLS_DIR	skills_data	Directory for skill SQLite database
SKILL_EVOLUTION_THRESHOLD	0.4	Score threshold below which skills are extracted
SKILL_RETRIEVAL_TOP_K	3	Number of skills to inject per task
TINKER_API_KEY	(empty)	API key for Tinker cloud training
TINKER_BASE_URL	https://api.tinker.thinkingmachines.ai	Tinker API base URL
TRAINING_SCHEDULE_ENABLED	false	Enable time-window gated training
TRAINING_SCHEDULE_HOURS	02:00-06:00	UTC training window (HH:MM-HH:MM)
ALIGNMENT_ENABLED	false	Enable alignment experiment infrastructure
ALIGNMENT_RESULTS_DIR	alignment_results	Directory for experiment result JSON files
ALIGNMENT_AUDIT_ALL	false	Enable full NATS event audit logging
TRAJECTORY_STORE_ENABLED	false	Enable SQLite trajectory persistence
TRAJECTORY_STORE_PATH	trajectory_data/trajectories.db	Path to trajectory SQLite database
HALUGATE_ENABLED	false	Enable HaluGate hallucination scorer
HALUGATE_MODEL	qwen2.5:1.5b	Model for HaluGate claim extraction/verification
CISPO_ENABLED	false	Enable CISPO contrastive loss
CISPO_WEIGHT	0.2	Weight for contrastive loss term
CISPO_MARGIN	0.5	Margin for contrastive trajectory loss
DAPO_ENTROPY_BETA	0.01	Entropy bonus coefficient for DAPO
DAPO_MIN_REWARD_THRESHOLD	0.1	Min reward for DAPO dynamic sampling filter
TRAINED_PRM_ENABLED	false	Enable trained PRM scorer
TRAINED_PRM_MODEL	Qwen/Qwen2.5-0.5B	Base model for trained PRM
TRAINED_PRM_CHECKPOINT	(empty)	Path to trained PRM reward head checkpoint
BENCHMARK_DATASET_DIR	benchmark_data	Directory containing benchmark JSONL files
BENCHMARK_RESULTS_DIR	benchmark_results	Directory for benchmark result JSON output
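
src/config.py is described as a frozen dataclass with env var defaults. A minimal sketch of that pattern for a handful of the variables above could look like this. The field names, the `_env` helper, and the way booleans are parsed are assumptions, not the repository's code.

```python
import os
from dataclasses import FrozenInstanceError, dataclass, field


def _env(name: str, default: str) -> str:
    """Hypothetical helper: read an env var, falling back to the documented default."""
    return os.environ.get(name, default)


@dataclass(frozen=True)
class Config:
    """Subset of settings; values are read from the environment at creation time."""

    inference_backend: str = field(
        default_factory=lambda: _env("INFERENCE_BACKEND", "ollama")
    )
    bridge_port: int = field(default_factory=lambda: int(_env("BRIDGE_PORT", "8100")))
    opd_weight: float = field(default_factory=lambda: float(_env("OPD_WEIGHT", "0.3")))
    rl_weight: float = field(default_factory=lambda: float(_env("RL_WEIGHT", "0.7")))
    skills_enabled: bool = field(
        # Assumed convention: only the literal string "true" enables a flag.
        default_factory=lambda: _env("SKILLS_ENABLED", "false").lower() == "true"
    )


config = Config()
```

Using `default_factory` means the environment is read when `Config()` is instantiated rather than at import time, and `frozen=True` makes accidental mutation raise `FrozenInstanceError`.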

Commit Format

Conventional commits:

feat: new feature
fix: bug fix
docs: documentation only
refactor: code restructuring
test: adding/updating tests
chore: maintenance tasks

Current Phase

Phase 9c complete — Advanced scorers, contrastive training, benchmarks.

Phase 8 additions:

CLI entry point (src/cli.py) — tentalis init|train|serve|status via Typer + Rich
OpenRLHF training backend (src/training/openrlhf_backend.py) — Trainer protocol adapter for Ray+vLLM+DeepSpeed
OpenClaw-RL OPD mode (src/opd/hint_extractor.py) — per-token logprob extraction from vLLM teacher models
Backend selection in src/services/training.py — standalone vs openrlhf
OPD_MODE config var — "lightweight" or "openclaw"
Docker Compose with commented OpenRLHF GPU trainer service
Honest README positioning — two-layer architecture, novel vs adopted components

Phase 7 (complete):

Intercept Proxy, OPD, CombinedScorer, Meta-RL, Adapter Registry, Multi-env Workers
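
CombinedScorer is described as composing several scorers with per-environment weights. One simple way to sketch that idea is a weighted average whose weights are looked up by the worker's environment_type. The scorer names, the environment names, the weights, and the `combine_scores` function are hypothetical.

```python
def combine_scores(
    scores: dict[str, float],
    weights_by_env: dict[str, dict[str, float]],
    environment_type: str,
) -> float:
    """Weighted average of scorer outputs using the weights for one environment."""
    weights = weights_by_env[environment_type]
    # Only scorers that actually produced a score contribute to the average.
    used = {name: w for name, w in weights.items() if name in scores}
    total = sum(used.values())
    if total == 0:
        raise ValueError("no weighted scorer produced a score")
    return sum(scores[name] * w for name, w in used.items()) / total


# Hypothetical weight table: one entry per environment_type.
WEIGHTS = {
    "chat": {"llm_judge": 0.5, "halugate": 0.5},
    "terminal": {"llm_judge": 1.0},
}
step_scores = {"llm_judge": 0.8, "halugate": 0.4}
chat_score = combine_scores(step_scores, WEIGHTS, "chat")
terminal_score = combine_scores(step_scores, WEIGHTS, "terminal")
```

Renormalizing by the weights that were actually used keeps the result in the same range when one scorer is disabled.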

Previous phases:

Phase 6: OpenClaw integration, Bridge Service, Docker Compose demo
Phase 5: InferenceClient protocol, weight hot-swap, OpenRLHF, Semantic Router readiness
Phase 4: Standalone GRPO trainer, LoRA fine-tuning, TrainingLoop
Phase 3: LLM workers (Ollama), PRM scoring (LLM-as-judge), training bridge
Phase 2: Event loop, Manager/Worker agents, EchoWorker
Phase 1: Scaffolding, docs, git
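
The LLM workers introduced in Phase 3 parse step-by-step output, and the worker SOUL.md mentions <step>/<answer> markers. A minimal sketch of such parsing with regular expressions follows. It assumes XML-style closing tags (`</step>`, `</answer>`), which the source does not confirm, and the `parse_steps` name is hypothetical.

```python
import re
from typing import Optional

# Assumed tag format: <step>...</step> blocks followed by one <answer>...</answer>.
STEP_RE = re.compile(r"<step>(.*?)</step>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def parse_steps(text: str) -> tuple[list[str], Optional[str]]:
    """Return the reasoning steps in order and the final answer, if present."""
    steps = [s.strip() for s in STEP_RE.findall(text)]
    match = ANSWER_RE.search(text)
    answer = match.group(1).strip() if match else None
    return steps, answer


steps, answer = parse_steps(
    "<step>2 + 2 = 4</step>\n<step>4 * 3 = 12</step>\n<answer>12</answer>"
)
```

Keeping the steps as a list is what allows a process reward model to score each step separately rather than only the final answer.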

Phase 9a (complete — MetaClaw adoption):

Majority Voting PRM (src/rewards/scorer.py) — parallel LLM judge evals with median aggregation
SkillRL (src/skills/) — skill store, embedding retriever, evolver from feedback, skill injection in workers/proxy
Tinker training backend (src/training/tinker_backend.py) — cloud-managed RL via Tinker SDK
Interactive setup wizard (src/setup_wizard.py) — Rich multi-step config wizard
Session-stateful intercept proxy (src/intercept/session_manager.py) — session tracking + skill injection
Training scheduler (src/training/scheduler.py) — time-window gated training
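
The time-window gating used by the training scheduler can be sketched as a check against a UTC window in the TRAINING_SCHEDULE_HOURS format (HH:MM-HH:MM). The function names, the half-open interval, and the handling of windows that wrap past midnight are assumptions.

```python
from datetime import datetime, time, timezone


def parse_window(spec: str) -> tuple[time, time]:
    """Parse 'HH:MM-HH:MM' into (start, end) times."""
    start_s, end_s = spec.split("-")
    return time.fromisoformat(start_s), time.fromisoformat(end_s)


def in_window(now: datetime, spec: str = "02:00-06:00") -> bool:
    """True if `now`, converted to UTC, falls inside the window [start, end)."""
    start, end = parse_window(spec)
    current = now.astimezone(timezone.utc).time()
    if start <= end:
        return start <= current < end
    # Window wraps past midnight, e.g. 22:00-02:00.
    return current >= start or current < end


inside = in_window(datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc))
outside = in_window(datetime(2024, 1, 1, 7, 0, tzinfo=timezone.utc))
```

A scheduler built on this check would buffer rollouts while `in_window` is false and release them to the trainer once the window opens.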

Phase 9b (complete — Alignment experiments):

Alignment scenario library (src/alignment/scenarios.py) — 40 scenarios across 4 categories
Behavioral eval harness (src/alignment/behavioral_eval.py) — PatternBased + LLMJudge evaluators
Hackable scorer (src/alignment/hackable_scorer.py) — deliberately weak scorer + divergence detector
Misaligned worker (src/alignment/misaligned_worker.py) — keyword stuffing, confidence inflation, shortcut
Collusion detector (src/alignment/collusion_detector.py) — Pearson + Jaccard cross-worker analysis
Audit logger (src/alignment/audit_logger.py) — full NATS event capture to JSONL
Experiment runner (src/alignment/runner.py) — 6 experiments (mock-mode, standalone)
Streamlit dashboard (src/alignment/dashboard/app.py) — results viewer + constitution editor
CLI experiment subcommand — tentalis experiment run|results
AlignmentEvalEvent + AuditLogEvent event types, subscribe_raw on EventBus
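
The collusion detector combines Pearson correlation with Jaccard similarity. The two measures themselves are standard, and a minimal pure-Python sketch is given here. What exactly gets compared (for example, two workers' score series and the token sets of their outputs) is an assumption, as are the function names.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # constant series: correlation is undefined, report none
    return cov / sqrt(vx * vy)


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the lowercase whitespace-token sets of two texts."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


score_corr = pearson([0.1, 0.5, 0.9], [0.2, 0.6, 1.0])
text_overlap = jaccard("a b c", "b c d")
```

High correlation between two workers' scores together with high overlap in their outputs is the kind of joint signal a cross-worker analysis could flag for review.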

Phase 9c (complete — Advanced Scorers + Benchmarks):

Trajectory store (src/training/trajectory_store.py) — SQLite-backed scored trajectory persistence
HaluGate scorer (src/rewards/halugate_scorer.py) — hallucination detection via claim extraction + NLI verification
CISPO contrastive loss (src/training/cispo.py, src/training/cispo_trainer.py) — margin-based + InfoNCE contrastive loss
DAPO graduation (src/training/dapo.py) — dynamic sampling filter + entropy bonus + DAPOTrainer
Trained PRM (src/rewards/trained_prm.py, src/rewards/prm_trainer.py) — frozen LLM + learned RewardHead scorer
Benchmark suite (src/benchmarks/) — GSM8K, MATH, HumanEval with answer extraction + CLI
CLI benchmark subcommand — tentalis benchmark run|results
14 new config vars for trajectory store, HaluGate, CISPO, DAPO, Trained PRM, benchmarks
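
The margin-based part of the contrastive loss and the pair building step can be sketched as a hinge loss over (better, worse) trajectory pairs. This is a generic formulation and not necessarily what src/training/cispo.py implements. The function names, the meaning of the score (for example, a trajectory log-likelihood under the policy), and the `min_gap` parameter are assumptions; 0.5 matches the CISPO_MARGIN default.

```python
def margin_contrastive_loss(
    pos_score: float, neg_score: float, margin: float = 0.5
) -> float:
    """Hinge loss: zero once the preferred trajectory leads by at least `margin`."""
    return max(0.0, margin - (pos_score - neg_score))


def build_pairs(rewards: list[float], min_gap: float = 0.0) -> list[tuple[int, int]]:
    """Index pairs (better, worse) for trajectories whose rewards differ by more than min_gap."""
    return [
        (i, j)
        for i, ri in enumerate(rewards)
        for j, rj in enumerate(rewards)
        if ri - rj > min_gap
    ]


pairs = build_pairs([0.9, 0.1, 0.5])
```

In a combined objective, the mean of this loss over all pairs would be scaled by CISPO_WEIGHT and added to the GRPO loss.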

Next: Priority 2 (signal richness) — Token OPD advantages, implicit signal extraction, session-aware classification.

Key Documents

EXPERIMENT.md — Alignment experiment tracking (6 experiments, hypotheses, metrics)
PLAN.md — Full technical research & architecture bible (papers, analysis, decisions)
LEARNING.md — Mistake/lesson tracking for autonomous decisions
RESEARCH-EXPERIMENT.md — Phase experiment records and findings
