A research framework for studying how LLM-based agents plan and complete long-horizon goals in text-based environments where NPCs may provide deceptive or manipulative information. The agent must decide which information to trust while collecting sigils and unlocking a vault; incorrect beliefs propagate across time and lead to failed plans.
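The core dynamic can be pictured with a minimal sketch. All names here (`TRUE_FACTS`, `npc_tell`, `lie_rate`) are hypothetical illustrations, not the framework's actual API; `lie_rate` corresponds to the LR axis in the result tables below:

```python
import random

# Minimal sketch of the deceptive-NPC setting: with probability `lie_rate`,
# an NPC's statement about a sigil's location is false.
TRUE_FACTS = {"sigil_of_dawn": "library", "sigil_of_dusk": "crypt"}
LOCATIONS = ("library", "crypt", "tower")

def npc_tell(fact: str, lie_rate: float) -> str:
    """Return the fact's true location, or a wrong one with probability lie_rate."""
    if random.random() < lie_rate:
        return random.choice([loc for loc in LOCATIONS if loc != TRUE_FACTS[fact]])
    return TRUE_FACTS[fact]

# A naive agent believes everything it is told, so at LR=0.5 roughly half of
# its beliefs are false, and every later plan step built on them fails.
beliefs = {fact: npc_tell(fact, lie_rate=0.5) for fact in TRUE_FACTS}
print(beliefs)
```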
- Belief-Tracking and Memory+Trust achieve 100% success across all deception levels (LR = 0.0–0.7) on GPT-OSS-120B in both the default and extended worlds, matching the oracle upper bound (a minimal trust-tracking sketch follows this list).
- **The Reflection Paradox:** Reflection-Enhanced (28%) performs worse than Naive (64%); adding reasoning decreases performance under resource pressure.
- **The extended world amplifies differentiation:** Naive drops from 64% to 33% while the trust-based variants remain at 100%, confirming robustness across environment complexity.
- Llama-4-Scout fails completely (0%) without structured hints despite producing valid JSON; this is a planning failure, not a formatting failure.
- **Payload hints as a diagnostic tool:** the hint ablation decomposes failures into planning vs. reasoning. Llama's jump from 0% to 93% on Belief-Tracking shows that planning, not deception reasoning, is the bottleneck.
- Planning capability, not reasoning capability, is the primary bottleneck for LLM agents in structured environments.
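The idea behind the trust-based variants can be sketched as a simple per-NPC trust score that is updated whenever a claim is verified and used to arbitrate conflicting claims. This is an illustrative sketch with hypothetical names, not the variants' actual implementation:

```python
from collections import defaultdict

# Minimal sketch of per-NPC trust tracking, in the spirit of the
# Belief-Tracking / Memory+Trust variants. All names are hypothetical.
trust: dict[str, float] = defaultdict(lambda: 0.5)  # neutral prior per NPC

def update_trust(npc: str, claim_was_true: bool, step: float = 0.3) -> None:
    """Nudge an NPC's trust toward 1 when a claim checks out, toward 0 otherwise."""
    target = 1.0 if claim_was_true else 0.0
    trust[npc] += step * (target - trust[npc])

def resolve(claims: dict[str, str]) -> str:
    """Among conflicting claims about the same fact, believe the most trusted NPC."""
    return claims[max(claims, key=lambda npc: trust[npc])]

update_trust("guard", claim_was_true=False)   # the guard's tip turned out false
update_trust("scribe", claim_was_true=True)   # the scribe's tip checked out
print(resolve({"guard": "tower", "scribe": "library"}))  # -> library
```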
See `RESULTS_ANALYSIS.md` for the full write-up with tables, discussion, and limitations.
### Default World Results (GPT-OSS-120B, 5 runs, no hints)

| Variant | LR=0.0 | LR=0.1 | LR=0.3 | LR=0.5 | LR=0.7 | Overall |
|---|---|---|---|---|---|---|
| Oracle | 100% | 100% | 100% | 100% | 100% | 100% |
| Random | 0% | 0% | 0% | 0% | 0% | 0% |
| Naive | 100% | 100% | 100% | 0% | 20% | 64% |
| Belief-Tracking | 100% | 100% | 100% | 100% | 100% | 100% |
| Reflection-Enh. | 40% | 20% | 60% | 20% | 0% | 28% |
| Memory+Trust | 100% | 100% | 100% | 100% | 100% | 100% |
### Extended World Results (GPT-OSS-120B, 3 runs, no hints)

| Variant | LR=0.0 | LR=0.3 | LR=0.5 | LR=0.7 | Overall |
|---|---|---|---|---|---|
| Oracle | 100% | 100% | 100% | 100% | 100% |
| Random | 0% | 0% | 0% | 0% | 0% |
| Naive | 100% | 33% | 0% | 0% | 33% |
| Belief-Tracking | 100% | 100% | 100% | 100% | 100% |
| Reflection-Enh. | 67% | 100% | 33% | 33% | 58% |
| Memory+Trust | 100% | 100% | 100% | 100% | 100% |
### Cross-Model Comparison (No Hints)

| Model | Naive | Belief-Track. | Reflect.-Enh. | Memory+Trust | Overall |
|---|---|---|---|---|---|
| GPT-OSS-120B | 64% | 100% | 28% | 100% | 73% |
| Llama-4-Scout | 0% | 0% | 0% | 0% | 0% |
### Hint Ablation (With Structured Payload Hints)

| Model | Naive | Belief-Track. | Reflect.-Enh. | Memory+Trust | Overall |
|---|---|---|---|---|---|
| GPT-OSS-120B | 95% | 100% | 100% | 100% | 95% |
| Llama-4-Scout | 0% | 93% | 93% | 7% | 48% |
### LLM Call Logs

All real LLM API calls are saved to `artifacts/logs/calls.jsonl` when running in hybrid or full mode. Each entry contains:

- the full system prompt and user payload sent to the model
- the raw text response from the LLM
- the parsed JSON output
- a timestamp and the task type (`agent_action` or `agent_reflection`)
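A minimal sketch for inspecting the log. The key names used here (`task_type`, `raw_response`, `parsed_output`) are assumptions based on the fields listed above, not confirmed schema; check an actual log line before relying on them:

```python
import json
from collections import Counter

# Each line of the JSONL log is one LLM call. The key names below are
# assumptions -- confirm them against a real entry.
with open("artifacts/logs/calls.jsonl") as f:
    calls = [json.loads(line) for line in f if line.strip()]

# How many calls of each task type were made?
print(Counter(c.get("task_type") for c in calls))

# Inspect the raw and parsed output of the first agent_action call.
first = next(c for c in calls if c.get("task_type") == "agent_action")
print(first.get("raw_response"))
print(first.get("parsed_output"))
```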