Evidence Over Plans: Online Trajectory Verification for Skill Distillation
Structured Pipelines for Autonomous Runnable tasKs and sKill generation
Paper: arXiv:2605.09192 · PDF
🌐 Project page / Blog: https://etayang10th.github.io/spark.github.io/
🤗 Dataset: huggingface.co/datasets/EtaYang10th/SPARK_PDI_Trajectory
🧪 Sibling project, M³-Bench: https://etayang10th.github.io/m3-bench.github.io/ · code · arXiv:2511.17729
SPARK is a research prototype that turns environment-verified trajectories into reusable agent skills. It is built around two pipelines that compose naturally:
- Skill generation: a teacher agent explores a Dockerized task, and a successful trajectory is distilled into a `SKILL.md`.
- Task construction: a natural-language prompt is built and verified into a runnable, oracle-validated task.
At the centre sits the Posterior Distillation Index (PDI), a trajectory-level score that measures whether a skill is grounded in posterior execution evidence rather than stale prior plans. SPARK uses PDI both as a retrospective diagnostic and as an online intervention signal during exploration.
- Posterior over prior. Skills are distilled from what the agent actually observed in the environment, not from a pre-written plan.
- Online PDI intervention. A memo-based PDI proxy monitors exploration in real time and nudges the teacher when trajectories start to ossify.
- Transferable, cheap student inference. On 86 tasks across 11 domains, SPARK-generated skills consistently beat no-skill baselines and outperform human-written ones on most student models; student inference runs at roughly $0.02 per task, up to 1,000× cheaper than teacher exploration.
- Full-trajectory logging. Every run preserves execution logs, verifier signals, and memo histories for trajectory-level analysis.
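The exact PDI definition lives in the paper, but the intuition behind the `token_overlap` proxy (one of the two proxies exposed by `run_pipeline.py`) can be illustrated in a few lines. This is only a sketch, not the repository's implementation: it assumes naive whitespace tokenization, and `pdi_proxy` is a hypothetical name. The idea is that an evidence-grounded skill shares more vocabulary with the executed trajectory than with the prior plan:

```python
def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the token sets of two texts."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def pdi_proxy(skill: str, trajectory_log: str, prior_plan: str) -> float:
    """Higher when the skill leans on execution evidence rather than the prior plan.

    Illustrative only: the real PDI proxy in SPARK may tokenize and
    normalize very differently.
    """
    posterior = token_overlap(skill, trajectory_log)
    prior = token_overlap(skill, prior_plan)
    return posterior - prior
```

A skill that quotes commands and error messages seen during execution scores positive; one that merely restates the initial plan scores near zero or negative.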
Two short animations, regenerated directly from the logged trajectories, illustrate when PDI matters most.
Held-out student (Claude Haiku 4.5): 15 trial-and-error commands with repeated failures under the non-PDI skill, then 5 clean commands reaching reward = 1.0 on the first `lake env lean -DwarningAsError=true` compile once the PDI-refined skill is in place.
Top row (green panels) uses online PDI intervention; bottom row (red panels) is the observe-only control. Both rows run on the same two tasks with the same budget. The top row ends early because the agent already solved the task (reward 1.0 at attempt 8 / attempt 4); the bottom row exhausts the full attempt budget with 0 successes. Orange circles mark soft interventions, red circles mark strong ones.
Both animations are produced by small standalone scripts:

```shell
python scripts/make_case1_lean4_gif.py
python scripts/make_exploration_dynamics_gif.py
```

The skill-generation loop is execute → judge → reflect → retry → distill:
- The teacher agent interacts with a Dockerized environment for up to `N_max` attempts.
- Each attempt produces a terminal interaction log and a verifier record.
- On success → the full trajectory is compiled into six evidence blocks and distilled into `SKILL.md`: Task Pattern · Execution Chain · Verification · Lessons · Environment · Raw Support Tail.
- On failure → a five-section exploration memo is completely rewritten to carry forward only what is useful: Attempts Log · Commands · Verified Facts · Current Error Pattern · Next Strategy. A PDI proxy can then trigger targeted interventions before the next retry.
- The distilled skill is evaluated cross-model: a weaker student agent runs the same (or independently constructed) tasks with the injected `SKILL.md`.
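The loop above can be sketched as a single control flow. All names here (`teacher`, `judge`, their methods) are illustrative placeholders, not the actual `spark_skills_gen/pipeline.py` API:

```python
def generate_skill(task, teacher, judge, n_max: int):
    """Hypothetical sketch of the execute -> judge -> reflect -> retry -> distill loop."""
    memo = ""  # five-section exploration memo, rewritten after each failure
    for attempt in range(1, n_max + 1):
        log = teacher.execute(task, memo)      # terminal interaction log
        verdict = judge.evaluate(task, log)    # verifier record: PASS / FAIL / PARTIAL
        if verdict == "PASS":
            return teacher.distill(task, log)  # compile evidence blocks into SKILL.md
        memo = teacher.reflect(task, log, memo)  # carry forward only what is useful
    return None  # attempt budget exhausted, no skill distilled
```

Note that the memo is rewritten rather than appended to on every failure, which is where the online PDI proxy gets its leverage: it can inspect each rewrite and intervene before the next retry.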
The task-construction pipeline follows a blueprint → repair → critique → oracle-validate pattern; only tasks that pass deterministic oracle verification are accepted.
- Python 3.12 and `uv`
- Docker with a working Harbor setup
- An OpenAI-compatible LLM endpoint (optional: DashScope / DeepSeek / Zhipu are auto-routed by the helper scripts)
Both pipelines read `OPENAI_API_KEY` and `OPENAI_BASE_URL` from the environment. A local `.env` file is auto-loaded if present:

```shell
cp .env_example .env
```

If you use the helper shell scripts, also fill in `DASHSCOPE_API_KEY` for the qwen workflows.

Install dependencies:

```shell
uv sync
```

Use the example spec in `spark_tasks_gen/examples/3d_scan_calc_prompt.json`, or write your own JSON with `prompt`, `available_tools`, `environment_hints`, and `constraints`:
```shell
uv run python run_tasks_gen.py \
  --prompt-file spark_tasks_gen/examples/3d_scan_calc_prompt.json \
  --model gpt-5.4
```

The pipeline runs blueprint → repair → critique → oracle validation and writes the accepted task to `spark_tasks_gen/generated_tasks/<task-id>/`.
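For orientation, a spec with those four fields might look like the following. The field names come from the text above; the values are invented for illustration, so check the bundled example for the authoritative schema:

```json
{
  "prompt": "Scan a directory of 3D mesh files and report their total surface area.",
  "available_tools": ["python", "bash"],
  "environment_hints": ["mesh files live under /data/meshes", "numpy is preinstalled"],
  "constraints": ["no network access", "finish within 10 minutes"]
}
```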
```shell
uv run python run_pipeline.py \
  --agent qwen-coder \
  --model qwen3-coder-next \
  --tasks-dir tasks-no-skills \
  --max-retries 3 \
  --parallelism 4
```

Useful flags:
| Flag | Purpose |
|---|---|
| `--pdi-enabled` | Turn on PDI-guided online intervention. |
| `--pdi-observe-only` | Compute PDI without intervening (diagnostics). |
| `--pdi-method {token_overlap,js_divergence}` | Choose the PDI proxy. |
| `--resume` / `--shuffle` / `--shared-result-dir` | Iterative and multi-model workflows. |
| `--no-dashboard` | CLI-only mode (dashboard defaults to http://localhost:8765). |
`run_eval_skills.py` runs a three-way comparison on the subset of tasks that already have a generated `SKILL.md`:

| Phase | Setup |
|---|---|
| `baseline` | Original `tasks-no-skills`, no skill injected. |
| `generated` | SPARK's `SKILL.md` injected into a staged task under `save/`. |
| `human` | Human-written skills from SkillsBench. |
```shell
uv run python run_eval_skills.py \
  --agent qwen-coder \
  --model qwen3-coder-next \
  --skill-source-model qwen3-coder-next \
  --tasks-dir tasks-no-skills
```

Staging copies are written under `save/` and removed after each run.
```
code/
├── run_tasks_gen.py        # task construction CLI
├── run_pipeline.py         # skill generation CLI (+ dashboard)
├── run_eval_skills.py      # 3-way skill evaluation CLI
├── spark_tasks_gen/        # prompt → blueprint → critique → oracle
├── spark_skills_gen/       # execute → judge → reflect → distill
│   ├── pipeline.py         # main loop
│   ├── skill_evidence.py   # six evidence blocks
│   ├── summarizer.py       # reflect + distill LLM calls
│   ├── judge.py            # parse result.json → PASS / FAIL / PARTIAL
│   ├── trajectory.py       # trajectory writer + PDI signals
│   └── dashboard/          # FastAPI live dashboard
├── scripts/                # convenience wrappers (conda env `spark`)
├── figure/                 # illustrations
└── tasks* / save / spark-jobs  # task sources, staging, Harbor outputs
```
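The role of `judge.py` (mapping a verifier's `result.json` to PASS / FAIL / PARTIAL) can be pictured roughly as follows. The actual `result.json` schema is not documented in this README, so this sketch assumes a single numeric `reward` field and is not the repository's implementation:

```python
import json
from pathlib import Path


def parse_verdict(result_path: str) -> str:
    """Map a verifier result file to PASS / FAIL / PARTIAL.

    Assumes {"reward": <float in [0, 1]>}; the real judge.py may
    consume a richer schema.
    """
    data = json.loads(Path(result_path).read_text())
    reward = float(data.get("reward", 0.0))
    if reward >= 1.0:
        return "PASS"
    if reward > 0.0:
        return "PARTIAL"
    return "FAIL"
```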
After a typical run you will find:

- generated Harbor tasks → `spark_tasks_gen/generated_tasks/<task-id>/`
- task-generation traces → `spark_tasks_gen/generated_tasks/_artifacts/<task-id>/`
- Harbor execution outputs → `spark-jobs/`
- distilled skills and attempt logs → `spark_skills_gen/skills_gen_result/<model>/<task>/`
- evaluation summaries → `spark_skills_gen/skills_eval_result/<model>/<run-id>/`
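Given the `skills_gen_result/<model>/<task>/` layout above, a quick tally of which runs actually produced a distilled skill can be done with a short walk over the results tree. This helper is hypothetical, not part of the repo:

```python
from pathlib import Path


def list_distilled_skills(results_root: str) -> list[str]:
    """Return '<model>/<task>' for every run directory containing a SKILL.md.

    Assumes the skills_gen_result/<model>/<task>/ layout described above.
    """
    root = Path(results_root)
    return sorted(
        str(p.parent.relative_to(root))
        for p in root.glob("*/*/SKILL.md")
    )
```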
The full exploration memos, successful trajectories, and distilled `SKILL.md` artifacts used for PDI analysis are released on the Hugging Face Hub:
```shell
# via huggingface_hub
pip install huggingface_hub
huggingface-cli download EtaYang10th/SPARK_PDI_Trajectory --repo-type dataset --local-dir ./spark_pdi_trajectory

# or via git (large files use Git LFS / xet)
git clone https://huggingface.co/datasets/EtaYang10th/SPARK_PDI_Trajectory
```

`tasks/` and `tasks-no-skills/` reuse the task suite from SkillsBench (paper). A minimal sparse-checkout:
```shell
git clone --filter=blob:none --no-checkout https://github.com/benchflow-ai/skillsbench.git
cd skillsbench
git sparse-checkout init --cone
git sparse-checkout set tasks tasks-no-skills
git checkout main
```

Then copy the folders into your SPARK workspace:
```shell
cp -r skillsbench/tasks /path/to/SPARK/code/
cp -r skillsbench/tasks-no-skills /path/to/SPARK/code/
```

Local convenience wrappers (they assume a conda env named `spark`):

```shell
bash scripts/run_tasks_gen.sh
bash scripts/run_skills_gen.sh
bash scripts/run_eval_skills.sh
```
`run_skills_gen.sh` auto-routes credentials for DashScope / DeepSeek / Zhipu based on the model prefix; the Python entry points above remain the more portable option.
If SPARK or PDI helps your research, please cite:
@misc{zhou2026spark,
title = {Evidence Over Plans: Online Trajectory Verification for Skill Distillation},
author = {Zhou, Yang and Dong, Zihan and Wang, Zhenting and Jin, Can and
Zhao, Shiyu and Guo, Bangwei and Gu, Difei and Zhang, Linjun and
Zhou, Mu and Metaxas, Dimitris N.},
year = {2026},
eprint = {2605.09192},
archivePrefix = {arXiv},
primaryClass = {cs.AI},
url = {https://arxiv.org/abs/2605.09192}
}