
SPARK

Evidence Over Plans: Online Trajectory Verification for Skill Distillation

Structured Pipelines for Autonomous Runnable tasKs and sKill generation


Paper: arXiv:2605.09192 · PDF
🔗 Project page / Blog: https://etayang10th.github.io/spark.github.io/
🤗 Dataset: huggingface.co/datasets/EtaYang10th/SPARK_PDI_Trajectory
🧪 Sibling project (M³-Bench): https://etayang10th.github.io/m3-bench.github.io/ · code · arXiv:2511.17729

SPARK pipeline overview

SPARK is a research prototype that turns environment-verified trajectories into reusable agent skills. It is built around two pipelines that compose naturally:

  • Skill generation: a teacher agent explores a Dockerized task, and a successful trajectory is distilled into a SKILL.md.
  • Task construction: a natural-language prompt is compiled into a runnable task and verified against a deterministic oracle before acceptance.

At the centre sits the Posterior Distillation Index (PDI) β€” a trajectory-level score that measures whether a skill is grounded in posterior execution evidence rather than stale prior plans. SPARK uses PDI both as a retrospective diagnostic and as an online intervention signal during exploration.
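
PDI's precise definition is in the paper. Purely as an illustration of the token_overlap proxy exposed by the --pdi-method flag below, a minimal scorer might measure how much of a skill's (or memo's) text is grounded in the observed execution log; every name here is hypothetical, not SPARK's actual API:

import re

def _tokens(text: str) -> set[str]:
    # Crude tokenizer: lowercased words, paths, and flags.
    return set(re.findall(r"[a-z0-9_./-]+", text.lower()))

def pdi_token_overlap(skill_text: str, execution_log: str) -> float:
    """Fraction of skill tokens that also appear in the observed log (0-1).

    A low score suggests the skill leans on stale prior plans rather than
    posterior execution evidence.
    """
    skill, log = _tokens(skill_text), _tokens(execution_log)
    return len(skill & log) / len(skill) if skill else 0.0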


Highlights

  • Posterior over prior. Skills are distilled from what the agent actually observed in the environment, not from a pre-written plan.
  • Online PDI intervention. A memo-based PDI proxy monitors exploration in real time and nudges the teacher when trajectories start to ossify.
  • Transferable, cheap student inference. On 86 tasks across 11 domains, SPARK-generated skills consistently beat no-skill baselines and outperform human-written ones on most student models; student inference runs at roughly $0.02 per task, up to 1,000× cheaper than teacher exploration.
  • Full-trajectory logging. Every run preserves execution logs, verifier signals, and memo histories for trajectory-level analysis.

SPARK vs baseline vs human-written skills


Demo

SPARK skill-generation demo


Case studies

Two short animations, regenerated directly from the logged trajectories, illustrate when PDI matters most.

Case 1 · lean4-proof: Non-PDI skill vs PDI-refined skill

Held-out student (Claude Haiku 4.5): under the non-PDI skill, 15 trial-and-error commands with repeated failures; with the PDI-refined skill in place, 5 clean commands reaching reward = 1.0 on the first lake env lean -DwarningAsError=true compile.

Case 1: lean4-proof before vs after

Case 2 · Online PDI intervention: w/ PDI vs observe-only

Top row (green panels) uses online PDI intervention; bottom row (red panels) is the observe-only control. Both rows run the same two tasks with the same budget. The top row ends early because the agent has already solved each task (reward 1.0 at attempts 8 and 4, respectively); the bottom row exhausts the full attempt budget with zero successes. Orange circles mark soft interventions; red circles mark strong ones.

Case 2: online PDI intervention dynamics

Both animations are produced by small standalone scripts:

python scripts/make_case1_lean4_gif.py
python scripts/make_exploration_dynamics_gif.py

How it works

The skill-generation loop is execute → judge → reflect → retry → distill; a minimal sketch follows the numbered steps:

  1. The teacher agent interacts with a Dockerized environment for up to N_max attempts.
  2. Each attempt produces a terminal interaction log and a verifier record.
  3. On success, the full trajectory is compiled into six evidence blocks and distilled into SKILL.md: Task Pattern · Execution Chain · Verification · Lessons · Environment · Raw Support Tail.
  4. On failure, a five-section exploration memo is completely rewritten to carry forward only what is useful: Attempts Log · Commands · Verified Facts · Current Error Pattern · Next Strategy. A PDI proxy can then trigger targeted interventions before the next retry.
  5. The distilled skill is evaluated cross-model: a weaker student agent runs the same (or independently constructed) tasks with the injected SKILL.md.
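
A minimal sketch of that loop, assuming a teacher object with execute / rewrite_memo / distill methods (names invented here; the real loop lives in spark_skills_gen/pipeline.py) and reusing pdi_token_overlap from the sketch above:

from dataclasses import dataclass

@dataclass
class Attempt:
    log: str       # terminal interaction log
    verdict: str   # PASS / FAIL / PARTIAL from the verifier record

def skill_generation_loop(task, teacher, n_max: int = 3, pdi_floor: float = 0.5):
    memo = ""                                    # five-section exploration memo
    trajectory: list[Attempt] = []
    for _ in range(n_max):
        attempt = teacher.execute(task, memo)    # steps 1-2: act, log, verify
        trajectory.append(attempt)
        if attempt.verdict == "PASS":
            return teacher.distill(trajectory)   # step 3: evidence blocks -> SKILL.md
        memo = teacher.rewrite_memo(trajectory)  # step 4: full rewrite, not an append
        if pdi_token_overlap(memo, attempt.log) < pdi_floor:
            memo += "\n[INTERVENTION] Ground the next strategy in observed errors."
    return None                                  # budget exhausted, no skill distilled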

The task-construction pipeline follows a blueprint → repair → critique → oracle-validate pattern; only tasks that pass deterministic oracle verification are accepted.
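
The acceptance gate can be pictured as: run the task's reference solution in its environment and require the verifier output (the result.json that judge.py parses) to report full reward. A sketch under an assumed file layout; solution.sh and the exact reward field are illustrative:

import json
import subprocess

def oracle_validate(task_dir: str) -> bool:
    """Accept a generated task only if its reference solution passes."""
    # Assumed layout: solution.sh reproduces the oracle solution and the
    # verifier writes result.json with a numeric "reward" field.
    subprocess.run(["bash", f"{task_dir}/solution.sh"], check=True)
    with open(f"{task_dir}/result.json") as f:
        return json.load(f)["reward"] == 1.0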


Requirements

  • Python 3.12
  • uv
  • Docker with a working Harbor setup
  • An OpenAI-compatible LLM endpoint (optional: DashScope / DeepSeek / Zhipu are auto-routed by the helper scripts)

Both pipelines read OPENAI_API_KEY and OPENAI_BASE_URL from the environment. A local .env file is auto-loaded if present:

cp .env_example .env

If you use the helper shell scripts, also fill in DASHSCOPE_API_KEY for the qwen workflows.
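
An illustrative .env (all values are placeholders; point OPENAI_BASE_URL at whatever OpenAI-compatible endpoint you use):

OPENAI_API_KEY=sk-...
OPENAI_BASE_URL=https://api.openai.com/v1
# Only needed for the qwen helper scripts:
DASHSCOPE_API_KEY=...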


Quick start

1. Install

uv sync

2. Generate a task from a prompt

Use the example spec in spark_tasks_gen/examples/3d_scan_calc_prompt.json, or write your own JSON with prompt, available_tools, environment_hints, and constraints:

uv run python run_tasks_gen.py \
  --prompt-file spark_tasks_gen/examples/3d_scan_calc_prompt.json \
  --model gpt-5.4
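
A sketch of such a spec, with invented values (check the bundled example for the real schema):

{
  "prompt": "Compute surface-area statistics for a 3D scan mesh.",
  "available_tools": ["python3", "numpy"],
  "environment_hints": ["the mesh is mounted at /data/scan.ply"],
  "constraints": ["must finish within 300 seconds"]
}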

The pipeline runs blueprint → repair → critique → oracle validation and writes the accepted task to spark_tasks_gen/generated_tasks/<task-id>/.

3. Generate skills from tasks

uv run python run_pipeline.py \
  --agent qwen-coder \
  --model qwen3-coder-next \
  --tasks-dir tasks-no-skills \
  --max-retries 3 \
  --parallelism 4

Useful flags:

Flag                                          Purpose
--pdi-enabled                                 Turn on PDI-guided online intervention.
--pdi-observe-only                            Compute PDI without intervening (diagnostics).
--pdi-method {token_overlap,js_divergence}    Choose the PDI proxy.
--resume / --shuffle / --shared-result-dir    Iterative and multi-model workflows.
--no-dashboard                                CLI-only mode (dashboard defaults to http://localhost:8765).
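
For example, assuming the flags compose as listed, a PDI-guided rerun with the token-overlap proxy would look like:

uv run python run_pipeline.py \
  --agent qwen-coder \
  --model qwen3-coder-next \
  --tasks-dir tasks-no-skills \
  --pdi-enabled \
  --pdi-method token_overlap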

4. Evaluate generated skills

run_eval_skills.py runs a three-way comparison on the subset of tasks that already have a generated SKILL.md:

Phase       Setup
baseline    Original tasks-no-skills, no skill injected.
generated   SPARK's SKILL.md injected into a staged task under save/.
human       Human-written skills from SkillsBench.

uv run python run_eval_skills.py \
  --agent qwen-coder \
  --model qwen3-coder-next \
  --skill-source-model qwen3-coder-next \
  --tasks-dir tasks-no-skills

Staging copies are written under save/ and removed after each run.


Repository layout

code/
├── run_tasks_gen.py              # task construction CLI
├── run_pipeline.py               # skill generation CLI (+ dashboard)
├── run_eval_skills.py            # 3-way skill evaluation CLI
├── spark_tasks_gen/              # prompt → blueprint → critique → oracle
├── spark_skills_gen/             # execute → judge → reflect → distill
│   ├── pipeline.py               # main loop
│   ├── skill_evidence.py         # six evidence blocks
│   ├── summarizer.py             # reflect + distill LLM calls
│   ├── judge.py                  # parse result.json → PASS / FAIL / PARTIAL
│   ├── trajectory.py             # trajectory writer + PDI signals
│   └── dashboard/                # FastAPI live dashboard
├── scripts/                      # convenience wrappers (conda env `spark`)
├── figure/                       # illustrations
└── tasks* / save / spark-jobs    # task sources, staging, Harbor outputs

Outputs

After a typical run you will find:

  • generated Harbor tasks: spark_tasks_gen/generated_tasks/<task-id>/
  • task-generation traces: spark_tasks_gen/generated_tasks/_artifacts/<task-id>/
  • Harbor execution outputs: spark-jobs/
  • distilled skills and attempt logs: spark_skills_gen/skills_gen_result/<model>/<task>/
  • evaluation summaries: spark_skills_gen/skills_eval_result/<model>/<run-id>/

Dataset: SPARK PDI Trajectory (Hugging Face)

The full exploration memos, successful trajectories, and distilled SKILL.md artifacts used for PDI analysis are released on the Hugging Face Hub:

SPARK_PDI_Trajectory on Hugging Face

# via huggingface_hub
pip install huggingface_hub
huggingface-cli download EtaYang10th/SPARK_PDI_Trajectory --repo-type dataset --local-dir ./spark_pdi_trajectory

# or via git (large files use Git LFS / xet)
git clone https://huggingface.co/datasets/EtaYang10th/SPARK_PDI_Trajectory
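
Or from Python with huggingface_hub's snapshot_download:

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="EtaYang10th/SPARK_PDI_Trajectory",
    repo_type="dataset",
    local_dir="./spark_pdi_trajectory",
)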

Getting the SkillsBench tasks

tasks/ and tasks-no-skills/ reuse the task suite from SkillsBench (paper). A minimal sparse-checkout:

git clone --filter=blob:none --no-checkout https://github.com/benchflow-ai/skillsbench.git
cd skillsbench
git sparse-checkout init --cone
git sparse-checkout set tasks tasks-no-skills
git checkout main

Then copy the folders into your SPARK workspace:

cp -r skillsbench/tasks            /path/to/SPARK/code/
cp -r skillsbench/tasks-no-skills  /path/to/SPARK/code/

Helper scripts

Local convenience wrappers (they assume a conda env named spark):

  • bash scripts/run_tasks_gen.sh
  • bash scripts/run_skills_gen.sh
  • bash scripts/run_eval_skills.sh

run_skills_gen.sh auto-routes credentials for DashScope / DeepSeek / Zhipu based on the model prefix; the Python entry points above remain the more portable option.


Citation

If SPARK or PDI helps your research, please cite:

@misc{zhou2026spark,
  title         = {Evidence Over Plans: Online Trajectory Verification for Skill Distillation},
  author        = {Zhou, Yang and Dong, Zihan and Wang, Zhenting and Jin, Can and
                   Zhao, Shiyu and Guo, Bangwei and Gu, Difei and Zhang, Linjun and
                   Zhou, Mu and Metaxas, Dimitris N.},
  year          = {2026},
  eprint        = {2605.09192},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI},
  url           = {https://arxiv.org/abs/2605.09192}
}
