
SPARK

Evidence Over Plans: Online Trajectory Verification for Skill Distillation

Structured Pipelines for Autonomous Runnable tasKs and sKill generation


Paper: arXiv:2605.09192 · PDF
🔗 Project page / Blog: https://etayang10th.github.io/spark.github.io/
🤗 Dataset: huggingface.co/datasets/EtaYang10th/SPARK_PDI_Trajectory
🧪 Sibling project (M³-Bench): https://etayang10th.github.io/m3-bench.github.io/ · code · arXiv:2511.17729

SPARK pipeline overview

SPARK is a research prototype that turns environment-verified trajectories into reusable agent skills. It is built around two pipelines that compose naturally:

  • Skill generation: a teacher agent explores a Dockerized task, and a successful trajectory is distilled into a SKILL.md.
  • Task construction: a natural-language prompt is compiled into a runnable task and verified against a deterministic oracle before acceptance.

At the centre sits the Posterior Distillation Index (PDI) β€” a trajectory-level score that measures whether a skill is grounded in posterior execution evidence rather than stale prior plans. SPARK uses PDI both as a retrospective diagnostic and as an online intervention signal during exploration.
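
PDI's precise definition is in the paper. Purely as an illustration of the token_overlap proxy exposed by the --pdi-method flag below, a minimal scorer might measure how much of a skill's (or memo's) text is grounded in the observed execution log; every name here is hypothetical, not SPARK's actual API:

import re

def _tokens(text: str) -> set[str]:
    # Crude tokenizer: lowercased words, paths, and flags.
    return set(re.findall(r"[a-z0-9_./-]+", text.lower()))

def pdi_token_overlap(skill_text: str, execution_log: str) -> float:
    """Fraction of skill tokens that also appear in the observed log (0-1).

    A low score suggests the skill leans on stale prior plans rather than
    posterior execution evidence.
    """
    skill, log = _tokens(skill_text), _tokens(execution_log)
    return len(skill & log) / len(skill) if skill else 0.0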


Highlights

  • Posterior over prior. Skills are distilled from what the agent actually observed in the environment, not from a pre-written plan.
  • Online PDI intervention. A memo-based PDI proxy monitors exploration in real time and nudges the teacher when trajectories start to ossify.
  • Transferable, cheap student inference. On 86 tasks across 11 domains, SPARK-generated skills consistently beat no-skill baselines and outperform human-written ones on most student models; student inference runs at roughly $0.02 per task, up to 1,000× cheaper than teacher exploration.
  • Full-trajectory logging. Every run preserves execution logs, verifier signals, and memo histories for trajectory-level analysis.

SPARK vs baseline vs human-written skills


Demo

SPARK skill-generation demo


Case studies

Two short animations, regenerated directly from the logged trajectories, illustrate when PDI matters most.

Case 1 · lean4-proof: Non-PDI skill vs PDI-refined skill

Held-out student (Claude Haiku 4.5): under the non-PDI skill, 15 trial-and-error commands with repeated failures; with the PDI-refined skill in place, 5 clean commands reaching reward = 1.0 on the first lake env lean -DwarningAsError=true compile.

Case 1: lean4-proof before vs after

Case 2 · Online PDI intervention: w/ PDI vs observe-only

Top row (green panels) uses online PDI intervention; bottom row (red panels) is the observe-only control. Both rows run the same two tasks with the same budget. The top row ends early because the agent has already solved each task (reward 1.0 at attempts 8 and 4, respectively); the bottom row exhausts the full attempt budget with zero successes. Orange circles mark soft interventions; red circles mark strong ones.

Case 2: online PDI intervention dynamics

Both animations are produced by small standalone scripts:

python scripts/make_case1_lean4_gif.py
python scripts/make_exploration_dynamics_gif.py

How it works

The skill-generation loop is execute → judge → reflect → retry → distill; a minimal sketch follows the numbered steps:

  1. The teacher agent interacts with a Dockerized environment for up to N_max attempts.
  2. Each attempt produces a terminal interaction log and a verifier record.
  3. On success, the full trajectory is compiled into six evidence blocks and distilled into SKILL.md: Task Pattern · Execution Chain · Verification · Lessons · Environment · Raw Support Tail.
  4. On failure, a five-section exploration memo is completely rewritten to carry forward only what is useful: Attempts Log · Commands · Verified Facts · Current Error Pattern · Next Strategy. A PDI proxy can then trigger targeted interventions before the next retry.
  5. The distilled skill is evaluated cross-model: a weaker student agent runs the same (or independently constructed) tasks with the injected SKILL.md.
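
A minimal sketch of that loop, assuming a teacher object with execute / rewrite_memo / distill methods (names invented here; the real loop lives in spark_skills_gen/pipeline.py) and reusing pdi_token_overlap from the sketch above:

from dataclasses import dataclass

@dataclass
class Attempt:
    log: str       # terminal interaction log
    verdict: str   # PASS / FAIL / PARTIAL from the verifier record

def skill_generation_loop(task, teacher, n_max: int = 3, pdi_floor: float = 0.5):
    memo = ""                                    # five-section exploration memo
    trajectory: list[Attempt] = []
    for _ in range(n_max):
        attempt = teacher.execute(task, memo)    # steps 1-2: act, log, verify
        trajectory.append(attempt)
        if attempt.verdict == "PASS":
            return teacher.distill(trajectory)   # step 3: evidence blocks -> SKILL.md
        memo = teacher.rewrite_memo(trajectory)  # step 4: full rewrite, not an append
        if pdi_token_overlap(memo, attempt.log) < pdi_floor:
            memo += "\n[INTERVENTION] Ground the next strategy in observed errors."
    return None                                  # budget exhausted, no skill distilled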

The task-construction pipeline follows a blueprint → repair → critique → oracle-validate pattern; only tasks that pass deterministic oracle verification are accepted.
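
The acceptance gate can be pictured as: run the task's reference solution in its environment and require the verifier output (the result.json that judge.py parses) to report full reward. A sketch under an assumed file layout; solution.sh and the exact reward field are illustrative:

import json
import subprocess

def oracle_validate(task_dir: str) -> bool:
    """Accept a generated task only if its reference solution passes."""
    # Assumed layout: solution.sh reproduces the oracle solution and the
    # verifier writes result.json with a numeric "reward" field.
    subprocess.run(["bash", f"{task_dir}/solution.sh"], check=True)
    with open(f"{task_dir}/result.json") as f:
        return json.load(f)["reward"] == 1.0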


Requirements

  • Python 3.12
  • uv
  • Docker with a working Harbor setup
  • An OpenAI-compatible LLM endpoint (optional: DashScope / DeepSeek / Zhipu are auto-routed by the helper scripts)

Both pipelines read OPENAI_API_KEY and OPENAI_BASE_URL from the environment. A local .env file is auto-loaded if present:

cp .env_example .env

If you use the helper shell scripts, also fill in DASHSCOPE_API_KEY for the qwen workflows.
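
An illustrative .env (all values are placeholders; point OPENAI_BASE_URL at whatever OpenAI-compatible endpoint you use):

OPENAI_API_KEY=sk-...
OPENAI_BASE_URL=https://api.openai.com/v1
# Only needed for the qwen helper scripts:
DASHSCOPE_API_KEY=...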


Quick start

1. Install

uv sync

2. Generate a task from a prompt

Use the example spec in spark_tasks_gen/examples/3d_scan_calc_prompt.json, or write your own JSON with prompt, available_tools, environment_hints, and constraints:

uv run python run_tasks_gen.py \
  --prompt-file spark_tasks_gen/examples/3d_scan_calc_prompt.json \
  --model gpt-5.4
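
A sketch of such a spec, with invented values (check the bundled example for the real schema):

{
  "prompt": "Compute surface-area statistics for a 3D scan mesh.",
  "available_tools": ["python3", "numpy"],
  "environment_hints": ["the mesh is mounted at /data/scan.ply"],
  "constraints": ["must finish within 300 seconds"]
}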

The pipeline runs blueprint → repair → critique → oracle validation and writes the accepted task to spark_tasks_gen/generated_tasks/<task-id>/.

3. Generate skills from tasks

uv run python run_pipeline.py \
  --agent qwen-coder \
  --model qwen3-coder-next \
  --tasks-dir tasks-no-skills \
  --max-retries 3 \
  --parallelism 4

Useful flags:

Flag                                          Purpose
--pdi-enabled                                 Turn on PDI-guided online intervention.
--pdi-observe-only                            Compute PDI without intervening (diagnostics).
--pdi-method {token_overlap,js_divergence}    Choose the PDI proxy.
--resume / --shuffle / --shared-result-dir    Iterative and multi-model workflows.
--no-dashboard                                CLI-only mode (dashboard defaults to http://localhost:8765).
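
For example, assuming the flags compose as listed, a PDI-guided rerun with the token-overlap proxy would look like:

uv run python run_pipeline.py \
  --agent qwen-coder \
  --model qwen3-coder-next \
  --tasks-dir tasks-no-skills \
  --pdi-enabled \
  --pdi-method token_overlap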

4. Evaluate generated skills

run_eval_skills.py runs a three-way comparison on the subset of tasks that already have a generated SKILL.md:

Phase       Setup
baseline    Original tasks-no-skills, no skill injected.
generated   SPARK's SKILL.md injected into a staged task under save/.
human       Human-written skills from SkillsBench.

uv run python run_eval_skills.py \
  --agent qwen-coder \
  --model qwen3-coder-next \
  --skill-source-model qwen3-coder-next \
  --tasks-dir tasks-no-skills

Staging copies are written under save/ and removed after each run.


Repository layout

code/
├── run_tasks_gen.py              # task construction CLI
├── run_pipeline.py               # skill generation CLI (+ dashboard)
├── run_eval_skills.py            # 3-way skill evaluation CLI
├── spark_tasks_gen/              # prompt → blueprint → critique → oracle
├── spark_skills_gen/             # execute → judge → reflect → distill
│   ├── pipeline.py               # main loop
│   ├── skill_evidence.py         # six evidence blocks
│   ├── summarizer.py             # reflect + distill LLM calls
│   ├── judge.py                  # parse result.json → PASS / FAIL / PARTIAL
│   ├── trajectory.py             # trajectory writer + PDI signals
│   └── dashboard/                # FastAPI live dashboard
├── scripts/                      # convenience wrappers (conda env `spark`)
├── figure/                       # illustrations
└── tasks* / save / spark-jobs    # task sources, staging, Harbor outputs

Outputs

After a typical run you will find:

  • generated Harbor tasks: spark_tasks_gen/generated_tasks/<task-id>/
  • task-generation traces: spark_tasks_gen/generated_tasks/_artifacts/<task-id>/
  • Harbor execution outputs: spark-jobs/
  • distilled skills and attempt logs: spark_skills_gen/skills_gen_result/<model>/<task>/
  • evaluation summaries: spark_skills_gen/skills_eval_result/<model>/<run-id>/

Dataset: SPARK PDI Trajectory (Hugging Face)

The full exploration memos, successful trajectories, and distilled SKILL.md artifacts used for PDI analysis are released on the Hugging Face Hub:

SPARK_PDI_Trajectory on Hugging Face

# via huggingface_hub
pip install huggingface_hub
huggingface-cli download EtaYang10th/SPARK_PDI_Trajectory --repo-type dataset --local-dir ./spark_pdi_trajectory

# or via git (large files use Git LFS / xet)
git clone https://huggingface.co/datasets/EtaYang10th/SPARK_PDI_Trajectory
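
Or from Python with huggingface_hub's snapshot_download:

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="EtaYang10th/SPARK_PDI_Trajectory",
    repo_type="dataset",
    local_dir="./spark_pdi_trajectory",
)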

Getting the SkillsBench tasks

tasks/ and tasks-no-skills/ reuse the task suite from SkillsBench (paper). A minimal sparse-checkout:

git clone --filter=blob:none --no-checkout https://github.com/benchflow-ai/skillsbench.git
cd skillsbench
git sparse-checkout init --cone
git sparse-checkout set tasks tasks-no-skills
git checkout main

Then copy the folders into your SPARK workspace:

cp -r skillsbench/tasks            /path/to/SPARK/code/
cp -r skillsbench/tasks-no-skills  /path/to/SPARK/code/

Helper scripts

Local convenience wrappers (they assume a conda env named spark):

  • bash scripts/run_tasks_gen.sh
  • bash scripts/run_skills_gen.sh
  • bash scripts/run_eval_skills.sh

run_skills_gen.sh auto-routes credentials for DashScope / DeepSeek / Zhipu based on the model prefix; the Python entry points above remain the more portable option.


Citation

If SPARK or PDI helps your research, please cite:

@misc{zhou2026spark,
  title         = {Evidence Over Plans: Online Trajectory Verification for Skill Distillation},
  author        = {Zhou, Yang and Dong, Zihan and Wang, Zhenting and Jin, Can and
                   Zhao, Shiyu and Guo, Bangwei and Gu, Difei and Zhang, Linjun and
                   Zhou, Mu and Metaxas, Dimitris N.},
  year          = {2026},
  eprint        = {2605.09192},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI},
  url           = {https://arxiv.org/abs/2605.09192}
}
