Dry-run crashes with KeyError when a checkpoint-driven grouped job is present #463

@alessandrogentili001

Software Versions

$ snakemake --version
snakemake 9.19.0

$ conda list | grep "snakemake-executor-plugin-slurm"
snakemake-executor-plugin-slurm 2.6.1 pyhdfd78af_0 bioconda
snakemake-executor-plugin-slurm-jobstep 0.4.0 pyhdfd78af_0 bioconda

$ sinfo --version
slurm 23.11.10-BullSequana.1.2.1

Describe the bug

snakemake --profile profiles/leonardo.yaml --dry-run crashes with:

KeyError: collect_loop_result

when the workflow contains a checkpoint-driven dynamic branch that is part of a group (described inside profiles/leonardo.yaml).

The same workflow executes correctly in a real run. In that case, the dynamic loop branch runs inside a single grouped SLURM job, which is faster than the non-grouped version, where each iteration may incur scheduling overhead.

Logs

[agentil1@login01 snakemake-tutorial]$ make dry-run
make[1]: Entering directory '/leonardo_work/PHD_gentili/snakemake-tutorial'
snakemake --profile profiles/leonardo.yaml --dry-run
Using profile profiles/leonardo.yaml for setting default command line arguments.
host: login01.leonardo.local
Building DAG of jobs...
Job stats:
job                    count
-------------------  -------
bwa_map                    3
evaluate_node_n            1
collect_loop_result        1
samtools_sort              3
samtools_index             3
bcftools_call              1
plot_quals                 1
all                        1
total                     14

...
... ### REST OF THE WORKFLOW ### 
...
    [Mon May 25 15:31:26 2026]
    rule samtools_sort:
        input: mapped_reads/A.bam
        output: sorted_reads/A.bam
        jobid: 3
        reason: Missing output files: sorted_reads/A.bam; Input files updated by another job: mapped_reads/A.bam
        wildcards: sample=A
        threads: 3
        resources: tmpdir=/scratch_local, disk_mb=<TBD>, disk=<TBD>, disk_mib=<TBD>, mem_mb=8000, mem=8 GB, mem_mib=7630, slurm_account=phd_gentili, slurm_partition=boost_usr_prod, runtime=24, ntasks_per_node=8, cpus_per_task=4
Traceback (most recent call last):

  File "/leonardo/home/userinternal/agentil1/miniconda3/envs/snakemake-tutorial/lib/python3.12/site-packages/snakemake/cli.py", line 2308, in args_to_api
    dag_api.execute_workflow(

  File "/leonardo/home/userinternal/agentil1/miniconda3/envs/snakemake-tutorial/lib/python3.12/site-packages/snakemake/api.py", line 646, in execute_workflow
    workflow.execute(

  File "/leonardo/home/userinternal/agentil1/miniconda3/envs/snakemake-tutorial/lib/python3.12/site-packages/snakemake/workflow.py", line 1461, in execute
    raise e

  File "/leonardo/home/userinternal/agentil1/miniconda3/envs/snakemake-tutorial/lib/python3.12/site-packages/snakemake/workflow.py", line 1457, in execute
    success = self.scheduler.schedule()
              ^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/leonardo/home/userinternal/agentil1/miniconda3/envs/snakemake-tutorial/lib/python3.12/site-packages/snakemake/scheduling/job_scheduler.py", line 410, in schedule
    raise e

  File "/leonardo/home/userinternal/agentil1/miniconda3/envs/snakemake-tutorial/lib/python3.12/site-packages/snakemake/scheduling/job_scheduler.py", line 233, in schedule
    self._finish_jobs()

  File "/leonardo/home/userinternal/agentil1/miniconda3/envs/snakemake-tutorial/lib/python3.12/site-packages/snakemake/scheduling/job_scheduler.py", line 495, in _finish_jobs
    self.workflow.async_run(postprocess())

  File "/leonardo/home/userinternal/agentil1/miniconda3/envs/snakemake-tutorial/lib/python3.12/site-packages/snakemake/workflow.py", line 268, in async_run
    return runner.run(coro)
           ^^^^^^^^^^^^^^^^

  File "/leonardo/home/userinternal/agentil1/miniconda3/envs/snakemake-tutorial/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/leonardo/home/userinternal/agentil1/miniconda3/envs/snakemake-tutorial/lib/python3.12/asyncio/base_events.py", line 691, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^

  File "/leonardo/home/userinternal/agentil1/miniconda3/envs/snakemake-tutorial/lib/python3.12/site-packages/snakemake/scheduling/job_scheduler.py", line 490, in postprocess
    await self.workflow.dag.finish(

  File "/leonardo/home/userinternal/agentil1/miniconda3/envs/snakemake-tutorial/lib/python3.12/site-packages/snakemake/dag.py", line 2325, in finish
    potential_new_ready_jobs = self.update_ready(depending)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/leonardo/home/userinternal/agentil1/miniconda3/envs/snakemake-tutorial/lib/python3.12/site-packages/snakemake/dag.py", line 1906, in update_ready
    group = self._group[job]
            ~~~~~~~~~~~^^^^^

KeyError: collect_loop_result

make[1]: *** [makefile:26: dry-run] Error 1
make[1]: Leaving directory '/leonardo_work/PHD_gentili/snakemake-tutorial'
[agentil1@login01 snakemake-tutorial]$ 
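
The traceback ends in a plain dictionary lookup, self._group[job]. The following toy sketch shows one way such a lookup can fail. It is not Snakemake's code: the Job class, the group index, and the idea that the job object is re-created after the group index was built are all assumptions made for illustration, not a confirmed root cause.

```python
class Job:
    """Hypothetical stand-in for a DAG job; hashed by identity, like most objects."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return self.name


def build_group_index(groups):
    """Map every job to the group (a frozenset of jobs) that contains it."""
    return {job: group for group in groups for job in group}


def lookup_strict(group_index, job):
    # Mirrors the shape of `group = self._group[job]`: raises KeyError
    # when the job was never registered in the index.
    return group_index[job]


def lookup_guarded(group_index, job):
    # Illustrative alternative: treat an unregistered job as its own group.
    return group_index.get(job, frozenset([job]))


# Group index built while the DAG still holds the original job objects.
checkpoint_job = Job("evaluate_node_n")
old_collect = Job("collect_loop_result")
index = build_group_index([frozenset([checkpoint_job, old_collect])])

# Assumption: the dependent job is re-created later, so the new object
# is not a key of the index even though it has the same name.
new_collect = Job("collect_loop_result")

try:
    lookup_strict(index, new_collect)
except KeyError as err:
    print(f"KeyError: {err}")

print(lookup_guarded(index, new_collect))
```

Whether the dry-run path actually leaves the group mapping stale in this way would need to be confirmed against the Snakemake source.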

Minimal example

Consider a Snakefile with this checkpoint-driven dynamic node and SLURM grouping:

# --- CHECKPOINT LOOP LOGIC ---
# Snakemake's DAG evaluates backward. To create a forward "while" loop, we use 
# a checkpoint to manually generate the next file, and an input function to recursively 
# evaluate whether the loop is done or needs the next iteration.

checkpoint evaluate_node_n:
    input:
        "loop_files/{n}.txt"
    output:
        "loop_files_check/{n}_status.txt"
    resources:
        mem_mb=1000
    group: "loop" # Separate group for the loop logic
    script:
        "scripts/increment_loop.py"

def evaluate_loop_dynamically(wildcards):
    """
    A recursive Python function that acts as a forward-processing while-loop using Snakemake Checkpoints.
    It triggers checkpoint n=0, waits for it, checks the output, and either returns the final file
    or asks for n=1, and so on.
    """
    def check_node(n):
        # 1. Request the checkpoint output for this specific 'n'.
        # If that checkpoint has not finished yet, .get() raises IncompleteCheckpointException;
        # Snakemake catches it and re-evaluates this input function once evaluate_node_n has run for {n}.
        with checkpoints.evaluate_node_n.get(n=n).output[0].open() as f:
            status = f.read().strip()
            
        if status == "done":
            # 2a. Loop condition is met. The checkpoint script has already created nothing_special.txt.
            return "loop_files/nothing_special.txt"
        else:
            # 2b. Checkpoint generated {n+1}.txt manually on disk. 
            # Recursively evaluate the NEXT node dynamically!
            return check_node(n + 1)
            
    # Start the forward-chain from node 0
    return check_node(0)

# Final aggregating rule that triggers the dynamic input function
rule collect_loop_result:
    input:
        evaluate_loop_dynamically
    output:
        "loop_files/loop_finished.txt"
    resources:
        mem_mb=1000
    group: "loop" # Separate group for the loop logic
    shell:
        "cp {input} {output}"
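
scripts/increment_loop.py is not shown above. Based on the comments in the Snakefile, a minimal stand-in could look like the sketch below. The stopping rule (a fixed bound max_n), the file contents, and the "continue" status string are hypothetical; only the "done" status and the file names come from the Snakefile.

```python
from pathlib import Path


def increment_loop(input_file, status_file, max_n=3):
    """Hypothetical stand-in for scripts/increment_loop.py.

    Reads the node index from the input file name, then either ends the
    loop or creates the input file for the next iteration.
    """
    input_path = Path(input_file)
    status_path = Path(status_file)
    loop_dir = input_path.parent
    n = int(input_path.stem)  # "loop_files/2.txt" -> 2

    status_path.parent.mkdir(parents=True, exist_ok=True)
    if n >= max_n:
        # Loop finished: create the file that collect_loop_result will copy.
        (loop_dir / "nothing_special.txt").write_text(f"finished after node {n}\n")
        status = "done"
    else:
        # Loop continues: create the input of the next checkpoint instance.
        (loop_dir / f"{n + 1}.txt").write_text(f"node {n + 1}\n")
        status = "continue"

    status_path.write_text(status + "\n")
    return status


# When run through the `script:` directive, Snakemake injects a `snakemake` object.
if "snakemake" in globals():
    increment_loop(snakemake.input[0], snakemake.output[0])
```

With this stand-in, loop_files/0.txt has to exist (or be produced by another rule) before the first checkpoint instance can run.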

Additional context

The same configuration runs as expected:

[agentil1@login01 snakemake-tutorial]$ make run-login
make[1]: Entering directory '/leonardo_work/PHD_gentili/snakemake-tutorial'
snakemake --profile profiles/leonardo.yaml
Using profile profiles/leonardo.yaml for setting default command line arguments.
host: login01.leonardo.local
Building DAG of jobs...
You are running snakemake in a SLURM job context. This is not recommended, as it may lead to unexpected behavior. If possible, please run Snakemake directly on the login node.
SLURM run ID: workflow_node_2c8cc88e-46ed-47a9-b6ee-423dce74b686
MinJobAge 30s (>= 30s). 'squeue' should work reliably for status queries.
Using shell: /usr/bin/bash
Provided remote nodes: 3
Conda environments: ignored
Job stats:
job                    count
-------------------  -------
bwa_map                    3
evaluate_node_n            1
collect_loop_result        1
samtools_sort              3
samtools_index             3
bcftools_call              1
plot_quals                 1
all                        1
total                     14

Select jobs to execute...
Execute 3 jobs...
Select jobs to execute...
Job 595ad3d4-a096-53e6-9bff-b8eeb6dd7ec7 has been submitted with SLURM jobid 42512254 (log: /leonardo_work/PHD_gentili/snakemake-tutorial/logs/slurm/group_pre_processing_bwa_map_samtools_index_samtools_sort/42512254.log).
Job 74b63bb2-ef49-51b8-9df2-ecd163c74f57 has been submitted with SLURM jobid 42512255 (log: /leonardo_work/PHD_gentili/snakemake-tutorial/logs/slurm/group_pre_processing_bwa_map_samtools_index_samtools_sort/42512255.log).
Job de2f1f4f-f96c-5d06-b5bc-8f130f0ca6e9 has been submitted with SLURM jobid 42512270 (log: /leonardo_work/PHD_gentili/snakemake-tutorial/logs/slurm/group_pre_processing_bwa_map_samtools_index_samtools_sort/42512270.log).
Write-protecting output file sorted_reads/A.bam.
[Mon May 25 15:35:44 2026]
Finished jobid: 4 (Rule: bwa_map)
[Mon May 25 15:35:44 2026]
Finished jobid: 3 (Rule: samtools_sort)
[Mon May 25 15:35:44 2026]
Finished jobid: 9 (Rule: samtools_index)
3 of 14 steps (21%) done
Write-protecting output file sorted_reads/B.bam.
[Mon May 25 15:35:44 2026]
Finished jobid: 6 (Rule: bwa_map)
[Mon May 25 15:35:44 2026]
Finished jobid: 5 (Rule: samtools_sort)
[Mon May 25 15:35:44 2026]
Finished jobid: 10 (Rule: samtools_index)
6 of 14 steps (43%) done
Write-protecting output file sorted_reads/C.bam.
[Mon May 25 15:35:45 2026]
Finished jobid: 8 (Rule: bwa_map)
[Mon May 25 15:35:45 2026]
Finished jobid: 7 (Rule: samtools_sort)
[Mon May 25 15:35:45 2026]
Finished jobid: 11 (Rule: samtools_index)
9 of 14 steps (64%) done
Execute 2 jobs...
Job c126a408-0728-5756-8954-806a37075583 has been submitted with SLURM jobid 42512285 (log: /leonardo_work/PHD_gentili/snakemake-tutorial/logs/slurm/group_loop_collect_loop_result_evaluate_node_n/42512285.log).
Job 162f9ed7-6b03-5146-8516-a06e591c52d5 has been submitted with SLURM jobid 42512286 (log: /leonardo_work/PHD_gentili/snakemake-tutorial/logs/slurm/group_core_analysis_bcftools_call/42512286.log).
[Mon May 25 15:36:17 2026]
Finished jobid: 13 (Rule: evaluate_node_n)
[Mon May 25 15:36:17 2026]
Finished jobid: 12 (Rule: collect_loop_result)
12 of 14 steps (86%) done
Updating checkpoint dependencies.
Removing temporary output mapped_reads/A.bam.
Removing temporary output mapped_reads/A.bam.
Removing temporary output mapped_reads/B.bam.
Removing temporary output mapped_reads/B.bam.
Removing temporary output mapped_reads/C.bam.
Removing temporary output mapped_reads/C.bam.
[Mon May 25 15:36:17 2026]
Finished jobid: 2 (Rule: bcftools_call)
13 of 14 steps (93%) done
Select jobs to execute...
Execute 1 jobs...

[Mon May 25 15:36:17 2026]
localrule plot_quals:
    input: calls/all.vcf
    output: plots/quals.svg
    log: logs/plots.log
    jobid: 1
    reason: Missing output files: plots/quals.svg; Input files updated by another job: calls/all.vcf
    resources: tmpdir=/scratch_local, disk_mb=1000, disk=1 GB, disk_mib=954, mem_mb=32000, mem=32 GB, mem_mib=30518, slurm_account=phd_gentili, slurm_partition=boost_usr_prod, runtime=24, ntasks_per_node=8, cpus_per_task=4
[Mon May 25 15:36:18 2026]
Finished jobid: 1 (Rule: plot_quals)
14 of 14 steps (100%) done
Select jobs to execute...
Execute 1 jobs...

[Mon May 25 15:36:18 2026]
localrule all:
    input: plots/quals.svg, loop_files/loop_finished.txt
    jobid: 0
    reason: Input files updated by another job: loop_files/loop_finished.txt, plots/quals.svg
    resources: tmpdir=/scratch_local, disk_mb=1000, disk=1 GB, disk_mib=954, mem_mb=32000, mem=32 GB, mem_mib=30518, slurm_account=phd_gentili, slurm_partition=boost_usr_prod, runtime=24, ntasks_per_node=8, cpus_per_task=4
[Mon May 25 15:36:18 2026]
Finished jobid: 0 (Rule: all)
15 of 14 steps (107%) done
Efficiency report for workflow workflow_node_2c8cc88e-46ed-47a9-b6ee-423dce74b686 saved to /leonardo_work/PHD_gentili/snakemake-tutorial/efficiency_report_workflow_node_2c8cc88e-46ed-47a9-b6ee-423dce74b686.csv.
Complete log(s): /leonardo_work/PHD_gentili/snakemake-tutorial/.snakemake/log/2026-05-25T153500.582461.snakemake.log
make[1]: Leaving directory '/leonardo_work/PHD_gentili/snakemake-tutorial'
[agentil1@login01 snakemake-tutorial]$ 
