It's hard to figure out that a Slurm WDL workflow is failing because the default job store isn't on a shared filesystem #5007

Open
adamnovak opened this issue Jul 3, 2024 · 0 comments
adamnovak commented Jul 3, 2024

I ran this:

toil-wdl-runner --batchSystem slurm https://raw.githubusercontent.com/vgteam/vg_wdl/lr-giraffe/workflows/giraffe_and_deepvariant.wdl https://raw.githubusercontent.com/vgteam/vg_wdl/lr-giraffe/params/giraffe_and_deepvariant.json --outputDirectory output/preset/

My workflow failed and I got no logs from any of the tasks.

When I manually added --batchLogsDir logs/ I got a vaguely informative error:

[2024-07-03T13:31:46-0700] [MainThread] [W] [toil.leader] The batch system left an empty file logs/toil_5afe8eb3-2e69-47a4-aa37-6802bd7a6bd2.6.3865603.out.log
[2024-07-03T13:31:46-0700] [MainThread] [W] [toil.leader] The batch system left a non-empty file logs/toil_5afe8eb3-2e69-47a4-aa37-6802bd7a6bd2.6.3865603.err.log:
[2024-07-03T13:31:46-0700] [MainThread] [W] [toil.leader] Log from job "kind-WDLTaskJob/instance-g25utj5q" follows:
=========>
    Traceback (most recent call last):
      File "/private/home/anovak/workspace/toil/venv/bin/_toil_worker", line 33, in <module>
        sys.exit(load_entry_point('toil', 'console_scripts', '_toil_worker')())
      File "/private/home/anovak/workspace/toil/src/toil/worker.py", line 772, in main
        job_store = Toil.resumeJobStore(options.jobStoreLocator)
      File "/private/home/anovak/workspace/toil/src/toil/common.py", line 1036, in resumeJobStore
        jobStore.resume()
      File "/private/home/anovak/workspace/toil/src/toil/jobStores/fileJobStore.py", line 130, in resume
        raise NoSuchJobStoreException(self.jobStoreDir, "file")
    toil.jobStores.abstractJobStore.NoSuchJobStoreException: The job store 'file:/data/tmp/tmpjn0glxf6/tree' does not exist, so there is nothing to restart.
<=========

The error message shouldn't say there is nothing to restart when the failure happens while a worker is trying to connect to the job store; it should say something that reflects that situation.
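
As a rough sketch of what a context-aware message could look like: the function below is hypothetical and not part of Toil; `missing_job_store_message` and its `on_worker` flag are names invented here for illustration. It only shows the idea of choosing the wording based on whether a worker or a restart attempt failed to find the job store.

```python
def missing_job_store_message(locator: str, on_worker: bool) -> str:
    """Build an error message that depends on who failed to find the job store.

    Hypothetical helper for illustration only; not an existing Toil function.
    """
    if on_worker:
        # A worker on another node could not see a job store the leader created.
        return (
            f"The job store '{locator}' is not visible from this worker. "
            "If the leader created it on node-local storage, use a job store "
            "on a filesystem shared with the worker nodes."
        )
    # The user asked to restart a workflow whose job store is gone.
    return f"The job store '{locator}' does not exist, so there is nothing to restart."
```

With wording like this, the traceback above would point at the shared filesystem problem instead of talking about restarts.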

Also, we should make it harder for the user to get into this situation, where Toil has selected a job store path that can't work. Maybe when the batch system is Slurm or one of the other grid-engine-style batch systems, toil-wdl-runner should pick a default job store in the current directory?
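
A minimal sketch of that default-selection idea, under stated assumptions: `default_job_store` and `SHARED_FS_BATCH_SYSTEMS` are hypothetical names, the set of batch system names is illustrative, and the sketch assumes the current directory is on a filesystem the worker nodes can see, which Toil could not guarantee.

```python
import os
import uuid

# Assumption: batch systems whose workers run on other nodes and therefore
# need a job store on a shared filesystem. The exact names are illustrative.
SHARED_FS_BATCH_SYSTEMS = {"slurm", "grid_engine", "lsf", "torque", "htcondor"}


def default_job_store(batch_system: str, cwd: str, tmp_dir: str) -> str:
    """Pick a default file job store locator (hypothetical helper).

    For cluster batch systems, prefer the current directory over the
    node-local temporary directory.
    """
    base = cwd if batch_system in SHARED_FS_BATCH_SYSTEMS else tmp_dir
    # A random suffix keeps concurrent workflows from colliding.
    return "file:" + os.path.join(base, f"toil-jobstore-{uuid.uuid4()}")
```

For example, `default_job_store("slurm", os.getcwd(), "/data/tmp")` would yield a locator under the current directory rather than under `/data/tmp`.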

We could also give the worker a special magic exit code to use to complain specifically that it can't reach the job store, prompting a useful error message from the leader.
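
The exit code idea might look roughly like the following. Everything here is an assumption for illustration: the constant `JOB_STORE_UNREACHABLE_EXIT_CODE`, its value, and both function names are made up and do not exist in Toil.

```python
from typing import Optional

# Hypothetical reserved exit code; the value 75 is arbitrary for this sketch.
JOB_STORE_UNREACHABLE_EXIT_CODE = 75


def worker_exit_code(job_store_reachable: bool) -> int:
    """Worker side: choose the exit code to report to the batch system."""
    return 0 if job_store_reachable else JOB_STORE_UNREACHABLE_EXIT_CODE


def explain_worker_exit(exit_code: int, locator: str) -> Optional[str]:
    """Leader side: turn the reserved exit code into a useful message."""
    if exit_code == JOB_STORE_UNREACHABLE_EXIT_CODE:
        return (
            f"A worker could not reach the job store '{locator}'. "
            "Check that it is on a filesystem shared with the worker nodes."
        )
    # Any other exit code is handled by the normal failure path.
    return None
```

The leader would then be able to report the problem even when no batch logs directory was configured, since the exit code alone carries the diagnosis.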

┆Issue is synchronized with this Jira Story
┆Issue Number: TOIL-1611

@unito-bot unito-bot added the wdl label Jul 3, 2024