Allow CWL workflows to have jobs use all of a Slurm node's memory #5052
Conversation
…options and env vars
…ments use whole nodes' memory
I still need to manually test this to make sure it actually does what it is meant to do.
I wrote a test for this and it does indeed seem to issue jobs that ask for whole Slurm nodes when I use the two new options together. I also fixed Slurm job cleanup when a workflow is killed; it wasn't doing that before because shutdown() wasn't doing any killing in …
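For context, here is a minimal sketch of the kind of cleanup being described, assuming the batch system tracks the Slurm job IDs it has issued and that scancel is used to kill them. This is illustrative only, not the actual Toil implementation:

```python
import subprocess
from typing import Iterable


def shutdown_cleanup_sketch(issued_slurm_job_ids: Iterable[int]) -> None:
    """Cancel any Slurm jobs still outstanding when the workflow is killed."""
    job_ids = [str(job_id) for job_id in issued_slurm_job_ids]
    if job_ids:
        # scancel accepts one or more job IDs and asks Slurm to terminate them.
        subprocess.run(["scancel", *job_ids], check=False)
```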
@DailyDreaming Can you review this?
LGTM. Some minor clean-up.
docs/appendices/environment_vars.rst
Outdated
@@ -131,6 +131,9 @@ There are several environment variables that affect the way Toil runs.
 | TOIL_GOOGLE_PROJECTID            | The Google project ID to use when generating        |
 |                                  | Google job store names for tests or CWL workflows.  |
 +----------------------------------+------------------------------------------------------+
 | TOIL_SLURM_ALLOCATE_MEM          | Whether to akllocate memory in Slurm with --mem.     |
Typo: "akllocate" should be "allocate".
docs/appendices/environment_vars.rst
Outdated
@@ -143,6 +146,8 @@ There are several environment variables that affect the way Toil runs.
 |                                  | jobs.                                                |
 |                                  | There is no default value for this variable.         |
 +----------------------------------+------------------------------------------------------+
 | TOIL_SLURM_TIME                  | Slurm job time limit, in [DD-]HH:MM:SS format.       |
An example might be useful (literally adding e.g. TOIL_SLURM_TIME="04:00:00" for four hours, assuming that produces a 4-hour time limit), or a link to the Slurm docs if this is a direct alias for a Slurm argument.
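To make the suggested example concrete, here is a small illustrative parser (not Toil's code) for the documented [DD-]HH:MM:SS format, under which TOIL_SLURM_TIME="04:00:00" would mean a 4-hour limit:

```python
import re

# Illustrative only: matches the [DD-]HH:MM:SS limit format documented above.
SLURM_TIME_PATTERN = re.compile(r"^(?:(\d+)-)?(\d{1,2}):(\d{2}):(\d{2})$")


def slurm_time_to_seconds(limit: str) -> int:
    """E.g. slurm_time_to_seconds("04:00:00") == 14400, i.e. a 4-hour limit."""
    match = SLURM_TIME_PATTERN.match(limit)
    if match is None:
        raise ValueError(f"Not in [DD-]HH:MM:SS format: {limit!r}")
    days, hours, minutes, seconds = (int(part) if part else 0 for part in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds
```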
docs/cwl/running.rst
Outdated
 Besides the normal Toil options and the options supported by cwltool, toil-cwl-runner adds some of its own options:
 --bypass-file-store   Do not use Toil's file store and assume all paths are accessible in place from all nodes.
Maybe add a further explanation that files will not be redundantly copied to a potentially remote file store, which can speed up the workflow (as a reason why someone might want this). It has the downside that the workflow is not restartable from the file store if it made progress and then failed part-way through.
docs/running/cliOptions.rst
Outdated
 --slurmTime SLURM_TIME
                       Slurm job time limit, in [DD-]HH:MM:SS format.
 --slurmPE SLURM_PE    Special partition to send Slurm jobs to if they ask
                       for more than 1 CPU.
Could this be explained further?
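As a rough illustration of what the --slurmPE help text seems to describe (a guess at the intent, not Toil's implementation): jobs that request more than one CPU get routed to the named partition, and everything else stays on the default.

```python
from typing import Optional


def choose_partition_sketch(cpus: int, default_partition: str, slurm_pe: Optional[str]) -> str:
    """Hypothetical routing rule suggested by the --slurmPE help text."""
    if slurm_pe is not None and cpus > 1:
        return slurm_pe
    return default_partition


# Example: a 4-CPU job goes to the special partition, a 1-CPU job does not.
print(choose_partition_sketch(4, "batch", "multicore"))  # -> "multicore"
print(choose_partition_sketch(1, "batch", "multicore"))  # -> "batch"
```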
@@ -417,11 +433,11 @@ def issueBatchJob(self, command: str, jobDesc, job_environment: Optional[Dict[st
         else:
             gpus = jobDesc.accelerators
This code looks like it was maybe left in and is a function now (i.e. self.count_needed_gpus)?
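A minimal sketch of the refactor this comment points at, assuming a count_needed_gpus helper (the name comes from the comment itself) is meant to own the accelerator-counting branch; the accelerator dictionary shape used here is an assumption for illustration:

```python
from typing import Any, Dict, List


def count_needed_gpus_sketch(accelerators: List[Dict[str, Any]]) -> int:
    """Hypothetical helper: total GPU count requested by a job's accelerators."""
    return sum(acc.get("count", 1) for acc in accelerators if acc.get("kind") == "gpu")


# Instead of recomputing this inline in issueBatchJob, the caller would do
# something like: gpus = self.count_needed_gpus(jobDesc)
print(count_needed_gpus_sketch([{"kind": "gpu", "count": 2}]))  # -> 2
```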
src/toil/options/cwl.py
Outdated
         "--runImportsOnWorkers", "--run-imports-on-workers",
         action="store_true",
         default=False,
         help=suppress_help or "Run the file imports on a worker instead of the leader. This is useful if the leader is not optimized for high network performance."
Space needed at the end of this sentence.
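The missing space matters because adjacent Python string literals are concatenated with no separator, so when a help string is continued on the next line the two sentences run together. A quick demonstration:

```python
# Without a trailing space, implicit string concatenation glues the sentences:
broken = (
    "Run the file imports on a worker instead of the leader."
    "This is useful if the leader is not optimized for high network performance."
)
fixed = (
    "Run the file imports on a worker instead of the leader. "
    "This is useful if the leader is not optimized for high network performance."
)
print(broken)  # ...leader.This is useful...
print(fixed)   # ...leader. This is useful...
```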
src/toil/options/cwl.py
Outdated
     )
     parser.add_argument(
         "--importWorkersDisk", "--import-workers-disk",
         help=suppress_help or "Specify the amount of disk space an import worker will use. If file streaming for input files is not available,"
Space needed at the end of this sentence.
src/toil/test/cwl/cwlTest.py
Outdated
         # Reap it
         child.wait()
         # The test passes
         return
I think that this return is a no-op?
I think I put it there to make it clear that we intend to reach this point and end the test. But if you think it makes more sense without the return, I'll drop it.
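For what it's worth, the two forms behave identically; a bare return at the end of a function body is a no-op, so this is purely a readability choice:

```python
def test_with_explicit_return() -> None:
    result = 1 + 1  # stand-in for the real test body
    assert result == 2
    # The test passes
    return  # explicit, but changes nothing


def test_without_return() -> None:
    result = 1 + 1
    assert result == 2
    # The test passes


# Both return None and end in the same place.
assert test_with_explicit_return() is None
assert test_without_return() is None
```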
This should fix #4971. Instead of --defaultMemory 0, to run CWL jobs that lack their own ramMin with a full Slurm node's memory, you would now pass --no-cwl-default-ram --slurmDefaultAllMem=True.

This might cause some new problems: CWL jobs without their own ramMin now get the CWL spec's default memory instead of the Toil --defaultMemory unless the user passes --no-cwl-default-ram. Previously I think we were ignoring the spec and always using the Toil --defaultMemory. This might break some workflow runs that used to work because of us giving them more memory than the spec said to.

Also, #4971 says we're supposed to implement a real framework for doing this kind of memory expansion across all batch systems that support it. But I didn't want to add a new bool flag onto Requirer for such a specific purpose. Probably if we need it we should combine it with preemptible somehow into a tag/flag system. Or we could implement memory range requirements and allow the top of the range to be unbounded, or treat some threshold upper limit as "all the node's memory" in the Slurm batch system.

Changelog Entry
To be copied to the draft changelog by merger:
- A new --slurmDefaultAllMem option runs jobs lacking their own memory requirements with Slurm's --mem=0, so they get a whole node's memory.
- toil-cwl-runner now has --no-cwl-default-ram (and --cwl-default-ram) to control whether the CWL spec's default ramMin is applied, or Toil's own default memory logic is used.
- The --dont_allocate_mem and --allocate_mem options have been deprecated and replaced with --slurmAllocateMem, which can be True or False.
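Putting the options from the description and changelog together, here is a hedged sketch of how an invocation might look; workflow.cwl and inputs.yaml are placeholder names, and the flag spellings are taken from this PR's text rather than released documentation:

```python
import subprocess

# Illustrative invocation only; the new flag names come from this PR.
cmd = [
    "toil-cwl-runner",
    "--batchSystem", "slurm",
    "--no-cwl-default-ram",       # don't apply the CWL spec's default ramMin
    "--slurmDefaultAllMem=True",  # jobs with no memory requirement get --mem=0
    "workflow.cwl",               # placeholder workflow
    "inputs.yaml",                # placeholder inputs
]
subprocess.run(cmd, check=True)
```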