toil-cwl-runner used to allow --defaultMemory 0, which has special meaning to Slurm, but now no longer does #4971

Open · gmloose opened this issue Jun 17, 2024 · 7 comments · May be fixed by #5052

gmloose (Contributor) commented Jun 17, 2024

I recently noticed that it is no longer possible to pass --defaultMemory 0 to toil-cwl-runner: a check now enforces that the value is 1 or more. AFAICS this poses a problem when using Slurm, because you need to pass --mem=0 to sbatch to instruct Slurm to use all available memory.

See also: https://matrix.to/#/!sSMDDqTXkxpNofyjTR:gitter.im/$2MQRCwBC00tLV7ODhHvTHt73jEccTQnm0fL3u7BYwPA?via=gitter.im&via=matrix.org
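
For concreteness, a minimal sketch of the kind of invocation that used to work, together with the Slurm behaviour being relied on; workflow.cwl, job.yml, and my_job.sh are placeholder names, not files from this report:

```bash
# Previously accepted: let jobs without an explicit memory requirement fall
# through to Slurm with --mem=0, i.e. "all memory available on the node".
toil-cwl-runner \
    --batchSystem slurm \
    --defaultMemory 0 \
    workflow.cwl job.yml

# The Slurm semantics being relied on: sbatch treats --mem=0 as a request
# for all of the memory on the allocated node.
sbatch --mem=0 my_job.sh
```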

Issue is synchronized with Jira Story TOIL-1590.

adamnovak (Member) commented:

According to the CWL spec on ResourceRequirement, jobs that don't have a ramMin are meant to get a default of 256 MiB. So I think that if Toil is letting --defaultMemory control that at all, it is deviating slightly from the spec, and if you write a workflow that needs that behavior to run, you'll end up with a "CWL" workflow that only Toil can run.

I don't think support for Slurm's all-available-memory feature was ever an intended feature of toil-cwl-runner, and if we want to bring it back we probably have to design it.

The CWL way to do this would be to have the workflow set both ramMin and ramMax, have Toil understand that (right now it just always uses ramMin), and then schedule the job on Slurm with a variable memory limit and expose back to the job how much RAM it actually got. But I don't think Slurm can take a range like that, so I guess we'd have to have Toil launch Slurm jobs with memory 0, then inspect how much memory each job actually has, kick it back to the leader to be re-scheduled with a flat memory limit of at least the min if it got less than that, and waste anything above the job's max. Toil might also have to constrain variable-memory jobs to run one per node, because otherwise it looks like Slurm might schedule two jobs on the same node and let them both see all the node's memory.

I'm not sure that we could use this variable-memory machinery outside of Slurm, since I'm not sure any other schedulers have a similar affordance.

How do you need this to work for your workflow? Which CWL jobs were getting --defaultMemory's 0 limit, and which weren't? Does your Slurm cluster automatically give whole nodes to jobs, are you adding extra arguments to make Toil ask for that, or are your jobs kept from all trying to run on the same node at once by some other resource requirement, like CPU?

Rather than building real support for variable memory limits, would it work to have Toil support a CWL extension requirement or hint that e.g. passes special Slurm flags for particular jobs? You could put --mem=0 --exclusive on particular CWL jobs, and Toil could be clever enough to suppress its own --mem option when it sees you have provided one. That might still break if Toil passes its own idea of the job's CWL memory limit through to e.g. a Docker container, but I think with Singularity we rely on Slurm's own memory confinement, so it would probably work.
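
For illustration only, a rough sketch of what such a per-job override would reduce to on the Slurm side; the extension hint does not exist yet, the numbers are made up, and job_wrapper.sh stands in for whatever command Toil would normally submit:

```bash
# Roughly what Toil submits today: a flat memory request derived from the
# job's ramMin (or --defaultMemory).
sbatch --mem=4096 --cpus-per-task=2 job_wrapper.sh

# Roughly what the hypothetical per-job override would submit instead: all
# of the node's memory, plus exclusive use of the node so that no second
# job also sees all of that memory.
sbatch --mem=0 --exclusive job_wrapper.sh
```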

adamnovak changed the title from "toil-cwl-runner does not allow --defaultMemory 0" to "toil-cwl-runner used to allow --defaultMemory 0, which has special meaning to Slurm, but now no longer does" on Jun 17, 2024
gmloose (Contributor, Author) commented Jun 18, 2024

The main reason for passing --defaultMemory 0 was to prevent Toil from requesting the default, limited amount of memory. This part of our code was written quite a while ago. While browsing the help, I noticed that toil-cwl-runner also has two "new" options, --dont_allocate_mem and --allocate_mem. The first one probably does exactly what we need.

I also noticed a bug in the documentation. The documentation for --allocate_mem is identical to that for --dont_allocate_mem.
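
Assuming --dont_allocate_mem does what its name and help text suggest, a sketch of how the old invocation could be replaced (workflow.cwl and job.yml are placeholder names):

```bash
# Sketch: run on Slurm without Toil passing any memory request to sbatch,
# instead of forcing the request to 0 via --defaultMemory.
toil-cwl-runner \
    --batchSystem slurm \
    --dont_allocate_mem \
    workflow.cwl job.yml
```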

@unito-bot

➤ Adam Novak commented:

Lon thinks a zero memory limit ought to be allowed, and represent not using a memory limit.

@unito-bot

➤ Adam Novak commented:

If we take a 0 memory limit in Toil, we will want to not just pass it straight to the backing batch system, but pick one batch system’s behavior (Slurm, or Kubernetes, or something else) and implement that on top of all the other batch systems. So if 0 on Slurm gets you a whole node’s memory reserved, we would add machinery to the Kubernetes batch system to reserve a whole Kubernetes node’s memory.

@unito-bot

➤ Adam Novak commented:

For batch systems we can’t (easily) support the desired behavior on, we should reject memory-0 jobs with some kind of not implemented error.

@unito-bot

➤ Adam Novak commented:

We should also find out from the CWL folks if CWL has its own semantics for a 0 memory job.

mr-c (Contributor) commented Jul 23, 2024

➤ Adam Novak commented:

We should also find out from the CWL folks if CWL has its own semantics for a 0 memory job.

Out-of-band configuration is not part of the CWL standards 😀

To model this scenario in a CWL ResourceRequirement, one would set ramMin and ramMax to null (or leave them unset).
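
A minimal sketch of what that looks like in a tool description (echo-tool.cwl is just an illustrative example, not a file from this issue): a ResourceRequirement that asserts cores but deliberately leaves ramMin and ramMax unset, so the runner's own default applies.

```bash
# Write a minimal CWL CommandLineTool whose ResourceRequirement leaves
# ramMin / ramMax unset (null), so the workflow itself asserts no RAM amount.
cat > echo-tool.cwl <<'EOF'
cwlVersion: v1.2
class: CommandLineTool
baseCommand: echo
requirements:
  ResourceRequirement:
    coresMin: 1          # ramMin / ramMax deliberately omitted
inputs:
  message:
    type: string
    inputBinding:
      position: 1
outputs: []
EOF
```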
