toil-cwl-runner used to allow --defaultMemory 0, which has special meaning to Slurm, but now no longer does #4971

Open · gmloose opened this issue Jun 17, 2024 · 7 comments · May be fixed by #5052

gmloose (Contributor) commented Jun 17, 2024

I recently noticed that it is no longer possible to pass --defaultMemory 0 to toil-cwl-runner: a check now enforces that the value is 1 or more. AFAICS this poses a problem when using Slurm, because you need to pass --mem=0 to sbatch to instruct Slurm to use all available memory.

See also: https://matrix.to/#/!sSMDDqTXkxpNofyjTR:gitter.im/$2MQRCwBC00tLV7ODhHvTHt73jEccTQnm0fL3u7BYwPA?via=gitter.im&via=matrix.org
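
For concreteness, a minimal sketch of the kind of invocation that used to work, together with the Slurm behaviour being relied on; workflow.cwl, job.yml, and my_job.sh are placeholder names, not files from this report:

```bash
# Previously accepted: let jobs without an explicit memory requirement fall
# through to Slurm with --mem=0, i.e. "all memory available on the node".
toil-cwl-runner \
    --batchSystem slurm \
    --defaultMemory 0 \
    workflow.cwl job.yml

# The Slurm semantics being relied on: sbatch treats --mem=0 as a request
# for all of the memory on the allocated node.
sbatch --mem=0 my_job.sh
```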

Issue is synchronized with Jira Story TOIL-1590.

adamnovak (Member) commented:

According to the CWL spec on ResourceRequirement, jobs that don't have a ramMin are meant to get a default of 256 MiB. So I think that if Toil is letting --defaultMemory control that at all, it is deviating slightly from the spec, and if you write a workflow that needs that behavior to run, you'll end up with a "CWL" workflow that only Toil can run.

I don't think support for Slurm's all-available-memory feature was ever an intended feature of toil-cwl-runner, and if we want to bring it back we probably have to design it.

The CWL way to do this would be to have the workflow set both ramMin and ramMax, have Toil understand that (right now it just always uses ramMin), and then schedule the job on Slurm with a variable memory limit and expose back to the job how much RAM it actually got. But I don't think Slurm can take a range like that, so I guess we'd have to have Toil launch Slurm jobs with memory 0, then inspect how much memory each job actually has, kick it back to the leader to be re-scheduled with a flat memory limit of at least the min if it got less than that, and waste anything above the job's max. Toil might also have to constrain variable-memory jobs to run one per node, because otherwise it looks like Slurm might schedule two jobs on the same node and let them both see all the node's memory.

I'm not sure that we could use this variable-memory machinery outside of Slurm, since I'm not sure any other schedulers have a similar affordance.

How do you need this to work for your workflow? Which CWL jobs were getting --defaultMemory's 0 limit, and which weren't? Does your Slurm cluster automatically give whole nodes to jobs, are you adding extra arguments to make Toil ask for that, or are your jobs kept from all trying to run on the same node at once by some other resource requirement, like CPU?

Rather than building real support for variable memory limits, would it work to have Toil support a CWL extension requirement or hint that e.g. passes special Slurm flags for particular jobs? You could put --mem=0 --exclusive on particular CWL jobs, and Toil could be clever enough to suppress its own --mem option when it sees you have provided one. That might still break if Toil passes its own idea of the job's CWL memory limit through to e.g. a Docker container, but I think with Singularity we rely on Slurm's own memory confinement, so it would probably work.
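
For illustration only, a rough sketch of what such a per-job override would reduce to on the Slurm side; the extension hint does not exist yet, the numbers are made up, and job_wrapper.sh stands in for whatever command Toil would normally submit:

```bash
# Roughly what Toil submits today: a flat memory request derived from the
# job's ramMin (or --defaultMemory).
sbatch --mem=4096 --cpus-per-task=2 job_wrapper.sh

# Roughly what the hypothetical per-job override would submit instead: all
# of the node's memory, plus exclusive use of the node so that no second
# job also sees all of that memory.
sbatch --mem=0 --exclusive job_wrapper.sh
```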

adamnovak changed the title from "toil-cwl-runner does not allow --defaultMemory 0" to "toil-cwl-runner used to allow --defaultMemory 0, which has special meaning to Slurm, but now no longer does" on Jun 17, 2024
gmloose (Contributor, Author) commented Jun 18, 2024

The main reason for passing --defaultMemory 0 was to prevent Toil from requesting the default, limited amount of memory. This part of our code was written quite a while ago. While browsing the help, I noticed that toil-cwl-runner also has two "new" options, --dont_allocate_mem and --allocate_mem. The first one probably does exactly what we need.

I also noticed a bug in the documentation. The documentation for --allocate_mem is identical to that for --dont_allocate_mem.
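
Assuming --dont_allocate_mem does what its name and help text suggest, a sketch of how the old invocation could be replaced (workflow.cwl and job.yml are placeholder names):

```bash
# Sketch: run on Slurm without Toil passing any memory request to sbatch,
# instead of forcing the request to 0 via --defaultMemory.
toil-cwl-runner \
    --batchSystem slurm \
    --dont_allocate_mem \
    workflow.cwl job.yml
```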

@unito-bot

➤ Adam Novak commented:

Lon thinks a zero memory limit ought to be allowed, and represent not using a memory limit.

@unito-bot

➤ Adam Novak commented:

If we take a 0 memory limit in Toil, we will want to not just pass it straight to the backing batch system, but pick one batch system’s behavior (Slurm, or Kubernetes, or something else) and implement that on top of all the other batch systems. So if 0 on Slurm gets you a whole node’s memory reserved, we would add machinery to the Kubernetes batch system to reserve a whole Kubernetes node’s memory.

@unito-bot

➤ Adam Novak commented:

For batch systems we can’t (easily) support the desired behavior on, we should reject memory-0 jobs with some kind of not implemented error.

@unito-bot

➤ Adam Novak commented:

We should also find out from the CWL folks if CWL has its own semantics for a 0 memory job.

mr-c (Contributor) commented Jul 23, 2024

➤ Adam Novak commented:

We should also find out from the CWL folks if CWL has its own semantics for a 0 memory job.

Out-of-band configuration is not part of the CWL standards 😀

To model this scenario in a CWL ResourceRequirement, one would set ramMin and ramMax to null (or leave them unset).
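
A minimal sketch of what that looks like in a tool description (echo-tool.cwl is just an illustrative example, not a file from this issue): a ResourceRequirement that asserts cores but deliberately leaves ramMin and ramMax unset, so the runner's own default applies.

```bash
# Write a minimal CWL CommandLineTool whose ResourceRequirement leaves
# ramMin / ramMax unset (null), so the workflow itself asserts no RAM amount.
cat > echo-tool.cwl <<'EOF'
cwlVersion: v1.2
class: CommandLineTool
baseCommand: echo
requirements:
  ResourceRequirement:
    coresMin: 1          # ramMin / ramMax deliberately omitted
inputs:
  message:
    type: string
    inputBinding:
      position: 1
outputs: []
EOF
```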
