
Move file virtualization in toil-wdl-runner to task boundaries #5028

Open · wants to merge 41 commits into master

Conversation

stxue1 (Contributor) commented Jul 19, 2024

This should solve #5004

This also reverts #4994 and introduces another fix for #4988; there should be behavioral parity with miniwdl

This should also resolve #5031

This is a bit of an overhaul of how we handle files in toil-wdl-runner: instead of immediately virtualizing upon seeing a File type (coerced or not), the WDL value is kept as a plain path until the last moment, and files are only virtualized right before tasks are sent off. This way the workflow-level WDL functionality stays the same while allowing File-to-String coercion, since we no longer immediately replace all File paths with our virtualized representation.
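As a toy model of the idea (illustrative only; the names and the token format are stand-ins, not Toil's real code), a File keeps its plain path through workflow-level evaluation and is only swapped for a virtualized token at the task boundary:

```python
def virtualize(path: str) -> str:
    # Stand-in for uploading to the job store and getting back a file ID.
    return f"toilfile:{abs(hash(path)) % 10000}"

class FileValue:
    def __init__(self, path: str):
        self.value = path  # stays a real path at workflow level

    def coerce_to_string(self) -> str:
        # File -> String coercion now sees the original path, not a token
        return self.value

f = FileValue("/data/sample.txt")
assert f.coerce_to_string() == "/data/sample.txt"  # coercion yields the path
task_input = virtualize(f.value)                   # only at the task boundary
assert task_input.startswith("toilfile:")
```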

Monkeypatching is removed: since we no longer virtualize at the coercion step, virtualization instead happens before/after function calls and manually at task boundaries.

I don't think this will increase the amount of I/O. One edge case is if the user changes the filename right before it is passed into a function; Toil will virtualize the new filename for that function call and keep both the original and the newly virtualized file around in the jobstore until the job is completed. It may be worth removing the new virtualized file from the jobstore after use, but I'm not sure where to hook in that behavior.

The downside is that all jobs that are marked as local must always be local in the future. I explicitly set the flag in the class constructors where I could. I think the virtualization behavior should ensure that local jobs still work on workers.

The function evaluate_bindings_from_decls is now used for all decl parsing/evaluation to reduce duplicative code.

Changelog Entry

To be copied to the draft changelog by merger:

  • File virtualization in toil-wdl-runner now only happens at task boundaries
    • File to String coercion should be supported

Reviewer Checklist

  • Make sure it is coming from issues/XXXX-fix-the-thing in the Toil repo, or from an external repo.
    • If it is coming from an external repo, make sure to pull it in for CI with:
      contrib/admin/test-pr otheruser theirbranchname issues/XXXX-fix-the-thing
      
    • If there is no associated issue, create one.
  • Read through the code changes. Make sure that it doesn't have:
    • Addition of trailing whitespace.
    • New variable or member names in camelCase that want to be in snake_case.
    • New functions without type hints.
    • New functions or classes without informative docstrings.
    • Changes to semantics not reflected in the relevant docstrings.
    • New or changed command line options for Toil workflows that are not reflected in docs/running/{cliOptions,cwl,wdl}.rst
    • New features without tests.
  • Comment on the lines of code where problems exist with a review comment. You can shift-click the line numbers in the diff to select multiple lines.
  • Finish the review with an overall description of your opinion.

Merger Checklist

  • Make sure the PR passes tests.
  • Make sure the PR has been reviewed since its last modification. If not, review it.
  • Merge with the GitHub "Squash and merge" feature.
    • If there are multiple authors' commits, add Co-authored-by to give credit to all contributing authors.
  • Copy its recommended changelog entry to the Draft Changelog.
  • Append the issue number in parentheses to the changelog entry.

…luation. Also gets rid of monkeypatching in favor of a manual function call
@stxue1 stxue1 marked this pull request as draft July 23, 2024 03:17
…alid coerced-to-null files and raise if exception found
@stxue1 stxue1 marked this pull request as ready for review July 25, 2024 01:14
@stxue1 stxue1 linked an issue Jul 25, 2024 that may be closed by this pull request
adamnovak (Member) left a comment

I'm concerned that by leaving relative URLs from the input JSON as relative URLs in WDL Files and then trying to figure out where they were meant to be relative to later, we've introduced a whole universe of cross-talk that we don't want to deal with. Maybe we can poll for where they ought to come from and turn them into absolute URLs before handing them to the workflow?

I like that we're ripping out the coercion monkey-patch and also nonexistent:, which simplifies things. But are we sure that we retain support for using Toil's supported URI schemes when MiniWDL reads a file? File the_file = "gs://google-bucket-name/filename.txt" ought to work if Toil has Google Storage support installed, even if MiniWDL doesn't know about reading Google Storage URLs. I think the coercion hook was doing that for us and I can't really tell if something is taking its place. If we're losing that feature and we think it is worth it, we should know we're losing it.
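The concern above could be sketched roughly as follows (illustrative only, not Toil's actual API; the scheme set is an assumption for the example). Before handing a File to MiniWDL's reader, something has to notice that its URI uses a scheme MiniWDL cannot read and route it through Toil's URL support instead:

```python
from urllib.parse import urlparse

# Assumed, for illustration: schemes Toil can read but MiniWDL won't.
TOIL_ONLY_SCHEMES = {"s3", "gs", "wasb"}

def needs_toil_fetch(uri: str) -> bool:
    """True if this File should be read via Toil's URL machinery."""
    return urlparse(uri).scheme in TOIL_ONLY_SCHEMES

assert needs_toil_fetch("gs://google-bucket-name/filename.txt")
assert not needs_toil_fetch("/local/path/filename.txt")
```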

I'd also like to see a new comment at the top describing how the system works now, and what invariants or abstractions it imposes that new code is going to need to follow to keep it working. File holds a URL or leader-local path at the workflow level. When we actually read it with read_lines() at the workflow level, does MiniWDL fetch it? If we read_lines() it twice, do we have a cache? If we load it into the file store when it enters a task, and it is passed out of the task unmodified, do we see the original URL in the task output (because everything at workflow level is always an original URL) or the imported file store one?

Is talking about "virtualized" (visible to WDL code) and "devirtualized" (visible to Python open()) file names still the right way to understand what the code is doing? Or should we be using different terms or concepts to think about the packing and unpacking that happens when taking Bindings in and out of tasks?

I also like the idea of wdl_options to hold the global settings for the run, kind of like the CWL RuntimeContext. But I think it needs a type so we can have a better handle on what's allowed to be in there.

Comment on lines 1605 to 1625
if isinstance(value, WDL.Value.File):
    pass
elif isinstance(value, WDL.Value.Array) and isinstance(expected_type, WDL.Type.Array):
    for elem, orig_elem in zip(value.value, original_value.value):
        map_over_files_in_value_check_null_type(elem, orig_elem, expected_type.item_type)
elif isinstance(value, WDL.Value.Map) and isinstance(expected_type, WDL.Type.Map):
    for pair, orig_pair in zip(value.value, original_value.value):
        # The key of the map cannot be optional or else it is not serializable, so we only need to check the value
        map_over_files_in_value_check_null_type(pair[1], orig_pair[1], expected_type.item_type[1])
elif isinstance(value, WDL.Value.Pair) and isinstance(expected_type, WDL.Type.Pair):
    map_over_files_in_value_check_null_type(value.value[0], original_value.value[0], expected_type.left_type)
    map_over_files_in_value_check_null_type(value.value[1], original_value.value[1], expected_type.right_type)
elif isinstance(value, WDL.Value.Struct) and isinstance(expected_type, WDL.Type.StructInstance):
    for (k, v), (_, orig_v) in zip(value.value.items(), original_value.value.items()):
        # The parameters method for WDL.Type.StructInstance returns the values rather than the dictionary
        # While dictionaries are ordered, this should be more robust; the else branch should never be hit
        if expected_type.members is not None:
            map_over_files_in_value_check_null_type(v, orig_v, expected_type.members[k])
elif isinstance(value, WDL.Value.Null):
    if not expected_type.optional:
        raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), original_value.value)
Comment (Member):
Having one version of this traversal logic is what the map function was supposed to be for. But I guess here we're traversing two mirrored structures in parallel so we can't use the other code.

adamnovak (Member) commented Aug 1, 2024

@stxue1 Another thing I thought of is, if we need to make local=True always run a job on the leader and never run the job elsewhere, as part of this, then we need to remove the command line option that lets you run local jobs on workers. Otherwise using it will break WDL workflows.

We also might need to revise the batch system documentation, and maybe the inheritance hierarchy, since you would need all batch systems, even those coming from plugins, to always have the local job logic.

adamnovak (Member):
We think this will also fix #4992.

adamnovak (Member):
@stxue1 If this fixes #5031, it also needs to enable the test from the conformance tests; it looks like this doesn't touch the tests file.

stxue1 (Contributor, Author) commented Sep 6, 2024

> @stxue1 Another thing I thought of is, if we need to make local=True always run a job on the leader and never run the job elsewhere, as part of this, then we need to remove the command line option that lets you run local jobs on workers. Otherwise using it will break WDL workflows.

We seem to have a command line option for CWL (runCwlInternalJobsOnWorkers), but there doesn't seem to be a WDL equivalent.

adamnovak (Member) left a comment

I think the WDL options setup is not designed quite right, and it also looks like we're able to import but then forget about whole copies of files.

# execution_dir: Directory to use as the working directory for workflow code.
# enforce_existence: If true, then if a file is detected as nonexistent, raise an error. Else, let it pass through
# share_files_with: If set to an existing standard library instance, use the same file upload and download paths as it.
WDL_OPTIONS = TypedDict('WDL_OPTIONS', {"execution_dir": NotRequired[str], "container": NotRequired[str],
Comment (Member):
This should be WDLOptions since it is a type name, not a constant.

@@ -443,7 +460,17 @@ async def toil_read_source(uri: str, path: List[str], importer: Optional[WDL.Tre
raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), uri)



def virtualized_equal(file1: WDL.Value.Base, file2: WDL.Value.Base) -> bool:
Comment (Member):
file1 and file2 aren't necessarily intended to themselves be Files, right? They are meant to be any WDL value that maybe contains Files. So probably the variable names should be something that reflects that.


- def map_over_typed_files_in_binding(binding: WDL.Env.Binding[WDL.Value.Base], transform: Callable[[WDL.Type.Base, str], Optional[str]]) -> WDL.Env.Binding[WDL.Value.Base]:
+ def map_over_typed_files_in_binding(binding: WDL.Env.Binding[WDL.Value.Base], transform: Callable[[WDL.Value.File], Optional[WDL.Value.File]]) -> WDL.Env.Binding[WDL.Value.Base]:
Comment (Member):
And similarly these other _typed variants don't need that part of the name.

@@ -2221,7 +2412,7 @@ def run(self, file_store: AbstractFileStore) -> Promised[WDLBindings]:
# Set up the WDL standard library
# UUID to use for virtualizing files
# We process nonexistent files in WDLTaskWrapperJob as those must be run locally, so don't try to devirtualize them
- standard_library = ToilWDLStdLibBase(file_store, self._task_path, enforce_existence=False)
+ standard_library = ToilWDLStdLibBase(file_store, wdl_options={"task_path": self._wdl_options["task_path"]})
Comment (Member):
Why do we need to throw away the rest of the options here?

command_wdl_options: WDL_OPTIONS = {"task_path": task_path}
if workdir_in_container is not None:
command_wdl_options["execution_dir"] = workdir_in_container
command_library = ToilWDLStdLibTaskCommand(file_store, task_container, wdl_options=command_wdl_options)
Comment (Member):
Here also we're not matching the pattern of the WDL options being a sort of global or at least mostly-global context.

passed_down_bindings = incoming_bindings.enter_namespace(self._node.name)
task_path = self._wdl_options.get("task_path")
wdl_options = self._wdl_options
wdl_options["task_path"] = f'{task_path}.{self._node.name}'
Comment (Member):
Mutating the current wdl_options like this is going to affect what standard_library sees as the current task_path.

If we have to pass wdl_options down modified, I would recommend acting on a modified copy.
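The suggestion above could be sketched as follows (names are illustrative, not the PR's actual helpers): derive a child options dict instead of mutating the one the standard library still holds.

```python
def with_child_task_path(wdl_options: dict, node_name: str) -> dict:
    child = dict(wdl_options)  # shallow copy; parent's task_path is untouched
    child["task_path"] = f"{child['task_path']}.{node_name}"
    return child

parent = {"task_path": "wf"}
child = with_child_task_path(parent, "step1")
assert child["task_path"] == "wf.step1"
assert parent["task_path"] == "wf"  # what standard_library sees is unchanged
```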

Comment (Member):
And if we are going to modify it as we go down the tree to update the task path, why don't we also put the namespace in it?

Comment on lines 101 to 108
# WDL options to pass into the WDL jobs and standard libraries
# task_path: Dotted WDL name of the part of the workflow this library is working for.
# execution_dir: Directory to use as the working directory for workflow code.
# enforce_existence: If true, then if a file is detected as nonexistent, raise an error. Else, let it pass through
# share_files_with: If set to an existing standard library instance, use the same file upload and download paths as it.
WDL_OPTIONS = TypedDict('WDL_OPTIONS', {"execution_dir": NotRequired[str], "container": NotRequired[str],
"share_files_with": NotRequired["ToilWDLStdLibBase"], "task_path": str})

Comment (Member):
I don't think the design here is right.

If we want this to be "options", it should contain only things that are global constants across the workflow execution, which here would be the execution_dir (leader working directory) and the container (which container runner to use).

I think it does make sense to have it also carry task_path so we don't need to keep passing the task path around through everything and manually giving it to the standard library. But then we should probably call this a wdl_context rather than wdl_options, and it would have a tree structure: you derive a child context that is more specific from a parent context that is more general, and you use it for part of the workflow tree, mirroring the dotted task name structure. Also in that case we would also want to carry namespace alongside task_path.

I think it makes no sense at all to have either enforce_existence or share_files_with as part of this. They don't apply to a whole workflow or to a subtree of a workflow; if you actually passed one of them into a job from above there'd be no way to sensibly respect it. They should go back to being separate arguments on the standard library/other file access functions.

If we make those changes, then we can have a nice, controlled way of working with the options/context information. We always either pass it straight through, or make a copy of it and adjust it to make the task_path/namespace more specific. We only ever create it once, on the leader. Then we never have to worry about whether we are looking at the real options/context passed down from the top or one that was made as a dict literal in a function call.
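The proposed context design could be sketched like this (illustrative only, not existing Toil code): globals for the run plus a position in the workflow tree, with child contexts derived immutably rather than by mutating a shared dict.

```python
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class WDLContext:
    execution_dir: Optional[str]  # global constant for the run
    container: Optional[str]      # global constant for the run
    task_path: str                # position in the workflow tree
    namespace: str                # carried alongside task_path, as suggested

    def derive(self, name: str) -> "WDLContext":
        # Child context: more specific tree position, same globals.
        return replace(self, task_path=f"{self.task_path}.{name}",
                       namespace=f"{self.namespace}.{name}")

root = WDLContext(None, "docker", "wf", "wf")
child = root.derive("align")
assert child.task_path == "wf.align"
assert child.container == "docker"  # globals pass straight through
```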
