Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use sentinel value to deal with optional coerced files #4994

Merged
merged 19 commits into from
Jul 19, 2024

Conversation

stxue1
Copy link
Contributor

@stxue1 stxue1 commented Jun 28, 2024

This closes #4988

Changelog Entry

To be copied to the draft changelog by merger:

  • toil-wdl-runner will not immediately error on nonexistent coerced files until outputted
    • File? type for string to file coercion is now supported (will be nullified)

Reviewer Checklist

  • Make sure it is coming from issues/XXXX-fix-the-thing in the Toil repo, or from an external repo.
    • If it is coming from an external repo, make sure to pull it in for CI with:
      contrib/admin/test-pr otheruser theirbranchname issues/XXXX-fix-the-thing
      
    • If there is no associated issue, create one.
  • Read through the code changes. Make sure that it doesn't have:
    • Addition of trailing whitespace.
    • New variable or member names in camelCase that want to be in snake_case.
    • New functions without type hints.
    • New functions or classes without informative docstrings.
    • Changes to semantics not reflected in the relevant docstrings.
    • New or changed command line options for Toil workflows that are not reflected in docs/running/{cliOptions,cwl,wdl}.rst
    • New features without tests.
  • Comment on the lines of code where problems exist with a review comment. You can shift-click the line numbers in the diff to select multiple lines.
  • Finish the review with an overall description of your opinion.

Merger Checklist

  • Make sure the PR passes tests.
  • Make sure the PR has been reviewed since its last modification. If not, review it.
  • Merge with the Github "Squash and merge" feature.
    • If there are multiple authors' commits, add Co-authored-by to give credit to all contributing authors.
  • Copy its recommended changelog entry to the Draft Changelog.
  • Append the issue number in parentheses to the changelog entry.

Copy link
Member

@adamnovak adamnovak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty good, but I think it needs more documentation for the new parameters it adds. I also am concerned we might be missing a virtualization pass that we still need. (So some high-level documentation in a long comment about what our general plan is with regard to when files are or are not local paths, and when they are or are not null, and when they are or are not the magic sentinel URIs, might be useful.)

@@ -622,6 +624,8 @@ def evaluate_output_decls(output_decls: List[WDL.Tree.Decl], all_bindings: WDL.E
output_bindings: WDL.Env.Bindings[WDL.Value.Base] = WDL.Env.Bindings()
for output_decl in output_decls:
output_value = evaluate_decl(output_decl, all_bindings, standard_library)
drop_if_missing_with_workdir = partial(drop_if_missing, work_dir=getattr(standard_library, "_execution_dir"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we want to make standard_library a ToilWDLStdLibBase so we don't need getattr.

Also, it starts with _ and so maybe isn't really meant to be accessed here; maybe we want to be calling a method that gets this instead? Or maybe we want to rename the attribute if we need it outside the stdlib class itself?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would probably be better if all standard_library's were typed as ToilWDLStdLibBase; this was mainly done for mypy as mypy thinks the type is WDL.StdLib.Base and not ToilWDLStdLibBase.

I can probably turn the gettr into a method, although this current logic may not be sufficient. Another issue I found is that this will coerce all nonexistent files to null, even if the specified type is not optional. For example:

...
output {
  File f = "nonexistent.txt"
}
...

This wasn't an issue previously as we ensured file existence at type coercion.

Copy link
Contributor Author

@stxue1 stxue1 Jul 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This issue actually is #5006

Comment on lines 725 to 728
def __init__(self, file_store: AbstractFileStore, execution_dir: Optional[str] = None, enforce_nonexistence: bool = True):
"""
Set up the standard library.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does enforce_nonexistence actually do? Does it cause things to not exist? What happens if we don't enforce it?

I find the name a bit confusing so it definitely needs some documentation.

Comment on lines 797 to 799
enforce_nonexistence: bool = True) -> str:
"""
Download or export a WDL virtualized filename/URL to the given directory.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would also want to document enforce_nonexistence in this docstring.

@@ -888,6 +899,7 @@ def devirtualize_to(filename: str, dest_dir: str, file_source: Union[AbstractFil
logger.debug("Virtualized file %s is already a local path", filename)

if not os.path.exists(result):
# Devirtualizing an unvirtualized file means the file is coerced from a string and never used/virtualized
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the real point of this block is to catch cases where such a string has made it from one context to another without going through the virtualization/devirtualization that would ensure the file it refers to is actually available too.

Do we now expect to be passing a bunch of local paths to the devirtualize function? If we mostly expect to hit this block when we try to access a string filename that doesn't exist, more than we expect to hit it when a file has escaped its node without being virtualized, then maybe we want to change the resulting error message?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment was carried through accidentally from a previous implementation before using sentinel values.

Comment on lines 1419 to 1422
def drop_if_missing(value_type: WDL.Type.Base, filename: str, work_dir: str) -> Optional[str]:
"""
Return None if a file doesn't exist, or its path if it does.
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring here should document value_type and work_dir. It seems like maybe the filename is assumed to belong to a WDL value of the given type (which might be File or File?), and that relative paths are interpreted relative to work_dir, but that should all be in the docstring.

Also probably filename should be the first argument, since that's what we're really operating on, and then value_type which is extra information about this file for error reporting, and finally work_dir since that's environment-level information that would stay the same over multiple calls (and could even sensibly have a default value of ".").

I guess we got this argument order from map_over_typed_files_in_bindings, so maybe we just need to leave it. Or we could change the order map_over_typed_files_in_bindings passes things in.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing the argument order of map_over_typed_files_in_value may require tracking down all transform functions, so maybe I'll leave the current argument order. Maybe there's a reason those transform functions wanted the type first?

Comment on lines 1809 to 1827
run_job = WDLTaskJob(self._task, virtualize_files(bindings, standard_library), virtualize_files(runtime_bindings, standard_library), self._task_id, self._namespace, self._task_path, cores=runtime_cores or self.cores, memory=runtime_memory or self.memory, disk=runtime_disk or self.disk, accelerators=runtime_accelerators or self.accelerators, wdl_options=self._wdl_options)
run_job = WDLTaskJob(self._task, bindings, runtime_bindings, self._task_id, self._namespace, self._task_path, cores=runtime_cores or self.cores, memory=runtime_memory or self.memory, disk=runtime_disk or self.disk, accelerators=runtime_accelerators or self.accelerators, wdl_options=self._wdl_options)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we still need to virtualize here? What if one of the task input defaults or body decls is like File the_file = write_lines(["a"]) and then we want to use its path in the command? It needs to get shipped from the job that evaluated the decls to the job that runs the command.

Comment on lines 2062 to 2063
# Since we process nonexistent files in WDLTaskWrapperJob as those must be run locally, don't try to devirtualize them
standard_library = ToilWDLStdLibBase(file_store, enforce_nonexistence=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds like enforce_nonexistence might be the wrong name; it really is enforcing existence or detecting nonexistence as an error.

Comment on lines 3299 to 3301
def monkeypatch_coerce(standard_library: ToilWDLStdLibBase, null_nonexistent_files: bool = False) -> Generator[None, None, None]:
"""
Monkeypatch miniwdl's WDL.Value.Base.coerce() function to virtualize files when they are represented as Strings.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably should document null_nonexistent_files here. What's the default behavior for nonexistent files when it isn't set?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I forgot to remove this from an earlier implementation.

@@ -574,6 +575,7 @@ def recursive_dependencies(root: WDL.Tree.WorkflowNode) -> Set[str]:
# in the same destination directory, when dealing with basename conflicts.

TOIL_URI_SCHEME = 'toilfile:'
TOIL_NONEXISTENT_URI_SCHEME = 'nonexistent:'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should explain our clever plan of using a whole bunch of different nonexistent:blah URIs to represent distinct nonexistent files somewhere, including some kind of reference back to the WDL spec for why we need it. Maybe here?

Otherwise people will see this and scratch their heads and be tempted to simplify it down to just None or a single nonexistent sentinel value, since they won't know why that can't work.

Copy link
Member

@adamnovak adamnovak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my comments have been addressed.

Comment on lines +1445 to +1446
filename represents a URI or file name belonging to a WDL value of type value_type. work_dir represents
the current working directory of the job and is where all relative paths will be interpreted from
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These could be :param: directives.

map_over_files_in_bindings(environment, lambda x: paths.append(x))

def append_to_paths(path: str) -> Optional[str]:
# Append element and return the element. This is to avoid a logger warning inside map_over_typed_files_in_value()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should just stop issuing that warning if we're having to return values we don't care about just to avoid it?

@stxue1 stxue1 merged commit 2ba2ba7 into master Jul 19, 2024
1 check passed
@stxue1 stxue1 deleted the issues/4988-defer-virtualization-for-coerced-files branch July 19, 2024 23:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Defer file virtualization until use for WDL String-to-File coercion
2 participants