Codegen parser preserves filenames & line numbers in round trip #608

tetron · 2022-10-12T21:28:15Z

Schema salad uses the ruamel.yaml "round trip" YAML parser.

This parser preserves comments and line numbers by using ruamel.yaml.comments.CommentedMap and ruamel.yaml.comments.CommentedSeq. These objects behave like Python maps/sequences, but have an additional field lc (which stands for "line column" I think). The lc field records where the Map or Seq itself starts, as well as where each of its contained items starts. In addition, we set our own filename field to track what file an object came from.

This information is used to give better CWL errors, so that a warning or error can point at the part of the file that caused it. Specifically, look at the SourceLine class, which is used to wrap a code block such that any uncaught exceptions will be re-raised with additional line number information added to the message.
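To illustrate the idea (this is not schema-salad's actual SourceLine code), a context manager along these lines can catch an exception and re-raise it with a file/line prefix. `LineContext`, `AnnotatedError`, `check_type`, and the plain `positions` dict are hypothetical names invented for this sketch; the real class reads positions from the lc field instead.

```python
class AnnotatedError(Exception):
    """Hypothetical error type whose message carries a file:line:col prefix."""


class LineContext:
    """Hypothetical stand-in for SourceLine, for illustration only."""

    def __init__(self, filename, positions, key):
        self.filename = filename
        self.positions = positions  # key -> (line, col), zero-based
        self.key = key

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Nothing to do if the block succeeded or the error is already annotated.
        if exc_value is None or isinstance(exc_value, AnnotatedError):
            return False
        line, col = self.positions.get(self.key, (0, 0))
        # Report one-based positions, as editors do.
        prefix = f"{self.filename}:{line + 1}:{col + 1}"
        raise AnnotatedError(f"{prefix}: {exc_value}") from exc_value


def check_type(value):
    """Toy validation step that fails for non-string values."""
    if not isinstance(value, str):
        raise ValueError(f"expected a string, got {type(value).__name__}")


positions = {"baseCommand": (2, 0)}
message = None
try:
    with LineContext("tool.cwl", positions, "baseCommand"):
        check_type(42)
except AnnotatedError as err:
    message = str(err)

print(message)  # tool.cwl:3:1: expected a string, got int
```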

The purpose of Schema salad is to validate documents against a schema. The primary user is CWL, but Schema salad is intended to be general purpose.

Schema salad supports two ways of parsing and validating documents. The original way is to load the schema into a data structure and then use the ref_resolver.Loader.resolve_all followed by validate.validate methods. The newer way is to generate Python code from the schema which implements the same logic. The benefit of the code generation approach is that the resulting parser is much, much faster.

However, if you want to "round trip" a CWL document by using the codegen parser (which is based on loading records into objects), then exporting it back to maps and sequences, you lose the line number information.

For this project, we want to preserve the line number and filename information so that if you re-export the document (using save()) it preserves, as best as possible, the original line/column and filename annotations for use by CWL. As a stretch goal, it would also be neat if it preserved the YAML comments (which are also recorded by the "CommentedMap" / "CommentedSeq" classes), so that output from the ruamel round trip exporter would include all the comments from the original document.

The code generator can be found in python_codegen.py. The parsers are ultimately released in the cwl-utils project. Here's how the CWL parsers are generated:

https://github.com/common-workflow-language/cwl-utils#development

We're currently retaining the original CommentedMap in the _doc field but not doing anything with it, so one approach is to have the save() method use the annotations from _doc to annotate objects that are returned. Among other things, you'll need to return CommentedMap and CommentedSeq instead of Dict and List.

mr-c added the help wanted label Oct 13, 2022
