Clarified INFO END deprecation status #844
| ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data"> | ||
| ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of \begin{environment-name} | ||
| Samples | ||
| \end{environment-name} With Data"> |
Unintended \begin{environment-name} … edit here?
| It is recommended that VCFv4.5 files include END unless that VCF contains any record that could be misinterpreted by the presence of END. | ||
| That is, if there exists a sample or allele in which the END computed for that SVLEN or FORMAT LEN does not equal the maximum END, then no END should be present in any record that VCF. | ||
| This approachs maintains backwards compatibility for unproblematic VCFs while attempting to minimise the probability of downstream data errors by making problematic records not valid for earlier versions of VCF (END was required for $<$*$>$ symbolic alleles). |
"approachs" should be "approach".
| Those same tools will incorrectly interpret the size of the smaller symbolic structural variants and $<$*$>$ symbolic alleles when END is present. | ||
| It is recommended that VCFv4.5 files include END unless that VCF contains any record that could be misinterpreted by the presence of END. | ||
| That is, if there exists a sample or allele in which the END computed for that SVLEN or FORMAT LEN does not equal the maximum END, then no END should be present in any record that VCF. |
I find the current wording confusing. May I suggest rephrasing along the following lines:
- Clarify that END is a derived field. If it is absent, it can be computed in such and such way. (Therefore, not deprecated. Using the term deprecated raises unnecessary doubt: should newly written software still support END? The answer is yes, it must remain supported. So it’s better to avoid language that implies otherwise.)
- Clarify the handling of inconsistencies. I do not fully understand what the other paragraphs are trying to convey. My interpretation is that they intend to describe what happens if END is computed incorrectly or conflicts with the primary information. Practically speaking, the responsibility lies with the producer to ensure consistency, and each program may choose how to handle discrepancies. If an analysis relies on the END tag, it will not recompute it from the primary fields (otherwise we would not need END in the first place). Conversely, if an analysis works directly from the primary fields, it is expected to ignore END, since END is derived.
- Clarify the comparison of END and LEN. If a comparison between END and LEN is important, the text should explain explicitly in what ways the two differ and in what ways they are equivalent. Although I am fairly familiar with the VCF format, the current paragraph did not make this distinction clear.
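To make the "derived field" point concrete, here is a minimal Python sketch of how END might be computed from the primary fields and how the "omit END if any allele or sample disagrees with the maximum" rule from the diff could be checked. The function names are hypothetical, and the per-allele formulas are my reading of the spec text rather than a quotation of it: `POS + len(REF) - 1` for non-symbolic and `<INS>` alleles, `POS + SVLEN` for `<DEL>`, `<DUP>`, `<INV>` and `<CNV>`, and, as an assumption, `POS + LEN - 1` for `<*>`.

```python
from typing import Optional, Sequence

# Symbolic SV types whose END is assumed to be POS + SVLEN.
SVLEN_TYPES = {"DEL", "DUP", "INV", "CNV"}


def sv_type(alt: str) -> Optional[str]:
    """Return the top-level symbolic type, e.g. "<DUP:TANDEM>" -> "DUP"."""
    if alt.startswith("<") and alt.endswith(">"):
        return alt[1:-1].split(":")[0]
    return None  # non-symbolic allele


def allele_end(pos: int, ref: str, alt: str, svlen: Optional[int] = None) -> int:
    """END implied by one ALT allele (1-based, inclusive)."""
    if sv_type(alt) in SVLEN_TYPES and svlen is not None:
        # abs() tolerates negative SVLEN as written by some older files.
        return pos + abs(svlen)
    # Non-symbolic alleles and <INS> span only the REF bases.
    return pos + len(ref) - 1


def star_allele_end(pos: int, length: int) -> int:
    """END implied by FORMAT LEN for a <*> allele (assumed POS + LEN - 1)."""
    return pos + length - 1


def end_is_unambiguous(computed_ends: Sequence[int]) -> bool:
    """True if every SVLEN/LEN-derived END equals the maximum END."""
    return len(set(computed_ends)) <= 1
```

Under this reading, a producer would write END only when `end_is_unambiguous` holds for every record in the file, and a consumer that works from SVLEN or LEN would never need to read END at all.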
Addresses concerns raised in #784