@d-cameron
Contributor

Addresses concerns raised in #784

@github-actions

github-actions bot commented Sep 9, 2025

Changed PDFs as of 5132c8b: VCFv4.5 (diff).

##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of \begin{environment-name}
Samples
\end{environment-name} With Data">
Member

Unintended \begin{environment-name} … edit here?

@jmarshall jmarshall added the vcf label Sep 9, 2025
It is recommended that VCFv4.5 files include END unless that VCF contains any record that could be misinterpreted by the presence of END.
That is, if there exists a sample or allele in which the END computed for that SVLEN or FORMAT LEN does not equal the maximum END, then no END should be present in any record that VCF.
This approachs maintains backwards compatibility for unproblematic VCFs while attempting to minimise the probability of downstream data errors by making problematic records not valid for earlier versions of VCF (END was required for $<$*$>$ symbolic alleles).
Contributor

"approachs" should be "approaches".

Those same tools will incorrectly interpret the size of the smaller symbolic structural variants and $<$*$>$ symbolic alleles when END is present.
It is recommended that VCFv4.5 files include END unless that VCF contains any record that could be misinterpreted by the presence of END.
That is, if there exists a sample or allele in which the END computed for that SVLEN or FORMAT LEN does not equal the maximum END, then no END should be present in any record that VCF.
Member

I find the current wording confusing. May I suggest rephrasing along the following lines:

  • Clarify that END is a derived field. If it is absent, it can be computed in such and such way.
    (Therefore, not deprecated. Using the term deprecated raises unnecessary doubt: should newly written software still support END? The answer is yes, it must remain supported. So it’s better to avoid language that implies otherwise.)

  • Clarify the handling of inconsistencies. I do not fully understand what the other paragraphs are trying to convey. My interpretation is that they intend to describe what happens if END is computed incorrectly or conflicts with the primary information. Practically speaking, the responsibility lies with the producer to ensure consistency, and each program may choose how to handle discrepancies. If an analysis relies on the END tag, it will not recompute it from the primary fields (otherwise we would not need END in the first place). Conversely, if an analysis works directly from the primary fields, it is expected to ignore END, since END is derived.

  • Clarify the comparison of END and LEN. If a comparison between END and LEN is important, the text should explain explicitly in what ways the two differ and in what ways they are equivalent. Although I am fairly familiar with the VCF format, the current paragraph did not make this distinction clear.
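
To make the "derived field" reading concrete, here is a minimal Python sketch of how END could be computed from the primary fields, and of the all-or-nothing rule quoted in the diff above. Everything in it is an assumption for illustration: the `Record` type and the function names are hypothetical, the formulas (`POS + len(REF) - 1`, `POS + SVLEN`, `POS + LEN - 1`) should be checked against the spec text under review, SVLEN is assumed to be a non-negative count for reference-spanning alleles, and insertions are not handled.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Record:
    """Hypothetical minimal record for illustration; not a real VCF library type."""
    pos: int                                                   # 1-based POS
    ref: str                                                   # REF allele bases
    svlens: List[Optional[int]] = field(default_factory=list)  # INFO SVLEN per ALT, None if missing
    lens: List[Optional[int]] = field(default_factory=list)    # FORMAT LEN per sample, None if missing


def implied_ends(rec: Record) -> List[int]:
    """End positions implied by each SVLEN and each FORMAT LEN (assumed formulas)."""
    sv_ends = [rec.pos + n for n in rec.svlens if n is not None]
    len_ends = [rec.pos + n - 1 for n in rec.lens if n is not None]
    return sv_ends + len_ends


def record_end(rec: Record) -> int:
    """Derived END: the maximum over the REF span and every implied end."""
    ref_end = rec.pos + len(rec.ref) - 1
    return max([ref_end] + implied_ends(rec))


def end_is_unambiguous(rec: Record) -> bool:
    """True when every SVLEN- or LEN-implied end equals the record's maximum end."""
    target = record_end(rec)
    return all(end == target for end in implied_ends(rec))


def should_write_end(records: List[Record]) -> bool:
    """Write END only if no record in the whole file could be misread because of it."""
    return all(end_is_unambiguous(rec) for rec in records)
```

Under these assumptions, a single deletion with `pos=100`, `ref="A"`, `svlens=[50]` keeps END, while a multi-allelic record with `svlens=[50, 20]` implies two different ends and would cause END to be omitted from every record in that file.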


5 participants