38 changes: 30 additions & 8 deletions VCFv4.5.tex
@@ -46,7 +46,9 @@ \subsection{An example}
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of \begin{environment-name}
Samples
\end{environment-name} With Data">
Member

Unintended \begin{environment-name} … edit here?

##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
@@ -468,7 +470,7 @@ \subsubsection{Fixed fields}
CIGAR & A & String & Cigar string describing how to align an alternate allele to the reference allele \\
DB & 0 & Flag & dbSNP membership \\
DP & 1 & Integer & Combined depth across samples \\
END & 1 & Integer & Deprecated. Present for backwards compatibility with earlier versions of VCF. \\
END & 1 & Integer & End position of the longest variant described in this record \\
H2 & 0 & Flag & HapMap2 membership \\
H3 & 0 & Flag & HapMap3 membership \\
MQ & 1 & Float & RMS mapping quality \\
@@ -482,15 +484,25 @@ \subsubsection{Fixed fields}

\begin{itemize}
\renewcommand{\labelitemii}{$\circ$}
\item END: Deprecated.
Retained for backwards compatibility with earlier versions of VCF and older VCF indexing software which rely on this field being present.
\item END: End position of the longest variant described in this record

This is a computed field that, when present, must be set to the maximum end reference position (1-based) of:
the position of the final base of the REF allele,
the end position corresponding to the SVLEN of a symbolic SV allele,
and the end positions calculated from FORMAT LEN for the $<$*$>$ symbolic allele.

The computed value of this field is used to compute BCF's {\tt rlen} field (see~\ref{BcfSiteEncoding}) and is important when indexing VCF/BCF files to enable random access and querying by position.
The computed value of this field is used to compute BCF's {\tt rlen} field (see~\ref{BcfSiteEncoding}).

Whilst technically deprecated (INFO SVLEN and FORMAT LEN are the authoritative fields), END remains important for backwards compatibility.

Unfortunately, the introduction of FORMAT LEN is not fully backwards compatible with END.
END is used for VCF indexing and a large ecosystem of pre-VCFv4.5 tools rely on END being present.
Those same tools will incorrectly interpret the size of the smaller symbolic structural variants and $<$*$>$ symbolic alleles when END is present.

It is recommended that VCFv4.5 files include END unless that VCF contains any record that could be misinterpreted by the presence of END.
That is, if there exists a sample or allele for which the END computed from its SVLEN or FORMAT LEN does not equal the maximum END, then no END should be present in any record of that VCF.
Member

I find the current wording confusing. May I suggest rephrasing along the following lines:

  • Clarify that END is a derived field. If it is absent, it can be computed in such and such way.
    (Therefore, not deprecated. Using the term deprecated raises unnecessary doubt: should newly written software still support END? The answer is yes, it must remain supported. So it’s better to avoid language that implies otherwise.)

  • Clarify the handling of inconsistencies. I do not fully understand what the other paragraphs are trying to convey. My interpretation is that they intend to describe what happens if END is computed incorrectly or conflicts with the primary information. Practically speaking, the responsibility lies with the producer to ensure consistency, and each program may choose how to handle discrepancies. If an analysis relies on the END tag, it will not recompute it from the primary fields (otherwise we would not need END in the first place). Conversely, if an analysis works directly from the primary fields, it is expected to ignore END, since END is derived.

  • Clarify the comparison of END and LEN. If a comparison between END and LEN is important, the text should explain explicitly in what ways the two differ and in what ways they are equivalent. Although I am fairly familiar with the VCF format, the current paragraph did not make this distinction clear.

This approachs maintains backwards compatibility for unproblematic VCFs while attempting to minimise the probability of downstream data errors by making problematic records not valid for earlier versions of VCF (END was required for $<$*$>$ symbolic alleles).
Contributor

"approachs" should be "approaches".



\end{itemize}
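
For readers of this hunk, the END rule proposed above can be sketched in Python. This is a minimal illustration and not part of the specification text: the function names, the tuple layout of `sv_alleles`, and the list `block_lens` are hypothetical. It also assumes that a `<DEL>`, `<DUP>`, `<INV>` or `<CNV>` allele ends at `POS + SVLEN`, that an `<INS>` allele does not extend along the reference, and that a `<*>` block with FORMAT LEN ends at `POS + LEN - 1`.

```python
from typing import Iterable, List, Tuple

# Assumption: these symbolic SV alleles span SVLEN reference bases after POS.
REF_SPANNING_SV = ("<DEL", "<DUP", "<INV", "<CNV")


def ref_end(pos: int, ref: str) -> int:
    """1-based position of the final base of the REF allele."""
    return pos + len(ref) - 1


def sv_end(pos: int, alt: str, svlen: int) -> int:
    """End implied by SVLEN; POS is the base preceding the variant."""
    if alt.startswith(REF_SPANNING_SV):
        return pos + abs(svlen)
    return pos  # assumption: e.g. <INS> covers no extra reference bases


def block_end(pos: int, length: int) -> int:
    """End of a <*> reference block described by FORMAT LEN (assumed inclusive)."""
    return pos + length - 1


def candidate_ends(pos: int,
                   sv_alleles: Iterable[Tuple[str, int]] = (),
                   block_lens: Iterable[int] = ()) -> List[int]:
    """Ends derived from SVLEN (per allele) and FORMAT LEN (per sample)."""
    ends = [sv_end(pos, alt, svlen) for alt, svlen in sv_alleles]
    ends += [block_end(pos, n) for n in block_lens]
    return ends


def compute_end(pos: int, ref: str,
                sv_alleles: Iterable[Tuple[str, int]] = (),
                block_lens: Iterable[int] = ()) -> int:
    """Maximum end position over REF, SVLEN-derived and LEN-derived ends."""
    return max([ref_end(pos, ref)] + candidate_ends(pos, sv_alleles, block_lens))


def end_is_unambiguous(pos: int, ref: str,
                       sv_alleles: Iterable[Tuple[str, int]] = (),
                       block_lens: Iterable[int] = ()) -> bool:
    """True when every SVLEN/LEN-derived end equals the record's maximum END."""
    sv_alleles, block_lens = list(sv_alleles), list(block_lens)
    end = compute_end(pos, ref, sv_alleles, block_lens)
    return all(e == end for e in candidate_ends(pos, sv_alleles, block_lens))
```

Under the recommendation in the diff, a writer would emit END only if `end_is_unambiguous` holds for every record in the file. A record whose samples carry different FORMAT LEN values fails the check, so such a file would be written without END.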

@@ -860,11 +872,11 @@ \section{INFO keys used for structural variants}
\footnotesize
\begin{verbatim}
##INFO=<ID=NOVEL,Number=0,Type=Flag,Description="Indicates a novel structural variation">
##INFO=<ID=END,Number=1,Type=Integer,Description="Deprecated. Present for backwards compatibility with earlier versions of VCF.">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the longest variant described in this record">
\end{verbatim}
\normalsize

$END$ has been deprecated in favour of INFO SVLEN and FORMAT LEN.
Refer to section \ref{Fixed fields} for the definition of END.

\footnotesize
\begin{verbatim}
@@ -891,7 +903,7 @@ \section{INFO keys used for structural variants}

For backwards compatibility, a missing SVLEN should be inferred from the $END$ field.

For backwards compatibility, the absolute value of SVLEN should be taken and a negative SVLEN should be treated as positive values.
For backwards compatibility, the absolute value of SVLEN should be taken and a negative SVLEN should be treated as a positive value.

Note that for structural variant symbolic alleles, $POS$ corresponds to the base immediately preceding the variant.
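
The two backwards-compatibility rules for SVLEN in this hunk can be sketched as follows. The helper `normalise_svlen` is hypothetical, and inferring a missing SVLEN as `END - POS` is an assumption that follows from POS being the base immediately preceding the variant. It only makes sense for alleles that span the reference, not for `<INS>`.

```python
from typing import Optional


def normalise_svlen(pos: int,
                    svlen: Optional[int],
                    end: Optional[int]) -> Optional[int]:
    """Return a non-negative SVLEN, inferring it from END when absent."""
    if svlen is not None:
        # Older files wrote negative SVLEN for deletions; treat as positive.
        return abs(svlen)
    if end is not None:
        # Assumption: POS precedes the variant, so the span is END - POS.
        return end - pos
    return None  # nothing to infer from
```

For example, a legacy deletion record with `POS=100` and `SVLEN=-50` and one with `POS=100`, `END=150` and no SVLEN both normalise to a length of 50.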

@@ -1875,6 +1887,9 @@ \subsection{Representing unspecified alleles and REF-only blocks (gVCF)}
\end{flushleft}
\normalsize

Note that usage of both FORMAT LEN and INFO END can be problematic as pre-VCFv4.5 tools will misinterpret the reference block size for records containing samples with different block sizes.
See the definition of INFO END in section \ref{Fixed fields} for recommended behaviour.

When base modification information is present in the FORMAT field of a reference block record, the base modification information applies to all applicable bases covered by that reference block.

\pagebreak
@@ -2723,6 +2738,13 @@ \subsection{BCF2 block gzip and indexing}

\section{List of changes}

\subsection{VCFv4.5 Errata}

\begin{itemize}
\item Clarified INFO END deprecation status.
\end{itemize}


\subsection{Changes between VCFv4.5 and VCFv4.4}

\begin{itemize}