diff --git a/VCFv4.5.tex b/VCFv4.5.tex
index bcd66c08..1b90cb55 100644
--- a/VCFv4.5.tex
+++ b/VCFv4.5.tex
@@ -46,7 +46,9 @@ \subsection{An example}
 ##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
 ##contig=
 ##phasing=partial
-##INFO=
+##INFO=
 ##INFO=
 ##INFO=
 ##INFO=
@@ -468,7 +470,7 @@ \subsubsection{Fixed fields}
 CIGAR & A & String & Cigar string describing how to align an alternate allele to the reference allele \\
 DB & 0 & Flag & dbSNP membership \\
 DP & 1 & Integer & Combined depth across samples \\
- END & 1 & Integer & Deprecated. Present for backwards compatibility with earlier versions of VCF. \\
+ END & 1 & Integer & End position of the longest variant described in this record \\
 H2 & 0 & Flag & HapMap2 membership \\
 H3 & 0 & Flag & HapMap3 membership \\
 MQ & 1 & Float & RMS mapping quality \\
@@ -482,15 +484,25 @@ \subsubsection{Fixed fields}
 \begin{itemize}
 \renewcommand{\labelitemii}{$\circ$}
-\item END: Deprecated.
-Retained for backwards compatibility with earlier versions of VCF and older VCF indexing software which rely on this field being present.
+\item END: End position of the longest variant described in this record.
 This is a computed field that, when present, must be set to the maximum end reference position (1-based) of: the position of the final base of the REF allele, the end position corresponding to the SVLEN of a symbolic SV allele, and the end positions calculated from FORMAT LEN for the $<$*$>$ symbolic allele.
-The computed value of this field is used to compute BCF's {\tt rlen} field (see~\ref{BcfSiteEncoding}) and is important when indexing VCF/BCF files to enable random access and querying by position.
+The computed value of this field is used to compute BCF's {\tt rlen} field (see~\ref{BcfSiteEncoding}).
+
+Whilst technically deprecated (INFO SVLEN and FORMAT LEN are the authoritative fields), END remains important for backwards compatibility.
+
+Unfortunately, the introduction of FORMAT LEN is not fully backwards compatible with END.
+END is used for VCF indexing and a large ecosystem of pre-VCFv4.5 tools relies on END being present.
+Those same tools will incorrectly interpret the size of the smaller symbolic structural variants and $<$*$>$ symbolic alleles when END is present.
+
+It is recommended that VCFv4.5 files include END unless that VCF contains any record that could be misinterpreted due to the presence of END.
+That is, if there exists a sample or allele in which the END computed for that SVLEN or FORMAT LEN does not equal the maximum END, then no END should be present in any record of that VCF.
+This approach maintains backwards compatibility for unproblematic VCFs while attempting to minimise the probability of downstream data errors by making problematic records not valid for earlier versions of VCF (END was required for $<$*$>$ symbolic alleles).
+
 \end{itemize}
@@ -860,11 +872,11 @@ \section{INFO keys used for structural variants}
 \footnotesize
 \begin{verbatim}
 ##INFO=
-##INFO=
+##INFO=
 \end{verbatim}
 \normalsize
-$END$ has been deprecated in favour of INFO SVLEN and FORMAT LEN.
+Refer to section \ref{Fixed fields} for the definition of END.
 \footnotesize
 \begin{verbatim}
@@ -891,7 +903,7 @@ \section{INFO keys used for structural variants}
 For backwards compatibility, a missing SVLEN should be inferred from the $END$ field.
-For backwards compatibility, the absolute value of SVLEN should be taken and a negative SVLEN should be treated as positive values.
+For backwards compatibility, the absolute value of SVLEN should be taken and a negative SVLEN should be treated as a positive value.
 Note that for structural variant symbolic alleles, $POS$ corresponds to the base immediately preceding the variant.
@@ -1875,6 +1887,9 @@ \subsection{Representing unspecified alleles and REF-only blocks (gVCF)}
 \end{flushleft}
 \normalsize
+Note that usage of both FORMAT LEN and INFO END can be problematic as pre-VCFv4.5 tools will misinterpret the reference block size for records containing samples with different block sizes.
+See the definition of INFO END in section \ref{Fixed fields} for recommended behaviour.
+
 When base modification information is present in the FORMAT field of a reference block record, the base modification information apply to all applicable bases covered by that reference block.
 \pagebreak
@@ -2723,6 +2738,13 @@ \subsection{BCF2 block gzip and indexing}
 \section{List of changes}
+\subsection{VCFv4.5 Errata}
+
+\begin{itemize}
+ \item Clarified INFO END deprecation status.
+\end{itemize}
+
+
 \subsection{Changes between VCFv4.5 and VCFv4.4}
 \begin{itemize}