Quality Metrics

Martin Maiers edited this page Jun 21, 2022 · 3 revisions

Quality metrics for haplotype frequency data

These metrics are meant to characterize data so that thresholds can be applied consistently, for example considering only data with RES_TRS_COUNT > x or only data with SAM_SIZE > y.

Many of these depend on having the genotypes, or some aspect of the genotypes, available.
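As a minimal sketch of how such thresholds might be applied, the snippet below keeps only the data sets whose metrics strictly exceed a set of minimum values. The function name `passes_thresholds`, the population names, and every number here are hypothetical and chosen only for illustration; the page itself leaves the threshold values x and y open.

```python
from typing import Mapping


def passes_thresholds(metrics: Mapping[str, float],
                      minimums: Mapping[str, float]) -> bool:
    """Return True if every metric named in `minimums` is present
    and strictly greater than its minimum value."""
    return all(
        name in metrics and metrics[name] > minimum
        for name, minimum in minimums.items()
    )


# Hypothetical thresholds and metric values, for illustration only.
minimums = {"SAM_SIZE": 1000, "RES_TRS": 0.5}
datasets = {
    "POP_A": {"SAM_SIZE": 5000, "RES_TRS": 0.8},
    "POP_B": {"SAM_SIZE": 300, "RES_TRS": 0.9},   # sample too small
    "POP_C": {"SAM_SIZE": 2000},                  # RES_TRS not reported
}

kept = [name for name, m in datasets.items() if passes_thresholds(m, minimums)]
print(kept)
```

Treating a missing metric as a failure is a design choice in this sketch: a data set that does not report a metric cannot be shown to meet the threshold.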

| QUAL_TYPE | GENOTYPE BASED | VALUE | DESCRIPTION |
|---|---|---|---|
| DIV_ALPHA | | Real, < 0 | Exponent of the power-law fit to the HTF distribution (called alpha in Slater et al., Power Laws for Heavy-Tailed Distributions) |
| DIV_50 | X | Integer | Number of haplotypes needed (in descending order of frequency) for the cumulative sum to be > 0.5 (sample-size sensitive!) |
| DIV_50_REL | | Real, 0 <= x <= 1 | Number of haplotypes needed (in descending order of frequency) for the cumulative sum to be > 0.5, divided by the number of HT |
| SAM_SIZE | X | Integer | Number of GT |
| DIV_PGD | X | Real, 0 <= x <= 1 | Population genetics diversity, (1 - sum_i f_i^2) * N/(N - 1), where N = SAM_SIZE and f_i is the frequency of a specific haplotype |
| DIV_HEAVY_TAIL | | Real, 0 <= x <= 1 | a is an independence parameter of the Bayesian SHF model that describes how allele frequency products correlate with haplotype frequencies (also correlates with the fraction of nonzero categories) – From Yoram SHF MS |
| RES_TRS_COUNT | | Real, 0 <= x <= 1 | Average number of possible genotypes per individual |
| RES_TRS | | Real, 0 <= x <= 1 | Typing Resolution Score – average sum of squares of genotype probabilities (imputation-method dependent; ideally imputed from the HF estimated) |
| RES_SHARE_AMBIG | X | Real, 0 <= x <= 1 | Fraction of GT with a lower resolution than defined in the resolution tag |
| RES_MISS_LOCI | X | Real, 0 <= x <= 1 | Fraction of GT with missing loci (separate qual_type per locus?) |
| DEV_HWE | X | Real | Deviation from HWE (using the HWE-with-ambiguity method) |
| ERR_STD | | Real, 0 <= x <= 1 | Weighted average of standard errors across all haplotypes |
| ERR_SAMP_80_100 | | Real | Laurent Excoffier “If” between frequencies derived from the 100% set and the 80% training set |
| SUM_FREQ_GAP | | Real | Sum of haplotype frequencies for unobserved haplotypes that are expected in the population by the SHF model |
| ERR_OFFSET | | Real, 0 <= x <= 1 | 1 - sum_i f_i (difference between the predicted full HF distribution using SHF versus actual including test set?) |
| LD_MEASURE | | Real | Define in Method section – (Where is LD measured for quality?) |
| KFOLD_IMPUTE | X | Real, 0 <= x <= 1 | % of imputable GT in the 20% test set from HT generated in the 80% training set |
| KFOLD_PRED_ACTUAL | X | Real, 0 <= x <= 1 | Divergence between predicted and actual, using a log loss function (for test set predictions on simulated lower-resolution typings) |
| KFOLD_N | | Integer | Number of independent training-test folds (k) |
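
The descriptions of DIV_50, DIV_50_REL, DIV_PGD, and RES_TRS above are concrete enough to sketch in code. The following is a rough illustration, not a reference implementation. The function names and input layout are assumptions: haplotype frequencies are taken as a plain list of floats, and genotype probabilities as one list per individual. The sketch also assumes that the N/(N - 1) factor in DIV_PGD multiplies the whole term (1 - sum of squared frequencies).

```python
from typing import Sequence


def div_50(freqs: Sequence[float]) -> int:
    """Count haplotypes, taken in descending order of frequency,
    until the cumulative sum first exceeds 0.5."""
    total = 0.0
    for count, f in enumerate(sorted(freqs, reverse=True), start=1):
        total += f
        if total > 0.5:
            return count
    raise ValueError("frequencies never accumulate to more than 0.5")


def div_50_rel(freqs: Sequence[float]) -> float:
    """DIV_50 divided by the number of haplotypes."""
    return div_50(freqs) / len(freqs)


def div_pgd(freqs: Sequence[float], sam_size: int) -> float:
    """(1 - sum of squared frequencies) * N / (N - 1), with N = SAM_SIZE.
    Assumes the correction factor applies to the whole bracketed term."""
    if sam_size < 2:
        raise ValueError("SAM_SIZE must be at least 2")
    return (1.0 - sum(f * f for f in freqs)) * sam_size / (sam_size - 1)


def res_trs(genotype_probs: Sequence[Sequence[float]]) -> float:
    """Average, over individuals, of the sum of squared genotype
    probabilities. One inner sequence per individual."""
    per_individual = [sum(p * p for p in probs) for probs in genotype_probs]
    return sum(per_individual) / len(per_individual)


# Made-up inputs, for illustration only.
freqs = [0.1, 0.4, 0.2, 0.3]
print(div_50(freqs))        # 0.4 + 0.3 exceeds 0.5 after two haplotypes
print(div_50_rel(freqs))
print(div_pgd(freqs, sam_size=10))
print(res_trs([[1.0], [0.5, 0.5]]))
```

An unambiguous individual (a single genotype with probability 1) contributes 1 to RES_TRS, and each additional equally likely genotype lowers the contribution, so higher values indicate higher typing resolution.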