harpak-lab/ancestry-matching

Ancestry matching

Scripts for building cohorts of similar genetic ancestry across biobanks

Repository structure

project-recruitment
│
│-- README.md
│
│-- ancestry-matching
│   │-- pca-classification.py
│   │-- ancestry-matching.py
│   │-- aou-ancestry-matching.breakdown.pdf
│   │-- aou-ancestry-matching.pc1x2.pdf

Motivation

Ancestry matching with principal components (PCs) computed from genotype data lets us identify sample sets that share a similar genetic background across biobanks, here UK Biobank (UKB) and All of Us (AoU). Matching is based on genetic similarity rather than racial categories, per this report from The National Academies of Sciences, Engineering, and Medicine. Self-reported ethnicities are used only for validation: we check that the UKB and AoU samples have largely the same majority self-reported ethnicity (in our case 'White-British' and 'White', respectively). They are not used to filter individuals for inclusion in the matched AoU sample.

Steps

  1. Project both the UKB and AoU datasets into the GBMI (Global Biobank Meta-analysis Initiative) PC space. Instructions here (requires PLINK). Note that UKB uses reference genome GRCh37, whereas AoU uses GRCh38.
  2. Train a machine learning model to classify individuals as either ancestry-matched to your UKB sample (1) or not ancestry-matched (0), using the --train option of the pca-classification.py script.
    Input is a file with the PCA projections for the UKB individuals plus each individual's classification as in or out of the sample you are matching to.
    Output is the pickle files for the trained models.
  3. Use the best model from step 2 to classify the AoU individuals, using the --predict option of the pca-classification.py script.
    Inputs are the PCA projections for the AoU individuals and the pickle file of the model you want to use for prediction.
    Output is the classifications for the AoU individuals.
  4. Validate that the ancestry matching performed as expected (i.e., is largely in line with self-reported ethnicities) using the ancestry-matching.py script.
    Inputs are the PC projections (from step 1) and the classifications (from step 3) for the AoU sample.
    Outputs are a figure showing AoU PC1 vs. PC2 colored by self-reported ethnicity (e.g. aou-ancestry-matching.pc1x2.pdf) and a stacked bar plot of self-reported ethnicity vs. classification by the ML model (e.g. aou-ancestry-matching.breakdown.pdf).
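
Example

The train, predict, and validate flow in steps 2 to 4 can be sketched roughly as follows. This is an illustration only, not the code in pca-classification.py or ancestry-matching.py: the data are simulated, the function names (train_logistic, predict, simulate) are hypothetical, and a small pure-Python logistic regression stands in for whichever models the scripts actually train. A real pipeline would more likely use an established machine learning library.

```python
import math
import pickle
import random
from collections import Counter


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic(pcs, labels, lr=0.5, epochs=300):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    n_features = len(pcs[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(pcs, labels):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / len(pcs) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(pcs)
    return weights, bias


def predict(model, pcs, threshold=0.5):
    """Return 1 (ancestry-matched) or 0 (not matched) for each row of PCs."""
    weights, bias = model
    return [
        int(sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= threshold)
        for x in pcs
    ]


def simulate(rng, n, centre, spread=0.5):
    # Hypothetical stand-in for projected PCs: points scattered around a centre.
    return [[rng.gauss(c, spread) for c in centre] for _ in range(n)]


rng = random.Random(0)

# Step 2: simulated "UKB" PC1/PC2 values with in-sample (1) / out-of-sample (0) labels.
ukb_pcs = simulate(rng, 200, (2.0, 2.0)) + simulate(rng, 200, (-2.0, -2.0))
ukb_labels = [1] * 200 + [0] * 200
model_bytes = pickle.dumps(train_logistic(ukb_pcs, ukb_labels))  # stand-in for the pickle file

# Step 3: load the model and classify simulated "AoU" individuals in the same PC space.
aou_pcs = simulate(rng, 100, (2.0, 2.0)) + simulate(rng, 100, (-2.0, -2.0))
aou_self_report = ["White"] * 100 + ["Other"] * 100  # simulated labels, used for validation only
aou_pred = predict(pickle.loads(model_bytes), aou_pcs)

# Step 4: cross-tabulate self-report against classification (the counts behind a stacked bar plot).
breakdown = Counter(zip(aou_self_report, aou_pred))
for (group, cls), count in sorted(breakdown.items()):
    print(f"{group}\tclass={cls}\t{count}")
```

Note that the self-reported labels enter only in the final cross-tabulation. They are never passed to the classifier, which mirrors the point made under Motivation.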

About

Regarding confounding in biobanks based on recruitment methodology
