Scripts for building cohorts of similar genetic ancestry across biobanks
project-recruitment
│
│-- README.md
│
│-- ancestry-matching
│   │-- pca-classification.py
│   │-- ancestry-matching.py
│   │-- aou-ancestry-matching.breakdown.pdf
│   │-- aou-ancestry-matching.pc1x2.pdf
Ancestry-matching using principal components (PCs) built from genotyping data allows us to identify sets of samples that share genetic background. Ancestry-matching is done using genetic similarity instead of racial categories, per this report from The National Academies of Sciences, Engineering, and Medicine. Self-reported ethnicities are used only to validate the results, that is, to check that the UKB and AoU samples have largely similar self-reported ethnicities (in our case 'White-British' and 'White'). They are not used to filter individuals for inclusion in the matched AoU sample.
1. Project both the UKB and AoU datasets into the GBMI PC space. Instructions here (requires PLINK). Note that UKB uses reference genome GRCh37 and AoU uses reference genome GRCh38.
2. Train a machine learning model to classify your individuals as either ancestry-matched to your UKB sample (1) or not ancestry-matched (0) using the `--train` function of the `pca-classification.py` script.
   Input is a file with the PCA projections for the UKB individuals plus their classification as either in or out of the sample you are matching to.
   Output is the pickle files for the trained models.
3. Use the best model from step 2 to predict the classification of the AoU individuals using the `--predict` function of the `pca-classification.py` script.
   Inputs are the PCA projections for the AoU individuals and the pickle file for the model you want to use for prediction.
   Output is the classifications for the AoU individuals.
4. Validate that the ancestry matching performed as expected (is largely in line with self-reported ethnicities) using the `ancestry-matching.py` script.
   Inputs are the PC projections (from step 1) and the classifications (from step 3) for the AoU sample.
   Output is a figure showing AoU PC1 x PC2 colored by self-reported ethnicity (e.g. `aou-ancestry-matching.pc1x2.pdf`) and a stacked bar plot of self-reported ethnicity vs. classification by the ML model (e.g. `aou-ancestry-matching.breakdown.pdf`).
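
The train, predict, and validate steps above can be sketched end to end on toy data. This is only an illustration of the data flow, not the code in `pca-classification.py` or `ancestry-matching.py`: the model type those scripts use is not stated here, so a simple nearest-centroid rule stands in for it. The function names, PC values, and ethnicity labels are all hypothetical, and the model is pickled in memory rather than written to a file.

```python
import pickle
from collections import Counter


def train_centroids(pcs, labels):
    """Fit a nearest-centroid stand-in model: the mean PC vector per class.

    labels: 1 = in the UKB sample being matched to, 0 = not in it.
    """
    sums, counts = {}, Counter()
    for vec, lab in zip(pcs, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def predict(model, pcs):
    """Assign each individual to the class whose centroid is closest."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(model, key=lambda lab: dist2(vec, model[lab])) for vec in pcs]


def breakdown(self_reported, predicted):
    """Count individuals per (self-reported ethnicity, predicted class) pair.

    These counts are the kind of table a stacked bar plot would be drawn from.
    """
    return Counter(zip(self_reported, predicted))


# Step 2 (toy): UKB projections (PC1, PC2) with in/out-of-sample labels.
ukb_pcs = [(0.10, 0.20), (0.12, 0.18), (0.90, 0.80), (0.85, 0.95)]
ukb_labels = [1, 1, 0, 0]
model_bytes = pickle.dumps(train_centroids(ukb_pcs, ukb_labels))

# Step 3 (toy): load the pickled model and classify AoU projections.
model = pickle.loads(model_bytes)
aou_pcs = [(0.11, 0.21), (0.88, 0.90), (0.15, 0.25)]
aou_pred = predict(model, aou_pcs)
print(aou_pred)  # [1, 0, 1]

# Step 4 (toy): compare predictions against made-up self-reported labels.
aou_self_reported = ["White", "Other", "White"]
table = breakdown(aou_self_reported, aou_pred)
for (ethnicity, cls), n in sorted(table.items()):
    print(ethnicity, cls, n)
```

Note that the self-reported labels enter only in the last step, as a check on the predictions; they play no part in training or in deciding who is classified as matched.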