Predicting drug resistance using Machine Learning
I've documented some ideas here:
https://github.com/abhi18av/drug-resistance-prediction-cambiohack/projects/1
- Download and prepare the variant calling and drug resistance results - DONE
- Download all VCF files for samples - DONE
- Download results of tb-profiler for these samples - DONE
- Synchronize these files by their common genome IDs - DONE
- Extract the SNPs from the synced VCFs - DONE
- Extract the resistance- and lineage-related fields from the synced tb-profiler results - DONE
- Merge the filtered SNPs from the VCF files - DONE
You can get the results of this stage through this link
https://1drv.ms/u/s!AtDyzJXLzSCVgaBRAOeffZf3Zi6QtA?e=bwp8P5
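
As a rough illustration of the synchronization and SNP-filtering steps above, here is a minimal Python sketch using only the standard library. It is not the code used in this repo. The file naming (`<sample_id>.vcf.gz`, `<sample_id>.results.json`), the sample IDs, the helper names and the toy VCF records are all assumptions made for the example.

```python
from pathlib import Path


def common_sample_ids(vcf_names, profiler_names):
    """Return the sample IDs present in both file lists, sorted.

    Assumes (hypothetically) that the sample ID is the part of the
    file name before the first dot.
    """
    vcf_ids = {Path(n).name.split(".")[0] for n in vcf_names}
    profiler_ids = {Path(n).name.split(".")[0] for n in profiler_names}
    return sorted(vcf_ids & profiler_ids)


def is_snp(ref, alt):
    """True when REF and every ALT allele are a single base."""
    bases = {"A", "C", "G", "T"}
    return ref in bases and all(a in bases for a in alt.split(","))


def filter_snps(vcf_lines):
    """Keep header lines and SNP records; drop indels and other variants."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        # VCF columns: CHROM, POS, ID, REF, ALT, ...
        if len(fields) >= 5 and is_snp(fields[3], fields[4]):
            kept.append(line)
    return kept


# Toy inputs (made up for illustration)
vcf_files = ["S1.vcf.gz", "S2.vcf.gz", "S3.vcf.gz"]
profiler_files = ["S2.results.json", "S3.results.json", "S4.results.json"]
shared = common_sample_ids(vcf_files, profiler_files)

toy_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "Chromosome\t100\t.\tC\tT\t50\tPASS\t.",    # SNP, kept
    "Chromosome\t200\t.\tGA\tG\t50\tPASS\t.",   # deletion, dropped
    "Chromosome\t300\t.\tA\tG,T\t50\tPASS\t.",  # multi-allelic SNP, kept
]
snp_only = filter_snps(toy_vcf)
print(shared, len(snp_only))
```

For real data a dedicated VCF tool or parser would likely be more robust than splitting lines by hand; the sketch is only meant to show the logic.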
- Do feature engineering to obtain a format suitable for machine learning
- Split the final dataset into training and test sets (70/30 split)
- Train the Random Forest algorithm on the training dataset
- Check the model's performance using the AUC metric
- Iterate on steps 7 - 10 until satisfactory results are achieved
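
The remaining steps could be sketched roughly as follows, assuming scikit-learn and NumPy are available. The data here is synthetic and only stands in for the merged SNP table: the sample names, SNP positions and the rule that generates the labels are all made up, and the binary presence/absence encoding in `to_matrix` (a hypothetical helper) is just one possible feature engineering choice.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def to_matrix(sample_snps):
    """Encode {sample: set of SNP positions} as a binary matrix.

    One row per sample, one column per SNP position; 1 = SNP present.
    """
    positions = sorted({p for snps in sample_snps.values() for p in snps})
    col = {p: j for j, p in enumerate(positions)}
    X = np.zeros((len(sample_snps), len(positions)), dtype=np.int8)
    for i, snps in enumerate(sample_snps.values()):
        for p in snps:
            X[i, col[p]] = 1
    return X, positions


# Synthetic data: 200 samples, each carrying a random subset of 50 positions
rng = np.random.default_rng(0)
all_positions = np.arange(1, 51) * 100
sample_snps = {
    f"S{i}": set(all_positions[rng.random(50) < 0.5].tolist())
    for i in range(200)
}
X, positions = to_matrix(sample_snps)

# Synthetic label: "resistant" if the sample carries either of two chosen SNPs
y = np.array([int(400 in s or 1800 in s) for s in sample_snps.values()])

# 70/30 train/test split, stratified so both classes appear in each part
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# AUC is computed from predicted probabilities of the positive class
proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, proba)
print(f"test AUC: {auc:.3f}")
```

With real data there would be one label vector per drug, taken from the tb-profiler results, and the iteration step would mean revisiting the encoding and the model hyperparameters until the AUC is acceptable.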