This repository contains:
- feature_selection_experiments/run_experiment_feature_selection.py: Calculates feature importances for a single data split using Permutation Importance (PI)
- data/output_data/fe_results/agg.py: Calculates the aggregated feature importances across all splits
- feature_selection_experiments/run_experiments_baselines.py: Evaluates the features selected with the PI method and the VISAGE Enhanced Tool using cross-validation for TabPFN and NaiveBayes
- feature_selection_experiments/run_experiments_allele_method.py: Code for feature selection based on the allele frequency method and evaluation of the results
- data/input_data/full_data.csv
- data/output_data: Results from our experiments
- evaluation/run_plotting.py: Code for plotting the mean ROC AUC, accuracy and log loss from the cross-validation results
- evaluation/confusion_matrix.py: Code for plotting the confusion matrix
- streamlit/Streamlit.py: Local graphical user interface for feature selection with the PI method and for evaluation
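To illustrate the idea behind the per-split importances and their aggregation, here is a minimal standard-library sketch. It is not the repository's code: the toy data, the threshold `predict` function, and the helper names `accuracy` and `permutation_importance` are assumptions made for illustration only. The importance of a feature is taken as the mean drop in accuracy when that feature's column is shuffled, and the aggregate is the mean over splits.

```python
import random
from statistics import mean


def accuracy(predict, rows, labels):
    """Fraction of rows for which the prediction matches the label."""
    return mean(1.0 if predict(r) == y else 0.0 for r, y in zip(rows, labels))


def permutation_importance(predict, rows, labels, n_repeats=5, seed=0):
    """Mean drop in accuracy per feature when that feature's column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, shuffled, labels))
        importances.append(mean(drops))
    return importances


# Toy data (assumption): feature 0 determines the label, feature 1 is noise.
data_rng = random.Random(42)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
labels = [int(r[0] > 0.5) for r in rows]


def predict(row):
    """Hypothetical stand-in for a fitted model; it only looks at feature 0."""
    return int(row[0] > 0.5)


# One importance vector per "split" (here: a random subset of 100 rows).
per_split = []
for split_seed in range(3):
    idx = random.Random(split_seed).sample(range(len(rows)), 100)
    per_split.append(
        permutation_importance(
            predict, [rows[i] for i in idx], [labels[i] for i in idx], seed=split_seed
        )
    )

# Aggregate: mean importance of each feature across all splits.
aggregated = [mean(values) for values in zip(*per_split)]
print(aggregated)
```

In this sketch the noise feature gets an importance of exactly zero because the stand-in model never reads it, while shuffling feature 0 lowers accuracy noticeably. A real model would typically assign small non-zero values to uninformative features.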
We recommend using uv with Python 3.11 on a Linux OS. The steps below already integrate this into the
installation process.
#!/bin/bash
# Install uv globally (only needed once)
pip install uv
# Create a new virtual environment with Python 3.11
# You can change the path if you prefer a different location
uv venv --seed --python 3.11 ~/.venvs/feature_selection_BGA
# Activate the environment
source ~/.venvs/feature_selection_BGA/bin/activate
# Ensure uv is available inside the environment
pip install uv
# Install project dependencies from requirements.txt
uv pip install -r requirements.txt
# Install Ruff for linting and formatting
uv pip install ruff
source /work/dlclarge2/purucker-tabarena/venvs/fe/bin/activate && cd /work/dlclarge2/purucker-tabarena/code/BGA_Classification/feature_selection_experiments
sbatch --array=0-59%100 submit_gpu.sh
1. feature_selection_experiments/run_experiment_feature_selection.py: Calculates the feature importances for one split (we used 60 splits overall).
2. data/output_data/fe_results/agg.py: Calculates the aggregated feature importance from the results of step 1.
3. feature_selection_experiments/run_experiments_baselines.py: Evaluates the selected features with cross-validation.
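Slurm exposes the index of each array task in the `SLURM_ARRAY_TASK_ID` environment variable, so an array of 0-59 lines up naturally with 60 splits. The sketch below shows one way a script could turn that variable into a split index. The function name `split_index_from_env` and the fallback to split 0 outside Slurm are assumptions for illustration; the repository's scripts may select the split differently.

```python
import os


def split_index_from_env(environ=os.environ, n_splits=60):
    """Return the split index for this array task (0 when run outside Slurm)."""
    index = int(environ.get("SLURM_ARRAY_TASK_ID", "0"))
    if not 0 <= index < n_splits:
        raise ValueError(f"split index {index} is outside 0..{n_splits - 1}")
    return index


if __name__ == "__main__":
    print(f"Processing split {split_index_from_env()}")
```

Passing the environment as a parameter keeps the function easy to check locally without submitting a job.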
streamlit run streamlit/Streamlit.py
- 1-s2.0-S1872497323000285-mmc5_EUR.csv: Example test and training data. This is a subset of the data published as supplementary material (1-s2.0-S1872497323000285-mmc5.xlsx) of Ruiz-Ramirez et al.
- The other data is from the 1000 Genomes Project.
Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project-ID 499552394 – SFB 1597.