[License](LICENSE) · [Python](https://www.python.org/) · [PyTorch](https://pytorch.org/)

Generate Explanations for BERT Predictions on Structured Electronic Health Record Data
======================================================================

Recent breakthroughs in large language models are increasingly being applied to structured electronic health records (EHR).

**To explain clinical BERT model predictions, we present an approach that leverages integrated gradients to attribute an outcome prediction to the events in a patient's medical record that drive it.**
<div style="display: flex; justify-content: space-around; align-items: center;">
  <img src="figures/patient_explained.png" alt="Patient Explained" style="width: 30%;"/>
  <img src="figures/lab_markers_explained.png" alt="Lab Markers Explained" style="width: 30%;"/>
  <img src="figures/clinical_tokens_explained.png" alt="Clinical Tokens Explained" style="width: 30%;"/>
</div>

The explainability approach we have developed can be applied to many diseases and prediction tasks using language models trained on structured electronic health records.
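
For orientation, the core attribution step can be sketched with Captum's `LayerIntegratedGradients`, the same machinery that underlies `transformers-interpret`. Everything below (the checkpoint path, the token strings, the all-padding baseline, and the positive-class logit target) is an illustrative assumption rather than the exact code in this repository:

```
import torch
from captum.attr import LayerIntegratedGradients
from transformers import BertForSequenceClassification, BertTokenizerFast

# Hypothetical fine-tuned checkpoint; the real model is trained on EHR event tokens.
model = BertForSequenceClassification.from_pretrained("./bert_finetuning_asthma_model")
tokenizer = BertTokenizerFast.from_pretrained("./bert_finetuning_asthma_model")
model.eval()

def forward_logits(input_ids, attention_mask):
    # Attribute the logit of the positive class (index 1).
    return model(input_ids=input_ids, attention_mask=attention_mask).logits[:, 1]

# Encode one patient's event sequence (made-up tokens for illustration).
encoding = tokenizer(
    ["AGE_65 GENDER_F DX_J45 LAB_EOSINOPHILS_P90"],
    return_tensors="pt", padding="max_length", truncation=True, max_length=512,
)
input_ids, attention_mask = encoding["input_ids"], encoding["attention_mask"]

# A simple all-[PAD] reference sequence serves as the baseline in this sketch.
baseline_ids = torch.full_like(input_ids, tokenizer.pad_token_id)

lig = LayerIntegratedGradients(forward_logits, model.bert.embeddings)
attributions, delta = lig.attribute(
    inputs=input_ids,
    baselines=baseline_ids,
    additional_forward_args=(attention_mask,),
    internal_batch_size=16,
    return_convergence_delta=True,
)
# Sum over the embedding dimension to get one attribution score per token position.
token_scores = attributions.sum(dim=-1).squeeze(0)
```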

_ℹ️ This repository was created to complement the manuscript "Predicting Progression and Key Drivers of Asthma with a Clinical BERT model and Integrated Gradients", which is available here: [coming soon]()_

## [Pre-requisite] Training a MEDBERT Model
The explainability pipeline requires a BERT-based model trained on structured EHR data and fine-tuned for the specific disease prediction task.
The pre-training procedure most closely follows the method described in the [TransformEHR](https://www.nature.com/articles/s41467-023-43715-z) paper,
and the fine-tuning is primarily based on the approach used in the [Med-BERT](https://doi.org/10.1038/s41746-021-00455-y) model.
We also provide a sample input file and a sample output file in `data/dummy_data.parquet` and `output/output.parquet`, respectively; these demonstrate the format you can expect if you choose to adopt this code. The `output.parquet` dataset can be further post-processed to aggregate the top tokens for explainability.
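
For rough orientation only, fine-tuning a pre-trained EHR BERT checkpoint for a binary prediction head could look like the following with the Hugging Face `transformers` API. The checkpoint path and the tiny synthetic dataset are placeholders; the actual pre-training and fine-tuning code for the manuscript is not part of this repository:

```
from datasets import Dataset
from transformers import BertForSequenceClassification, Trainer, TrainingArguments

# Placeholder: a MEDBERT-style checkpoint pre-trained on structured EHR token sequences.
model = BertForSequenceClassification.from_pretrained("./medbert_pretrained", num_labels=2)

# Tiny synthetic stand-in for a tokenized EHR dataset (input_ids / attention_mask / labels).
dummy = {"input_ids": [[2, 5, 7, 3] * 8], "attention_mask": [[1] * 32], "labels": [1]}
train_ds = eval_ds = Dataset.from_dict(dummy)

args = TrainingArguments(
    output_dir="./bert_finetuning_asthma_model",
    per_device_train_batch_size=32,
    num_train_epochs=3,
    learning_rate=2e-5,
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()
trainer.save_model()
```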

## Install
```
# Install conda environment
conda create -n bert-explainability python=3.10 -y
conda activate bert-explainability

# Install dependencies
pip install -r requirements.txt
export PYTHONPATH=$PYTHONPATH:./src:./transformers_interpret
```

## Sample Script for Running the Pipeline
```
python3 -m src.explainability.explainability './config/explainability_asthma.yaml' './bert_finetuning_asthma_model.tar.gz' './data/' './output/'
```
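
The four positional arguments are, in order, the explainability config, the fine-tuned model archive, the input data directory, and the output directory.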

## Input data format
A small dummy dataset is provided in `data/*`, and a walkthrough is available in `notebooks/example_walkthrough.ipynb`. The input data is expected to be Parquet files, stored locally or on S3, with the following schema: person_id (int), sorted_event_tokens (array<string>), day_position_tokens (array<int>), plus a label column. A sketch of writing a conforming file follows the field descriptions below.
- person_id: A unique identifier for each individual.
- day_position_tokens: An array representing the relative time (in days) of events, with 0 indicating demographic tokens.
- sorted_event_tokens: A list of event codes associated with the individual. Each event corresponds to the relative date indicated by its index in the day_position_tokens array.
  - The first five tokens are always assumed to be demographic tokens, in the order of age, ethnicity, gender, race, and region.
- label: Label for the patient for a specific prediction task.
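
As a sketch of this schema, a conforming Parquet file could be written with pandas as shown here; the token strings and naming conventions are made up for illustration:

```
import pandas as pd

# One synthetic patient. The first five tokens are demographic tokens
# (age, ethnicity, gender, race, region), followed by dated clinical events.
row = {
    "person_id": 1,
    "sorted_event_tokens": [
        "AGE_65", "ETHNICITY_NOT_HISPANIC", "GENDER_F", "RACE_WHITE", "REGION_NE",
        "DX_J45", "LAB_EOSINOPHILS_P90", "RX_ALBUTEROL",
    ],
    # 0 marks demographic tokens; later values are days relative to the start of the record.
    "day_position_tokens": [0, 0, 0, 0, 0, 12, 12, 40],
    "label": 1,
}

df = pd.DataFrame([row])
df.to_parquet("data/dummy_data.parquet", index=False)
```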

## Config
The config can be found in `config/explainability_asthma.yaml` and can be modified for multiclass prediction.
It contains the following parameters; a minimal loading sketch follows the list:
- model_max_len: Maximum token length for the model (e.g. 512).
- training_label: Name of the label column.
- internal_batch_size: Batch size used internally when computing attributions.
- demographic_token_starters: Prefixes of tokens that belong to demographic categories.
- avg_token_type_baseline: A boolean flag that determines whether the baseline for lab test tokens with percentile information is the average percentile (True) or the 5th percentile (False).

If doing multiclass predictions:
- num_labels: Number of classes.
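
A minimal sketch of what such a config might contain and how it can be loaded; the values shown are illustrative, not the shipped defaults:

```
import yaml

example_config = """
model_max_len: 512
training_label: label
internal_batch_size: 16
demographic_token_starters: ["AGE_", "ETHNICITY_", "GENDER_", "RACE_", "REGION_"]
avg_token_type_baseline: true
# Only needed for multiclass prediction:
# num_labels: 3
"""

config = yaml.safe_load(example_config)
print(config["model_max_len"], config["training_label"])
```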

## Output data format
The output is stored as an `output.parquet` file in the output directory specified on the command line.
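
The exact output schema depends on the pipeline configuration, so the column names used below (`token`, `attribution`) are assumptions for illustration only. A typical post-processing step aggregates per-token attributions into the top drivers across patients:

```
import pandas as pd

df = pd.read_parquet("output/output.parquet")

# Hypothetical columns: one attribution score per (person_id, token) pair.
top_tokens = (
    df.groupby("token")["attribution"]
    .mean()
    .sort_values(ascending=False)
    .head(20)
)
print(top_tokens)
```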

## Contacts
For any inquiries, please raise a GitHub issue and we will try to follow up in a timely manner.

## License
This work is available for academic research and non-commercial use only. See the _LICENSE_ file for details.

## Acknowledgements
This package utilizes functions from [transformers-interpret](https://github.com/cdpierse/transformers-interpret). All utilized functions are located in the `transformers_interpret/` subdirectory and are licensed under the Apache License Version 2.0.