127 changes: 85 additions & 42 deletions README.md
Original file line number Diff line number Diff line change
@@ -44,9 +44,37 @@ Or alternatively run this command:

Please note there is another package called spectra which is not related to this tool. Spectrae (which stands for spectral evaluation) implements the spectral framework for model evaluation.

## Definition of terms

This work and GitHub repository use terms related to the **spectral framework for model evaluation**. Below is a quick refresher on these key concepts.

### **Spectral Property**
Every dataset has an underlying property that, as it changes, causes model performance to decrease. This is referred to as the **spectral property**.

However, **not every property qualifies as a spectral property**.
For example:
- When predicting protein structure, the performance of a protein folding model does **not** change based on the number of **M** amino acids in a sequence.
- Instead, model performance **does** change based on **structural similarity**—this is an example of a **spectral property**.

### **Spectral Property Graph (SPG)**
For a given dataset, a **spectral property graph (SPG)** is defined as:
- **Nodes**: Samples in the dataset.
- **Edges**: Connections between samples that share a spectral property.

Every SPG is stored as a flattened adjacency matrix, which saves memory and allows SPECTRA to use GPUs to speed up computation.
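To illustrate why a flattened representation saves memory, here is a minimal sketch that stores only the upper triangle of a symmetric adjacency matrix in a 1D array. The helper names `flat_index` and `flatten_upper_triangle`, and the row-major upper-triangle layout, are assumptions made for illustration; the layout used by spectrae's `FlattenedAdjacency` may differ.

```python
import numpy as np

def flat_index(i: int, j: int, n: int) -> int:
    """Map an unordered pair (i, j) with i != j to its position in the flattened upper triangle."""
    if i == j:
        raise ValueError("self-edges are not stored in this layout")
    if i > j:
        i, j = j, i
    # Rows 0..i-1 contribute (n-1) + (n-2) + ... + (n-i) entries before row i starts.
    return i * (2 * n - i - 1) // 2 + (j - i - 1)

def flatten_upper_triangle(adj: np.ndarray) -> np.ndarray:
    """Keep only the entries above the diagonal, in row-major order."""
    rows, cols = np.triu_indices(adj.shape[0], k=1)
    return adj[rows, cols]

# Toy symmetric graph on 4 samples (hypothetical data).
adj = np.array([
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
])
flat = flatten_upper_triangle(adj)  # n * (n - 1) / 2 = 6 entries instead of 16
```

In this layout a lookup such as `flat[flat_index(3, 0, 4)]` returns the same value as `adj[3, 0]`, so the full matrix never has to be materialized.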

### **Spectral Parameter**
The **spectral parameter** can be thought of as a **sparsification probability**.

When SPECTRA runs on an SPG:
1. It selects a random node.
2. It decides whether to **delete edges** with a certain probability—this probability is the **spectral parameter**.
3. The closer the spectral parameter is to **1**, the **stricter** the splits generated by SPECTRA will be.
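The steps above can be sketched as a toy procedure. This is an illustration of the idea only, not SPECTRA's implementation: the function name `sparsify`, the adjacency-dictionary input, and the choice to drop a neighbouring sample (and with it the connecting edge) are all assumptions.

```python
import random

def sparsify(adjacency, spectral_parameter, seed=0):
    """Toy sketch: repeatedly keep a random node and drop each of its
    remaining neighbours with probability `spectral_parameter`."""
    rng = random.Random(seed)
    remaining = set(adjacency)
    kept = []
    while remaining:
        node = rng.choice(sorted(remaining))  # step 1: select a random node
        remaining.discard(node)
        kept.append(node)
        for neighbor in adjacency[node]:
            # step 2: delete the neighbour (and its edge) with the given probability
            if neighbor in remaining and rng.random() < spectral_parameter:
                remaining.discard(neighbor)
    return kept

# Hypothetical graph: a path 0 - 1 - 2 - 3 plus an isolated node 4.
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2], 4: []}
kept_strict = sparsify(graph, 1.0)  # no two kept samples share an edge
kept_loose = sparsify(graph, 0.0)   # nothing is removed
```

With a spectral parameter of `1` every neighbour of a kept sample is removed, so the surviving samples share no edges. With `0` nothing is removed. Values in between interpolate between these two extremes.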


## How to use spectra

### Step 1: Define the spectral property, cross-split overlap, and the spectra dataset wrapper
### Step 1: Define the spectral property and the spectra dataset wrapper

To run spectra you must first define two important abstract classes, Spectra and SpectraDataset.

@@ -86,7 +114,7 @@ class [Name]_Dataset(SpectraDataset):
pass
```

Spectra implements the user definition of the spectral property and cross split overlap.
Spectra implements the user definition of the spectral property.


```python
@@ -103,52 +131,62 @@ class [Name]_spectra(spectra):
'''
return similarity

def cross_split_overlap(self, train, test):
'''
Define this function to return the overlap between a list of train and test samples.
```
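As a concrete illustration of what a user-defined spectral property could look like, the sketch below compares two protein sequences by position-wise identity. The functions `sequence_identity` and `spectra_property`, and the `threshold` and `binary` arguments shown here, are hypothetical and chosen for the example; a real project would more likely use an alignment-based or structural similarity.

```python
def sequence_identity(seq_a: str, seq_b: str) -> float:
    """Fraction of matching positions, normalized by the longer sequence."""
    if not seq_a and not seq_b:
        return 1.0
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / max(len(seq_a), len(seq_b))

def spectra_property(seq_a: str, seq_b: str, threshold: float = 0.3, binary: bool = True):
    """Return 1/0 if `binary` is True (samples share the property or not),
    otherwise return the continuous similarity."""
    identity = sequence_identity(seq_a, seq_b)
    return int(identity >= threshold) if binary else identity

similar = spectra_property("MKVL", "MKAL")                   # 3 of 4 positions match
dissimilar = spectra_property("MKVL", "AAAA")                # no positions match
continuous = spectra_property("MKVL", "MKAL", binary=False)  # raw similarity
```

The binary form produces an unweighted graph (an edge exists or it does not), while the continuous form produces edge weights.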
### Step 2: Initialize SPECTRA and calculate the flattened adjacency matrix

Example: Average pairwise similarity between train and test set protein sequences.
1. **Initialize SPECTRA**
- Initially, pass in no spectral property graph.

'''

2. **Pass SPECTRA and dataset into the `Spectra_Property_Graph_Constructor`**
- Additional arguments:
- **`num_chunks`**: If your dataset is very large, you can split up the construction into chunks to allow multiple jobs to compute similarity. This parameter controls the number of chunks.
- **`binary`**: If `True`, the similarity returns either `0` or `1`; otherwise, it returns a floating-point number.

return cross_split_overlap
```
### Step 2: Initialize SPECTRA and precalculate pairwise spectral properties
3. **Call `create_adjacency_matrix`**
- This function takes in the **chunk number** to calculate:
- If `num_chunks = 0`, the pairwise similarity is calculated in one go, so the input to `create_adjacency_matrix` should be `0`.
- If `num_chunks = 10`, the input should be the chunk number you want to calculate (e.g., `0` to `9`).

4. **Combine the adjacency matrices**
- Call `combine_adjacency_matrices()` in the graph constructor to combine all the adjacency matrices into a single matrix.

Initialize SPECTRA, passing in True or False to the binary argument if the spectral property returns a binary or continuous value. Then precalculate the pairwise spectral properties.

```python
init_spectra = [name]_spectra([name]_Dataset, binary = True)
init_spectra.pre_calculate_spectra_properties([name])
from spectrae import Spectra_Property_Graph_Constructor
spectra = [name]_spectra([name]_Dataset, spg=None)
construct_spg = Spectra_Property_Graph_Constructor(spectra, [name]_Dataset, num_chunks = 0, binary = [False/True])
construct_spg.create_adjacency_matrix(0)
construct_spg.combine_adjacency_matrices()
```
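To make the chunking idea concrete, the following sketch splits the list of sample pairs into roughly equal chunks, computes each chunk independently, and concatenates the results. The helpers `pair_chunk` and `similarity` are hypothetical and are not part of spectrae. Treating `num_chunks = 0` as a single chunk mirrors the description above but is an assumption about the library's behavior.

```python
from itertools import combinations

def pair_chunk(n_samples: int, num_chunks: int, chunk_number: int):
    """Return the sample pairs that belong to one chunk of the pairwise computation."""
    pairs = list(combinations(range(n_samples), 2))
    total_chunks = max(num_chunks, 1)          # 0 chunks means "compute everything in one go"
    size = -(-len(pairs) // total_chunks)      # ceiling division
    return pairs[chunk_number * size:(chunk_number + 1) * size]

def similarity(a: str, b: str) -> int:
    """Hypothetical binary property: do the two samples start with the same letter?"""
    return int(a[0] == b[0])

samples = ["apple", "avocado", "banana", "blueberry"]

# Each chunk could run as a separate job; here they run one after another.
chunks = [
    [similarity(samples[i], samples[j]) for i, j in pair_chunk(len(samples), 3, k)]
    for k in range(3)
]

# Combining the per-chunk results yields one flattened vector of pairwise values.
flattened = [value for chunk in chunks for value in chunk]
```

Because every chunk covers a disjoint slice of the pair list, the jobs need no coordination beyond agreeing on the sample order.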
### Step 3: Initialize SPECTRA and precalculate pairwise spectral properties

Generate SPECTRA splits. The ```generate_spectra_splits``` function takes in 4 important parameters:
1. ```number_repeats```: the number of times to rerun SPECTRA for the same spectral parameter, the number of repeats must equal the number of seeds as each rerun uses a different seed.
2. ```random_seed```: the random seeds used by each SPECTRA rerun, [42, 44] indicates two reruns the first of which will use a random seed of 42, the second will use 44.
3. ```spectra_parameters```: the spectral parameters to run on, they must range from 0 to 1 and be string formatted to the correct number of significant figures to avoid float formatting errors.
4. ```force_reconstruct```: True to force the model to regenerate SPECTRA splits even if they have already been generated.

### Step 3: Generate SPECTRA Splits

```python
spectra_parameters = {'number_repeats': 3,
'random_seed': [42, 44, 46],
'spectral_parameters': ["{:.2f}".format(i) for i in np.arange(0, 1.05, 0.05)],
'force_reconstruct': True,
}
1. **Initialize the Spectral Property Graph**
- Pass in the flattened adjacency matrix you just generated to the Spectral_Property_Graph to create the spectral property graph.

init_spectra.generate_spectra_splits(**spectra_parameters)
2. **Recreate SPECTRA**
- Use the SPECTRA dataset along with the created spectral property graph to reinstantiate SPECTRA.

3. **Call `generate_spectra_split`** with the following arguments:
- **`spectra_param`**: The spectral parameter to run, must be between `0` and `1` (inclusive).
- **`degree_choosing`**: Only applicable to binary graphs; optimizes the algorithm by prioritizing deletion of nodes with a low degree first.
- **`num_splits`**: Number of splits to generate (usually `20`, which translates to spectral parameters between `0` and `1` in intervals of `0.05`).
- **`path_to_save`**: Location to store generated SPECTRA splits.
- **`debug_mode`**: Controls the amount of information to output.

```python
spg = Spectral_Property_Graph(FlattenedAdjacency("flattened_adjacency_matrix.pt"))
spectra = [name]_spectra(dataset, spg)
spectra.generate_spectra_split(spectra_param, degree_choosing = [True/False], num_splits = [int], path_to_save="", debug_mode = [True/False])
```

### Step 4: Investigate generated SPECTRA splits

After SPECTRA has completed, investigate the generated splits. Specifically, check that on average the cross-split overlap decreases as the spectral parameter increases. This can be done by using ```return_all_split_stats``` to gather the cross_split_overlap, train size, and test size of each generated split. Example outputs can be seen in the tutorials.
After SPECTRA has completed, investigate the generated splits. Specifically, check that on average the cross-split overlap decreases as the spectral parameter increases. This can be done by using ```return_all_split_stats``` to gather the cross_split_overlap, train size, and test size of each generated split. Example outputs can be seen in the tutorials. The path_to_save should be the same path you used in the previous step.

```python
stats = init_spectra.return_all_split_stats()
plt.scatter(stats['SPECTRA_parameter'], stats['cross_split_overlap'])
spectra.return_all_split_stats(show_progress = True, path_to_save = save_path)
```
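A minimal sketch of that check, assuming the statistics are available as a mapping with `SPECTRA_parameter` and `cross_split_overlap` columns (the keys used in the earlier plotting snippet; the actual return type of `return_all_split_stats` may differ). `overlap_decreases` is a hypothetical helper and the numbers are made up.

```python
def overlap_decreases(stats) -> bool:
    """True if the mean cross-split overlap never increases as the spectral parameter grows."""
    by_param = {}
    for param, overlap in zip(stats["SPECTRA_parameter"], stats["cross_split_overlap"]):
        by_param.setdefault(float(param), []).append(overlap)
    # Average repeated splits that share a spectral parameter, then order by parameter.
    means = [sum(values) / len(values) for _, values in sorted(by_param.items())]
    return all(earlier >= later for earlier, later in zip(means, means[1:]))

# Hypothetical statistics for three spectral parameters, two of them with repeats.
example_stats = {
    "SPECTRA_parameter": ["0.00", "0.00", "0.50", "0.50", "1.00"],
    "cross_split_overlap": [0.9, 0.8, 0.5, 0.6, 0.1],
}
trend_ok = overlap_decreases(example_stats)
```

If the check fails, the spectral property may not be capturing the similarity that drives model performance, which is worth investigating before training on the splits.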

## Spectra tutorials
@@ -163,7 +201,7 @@ If there are any other tutorials of interest feel free to raise an issue!

## Background

SPECTRA is from a preprint. For more information on the preprint, the method behind SPECTRA, and the initial studies conducted with SPECTRA, check out the paper folder.
SPECTRA is [published](https://rdcu.be/d2D0z) in Nature Machine Intelligence. For more code about the method behind SPECTRA and the initial studies conducted with SPECTRA, check out the paper folder.

## Discussion and Development

@@ -185,15 +223,15 @@ All development discussions take place on GitHub in this repo in the issue track

2. *I have a foundation model that is pre-trained on a large amount of data. It is not feasible to do pairwise calculations of SPECTRA properties. How can I use SPECTRA?*

It is still possible to run SPECTRA on the foundation model (FM) and the evaluation dataset. You would use SPECTRA on the evaluation dataset then train and evaluate the foundation model on each SPECTRA split (either through linear probing, fine-tuning, or any other strategy) to calculate the AUSPC. Then you would determine the cross-split overlap between the pre-training dataset and the evaluation dataset. You would repeat this for multiple evaluation datasets, until you could plot FM AUSPC versus cross-split overlap to the evaluation dataset. For more details on what this would look like check out the [publication](https://www.biorxiv.org/content/10.1101/2024.02.25.581982v1), specifically section 5 of the results section. If there is large interest in this FAQ I can release a tutorial on this, just raise an issue!
It is still possible to run SPECTRA on the foundation model (FM) and the evaluation dataset. You would use SPECTRA on the evaluation dataset then train and evaluate the foundation model on each SPECTRA split (either through linear probing, fine-tuning, or any other strategy) to calculate the AUSPC. Then you would determine the cross-split overlap between the pre-training dataset and the evaluation dataset. You would repeat this for multiple evaluation datasets, until you could plot FM AUSPC versus cross-split overlap to the evaluation dataset. For more details on what this would look like check out the [publication](https://rdcu.be/d2D0z), specifically section 5 of the results section. If there is large interest in this FAQ I can release a tutorial on this, just raise an issue!

3. *I have a foundation model that is pre-trained on a large amount of data and **I do not have access to the pre-training data**. How can I use SPECTRA?*

This is a bit more tricky but there are [recent publications](https://arxiv.org/abs/2402.03563) that show these foundation models can represent uncertainty in the hidden representations they produce and a model can be trained to predict uncertainty from these representations. This uncertainty could represent the spectral property comparison between the pre-training and evaluation datasets. Though more work needs to be done, porting this work over would allow the application of SPECTRA in these settings. Again if there is large interest in this FAQ I can release a tutorial on this, just raise an issue!

4. *SPECTRA takes a long time to run. Is it worth it?*

The pairwise spectral property comparison is computationally expensive, but only needs to be done once. Generated SPECTRA splits are important resources that should be released to the public so others can utilize them without spending resources. For more details on the runtime of the method check out the [publication](https://www.biorxiv.org/content/10.1101/2024.02.25.581982v1), specifically section 6 of the results section. The computation can be sped up with CPU cores, which is a feature that will be released.
The pairwise spectral property comparison is computationally expensive, but only needs to be done once. Generated SPECTRA splits are important resources that should be released to the public so others can utilize them without spending resources. For more details on the runtime of the method check out the [publication](https://rdcu.be/d2D0z), specifically section 6 of the results section. The computation can be sped up with CPU cores, which is a feature that will be released.

If there are any other questions please raise them in the issues and I can address them. I'll keep adding to the FAQ as common questions begin to surface.

@@ -206,15 +244,20 @@ SPECTRA is under the MIT license found in the LICENSE file in this GitHub reposi
Please cite this paper when referring to SPECTRA.

```
@article {spectra,
author = {Yasha Ektefaie and Andrew Shen and Daria Bykova and Maximillian Marin and Marinka Zitnik and Maha R Farhat},
title = {Evaluating generalizability of artificial intelligence models for molecular datasets},
elocation-id = {2024.02.25.581982},
year = {2024},
doi = {10.1101/2024.02.25.581982},
URL = {https://www.biorxiv.org/content/early/2024/02/28/2024.02.25.581982},
eprint = {https://www.biorxiv.org/content/early/2024/02/28/2024.02.25.581982.full.pdf},
journal = {bioRxiv}
@ARTICLE{Ektefaie2024,
title = "Evaluating generalizability of artificial intelligence models
for molecular datasets",
author = "Ektefaie, Yasha and Shen, Andrew and Bykova, Daria and Marin,
Maximillian G and Zitnik, Marinka and Farhat, Maha",
journal = "Nat. Mach. Intell.",
publisher = "Springer Science and Business Media LLC",
volume = 6,
number = 12,
pages = "1512--1524",
month = dec,
year = 2024,
copyright = "https://www.springernature.com/gp/researchers/text-and-data-mining",
language = "en"
}
```

5 changes: 3 additions & 2 deletions spectrae/__init__.py
@@ -1,2 +1,3 @@
from .spectra import Spectra
from .dataset import SpectraDataset
from .spectra import Spectra, Spectra_Property_Graph_Constructor
from .dataset import SpectraDataset
from .utils import Spectral_Property_Graph, FlattenedAdjacency, plot_split_stats
37 changes: 20 additions & 17 deletions spectrae/dataset.py
@@ -1,38 +1,41 @@
from abc import ABC, abstractmethod
from typing import List, Dict

class SpectraDataset(ABC):

def __init__(self, input_file, name):
self.input_file = input_file
self.name = name
self.samples = self.parse(input_file)

@abstractmethod
def sample_to_index(self, idx):
"""
Given a sample, return the data idx
"""
pass

self.sample_to_index = self.parse(input_file)
self.samples = list(self.sample_to_index.keys())
self.samples.sort()
self.index_map = {value: idx for idx, value in enumerate(self.samples)}

@abstractmethod
def parse(self, input_file):
def parse(self, input_file: str) -> Dict:
"""
Given a dataset file, parse the dataset file.
Make sure there are only unique entries!
Given a dataset file, parse the dataset file to return a dictionary mapping a sample ID to the data
"""
pass
raise NotImplementedError("Must implement parse method to use SpectraDataset, see documentation for more information")

@abstractmethod
def __len__(self):
"""
Return the length of the dataset
"""
pass
return len(self.samples)

@abstractmethod
def __getitem__(self, idx):
"""
Given a dataset idx, return the element at that index
"""
pass
if isinstance(idx, int):
return self.sample_to_index[self.samples[idx]]
return self.sample_to_index[idx]

def index(self, value):
"""
Given a value, return the index of that value
"""
if value not in self.index_map:
raise ValueError(f"{value} not in the dataset")
return self.index_map[value]
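To show how the reworked base class might be used, here is a self-contained sketch. The `SpectraDataset` below is a stand-in copied from the new side of this diff so the example runs without spectrae installed; it keeps only `parse` abstract, which is an assumption about the final class. `ToySequenceDataset` and the tab-separated file format are hypothetical.

```python
import tempfile
from abc import ABC, abstractmethod
from typing import Dict

class SpectraDataset(ABC):
    """Stand-in mirroring the new version of the class shown in this diff."""

    def __init__(self, input_file, name):
        self.input_file = input_file
        self.name = name
        self.sample_to_index = self.parse(input_file)   # sample ID -> data
        self.samples = sorted(self.sample_to_index.keys())
        self.index_map = {value: idx for idx, value in enumerate(self.samples)}

    @abstractmethod
    def parse(self, input_file: str) -> Dict:
        raise NotImplementedError("Must implement parse method")

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        # Integer positions go through the sorted sample list; anything else is a sample ID.
        if isinstance(idx, int):
            return self.sample_to_index[self.samples[idx]]
        return self.sample_to_index[idx]

    def index(self, value):
        if value not in self.index_map:
            raise ValueError(f"{value} not in the dataset")
        return self.index_map[value]

class ToySequenceDataset(SpectraDataset):
    """Hypothetical dataset: one 'sample_id<TAB>sequence' pair per line."""

    def parse(self, input_file: str) -> Dict:
        data = {}
        with open(input_file) as handle:
            for line in handle:
                sample_id, sequence = line.rstrip("\n").split("\t")
                data[sample_id] = sequence
        return data

# Write a tiny input file so the example is runnable end to end.
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as tmp:
    tmp.write("seq_b\tMKV\nseq_a\tMAL\n")

dataset = ToySequenceDataset(tmp.name, "toy")
```

Because samples are sorted by ID in `__init__`, `dataset[0]` refers to `seq_a` even though it appears second in the file, and `dataset.index("seq_b")` recovers the position of a given ID.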