mims-harvard · yashaektefaie · Dec 21, 2024 · Dec 23, 2024 · Dec 23, 2024 · Dec 23, 2024
diff --git a/README.md b/README.md
@@ -44,9 +44,37 @@ Or alternatively run this command:
 
 Please note there is another package called spectra which is not related to this tool. Spectrae (which stands for spectral evaluation) implements the spectral framework for model evaluation.
 
+## Definition of terms
+
+This work and GitHub repository use terms related to the **spectral framework for model evaluation**. Below is a quick refresher on these key concepts.
+
+### **Spectral Property**
+Every dataset has an underlying property that, as it changes, causes model performance to decrease. This is referred to as the **spectral property**.  
+
+However, **not every property qualifies as a spectral property**.  
+For example:
+- When predicting protein structure, the performance of a protein folding model does **not** change based on the number of **M** amino acids in a sequence.
+- Instead, model performance **does** change based on **structural similarity**—this is an example of a **spectral property**.
+
+### **Spectral Property Graph (SPG)**
+For a given dataset, a **spectral property graph (SPG)** is defined as:
+- **Nodes**: Samples in the dataset.
+- **Edges**: Connections between samples that share a spectral property.
+
+Every SPG is represented by a flattened adjacency matrix, which saves memory and allows SPECTRA to use GPUs to speed up computation.
+
+### **Spectral Parameter**
+The **spectral parameter** can be thought of as a **sparsification probability**.  
+
+When SPECTRA runs on an SPG:
+1. It selects a random node.
+2. It decides whether to **delete edges** with a certain probability—this probability is the **spectral parameter**.
+3. The closer the spectral parameter is to **1**, the **stricter** the splits generated by SPECTRA will be.
+
+
 ## How to use spectra
 
-### Step 1: Define the spectral property, cross-split overlap, and the spectra dataset wrapper
+### Step 1: Define the spectral property and the spectra dataset wrapper
 
 To run spectra you must first define important two abstract classes, Spectra and SpectraDataset. 
 
@@ -86,7 +114,7 @@ class [Name]_Dataset(SpectraDataset):
         pass
 ```
 
-Spectra implements the user definition of the spectra property and cross split overlap.
+Spectra implements the user definition of the spectral property.
 
 
 ```python 
@@ -103,52 +131,62 @@ class [Name]_spectra(spectra):
         '''
         return similarity
 
-    def cross_split_overlap(self, train, test):
-        '''
-            Define this function to return the overlap between a list of train and test samples.
+```
+### Step 2: Initialize SPECTRA and calculate the flattened adjacency matrix
 
-            Example: Average pairwise similarity between train and test set protein sequences.
+1. **Initialize SPECTRA**  
+   - Initially, pass in no spectral property graph.
 
-        '''
-
+2. **Pass SPECTRA and dataset into the `Spectra_Property_Graph_Constructor`**  
+   - Additional arguments:
+     - **`num_chunks`**: If your dataset is very large, you can split up the construction into chunks to allow multiple jobs to compute similarity. This parameter controls the number of chunks.
+     - **`binary`**: If `True`, the similarity returns either `0` or `1`; otherwise, it returns a floating-point number.
 
-        return cross_split_overlap
-```
-### Step 2: Initialize SPECTRA and precalculate pairwise spectral properties
+3. **Call `create_adjacency_matrix`**  
+   - This function takes in the **chunk number** to calculate:
+     - If `num_chunks = 0`, the pairwise similarity is calculated in one go, so the input to `create_adjacency_matrix` should be `0`.
+     - If `num_chunks = 10`, the input should be the chunk number you want to calculate (e.g., `0` to `9`).
+
+4. **Combine the adjacency matrices**  
+   - Call `combine_adjacency_matrices()` in the graph constructor to combine all the adjacency matrices into a single matrix.
 
-Initialize SPECTRA, passing in True or False to the binary argument if the spectral property returns a binary or continuous value. Then precalculate the pairwise spectral properties.
 
 ```python
-init_spectra = [name]_spectra([name]_Dataset, binary = True)
-init_spectra.pre_calculate_spectra_properties([name])
+from spectrae import Spectra_Property_Graph_Constructor
+spectra = [name]_spectra([name]_Dataset, spg=None)
+construct_spg = Spectra_Property_Graph_Constructor(spectra, [name]_Dataset, num_chunks = 0, binary = [False/True])
+construct_spg.create_adjacency_matrix(0)
+construct_spg.combine_adjacency_matrices()
 ```
-### Step 3: Initialize SPECTRA and precalculate pairwise spectral properties
 
-Generate SPECTRA splits. The ```generate_spectra_splits``` function takes in 4 important parameters: 
-1. ```number_repeats```: the number of times to rerun SPECTRA for the same spectral parameter, the number of repeats must equal the number of seeds as each rerun uses a different seed. 
-2. ```random_seed```: the random seeds used by each SPECTRA rerun, [42, 44] indicates two reruns the first of which will use a random seed of 42, the second will use 44. 
-3. ```spectra_parameters```: the spectral parameters to run on, they must range from 0 to 1 and be string formatted to the correct number of significant figures to avoid float formatting errors.
-4. ```force_reconstruct```: True to force the model to regenerate SPECTRA splits even if they have already been generated.
 
+### Step 3: Generate SPECTRA splits
 
-```python
-spectra_parameters = {'number_repeats': 3, 
-                      'random_seed': [42, 44, 46],
-                      'spectral_parameters': ["{:.2f}".format(i) for i in np.arange(0, 1.05, 0.05)],
-                      'force_reconstruct': True,
-                                              }
+1. **Initialize the Spectral Property Graph**  
+   - Pass the flattened adjacency matrix you just generated to `Spectral_Property_Graph` to create the spectral property graph.
 
-init_spectra.generate_spectra_splits(**spectra_parameters)
+2. **Recreate SPECTRA**  
+   - Use the SPECTRA dataset along with the created spectral property graph to reinstantiate SPECTRA.
 
+3. **Call `generate_spectra_split`** with the following arguments:  
+   - **`spectra_param`**: The spectral parameter to run, must be between `0` and `1` (inclusive).  
+   - **`degree_choosing`**: Only applicable to binary graphs; optimizes the algorithm by prioritizing deletion of nodes with a low degree first.  
+   - **`num_splits`**: Number of splits to generate (usually `20`, which translates to spectral parameters between `0` and `1` in intervals of `0.05`).  
+   - **`path_to_save`**: Location to store generated SPECTRA splits.  
+   - **`debug_mode`**: Controls the amount of information to output. 
+
+```python
+from spectrae import Spectral_Property_Graph, FlattenedAdjacency
+spg = Spectral_Property_Graph(FlattenedAdjacency("flattened_adjacency_matrix.pt"))
+spectra = [name]_spectra(dataset, spg)
+spectra.generate_spectra_split(spectra_param, degree_choosing = [True/False], num_splits = [int], path_to_save="", debug_mode = [True/False])
 ```
 
 ### Step 4: Investigate generated SPECTRA splits
 
-After SPECTRA has completed, the user should investigate the generated splits. Specifically ensuring that on average the cross-split overlap decreases as the spectral parameter increases. This can be achieved by using ```return_all_split_stats``` to gather the cross_split_overlap, train size, and test size of each generated split. Example outputs can be seen in the tutorials. 
+After SPECTRA has completed, investigate the generated splits. Specifically, check that the average cross-split overlap decreases as the spectral parameter increases. Use ```return_all_split_stats``` to gather the cross_split_overlap, train size, and test size of each generated split. Example outputs can be seen in the tutorials. The `path_to_save` argument should be the same path you used in the previous step.
 
 ```python
-stats = init_spectra.return_all_split_stats()
-plt.scatter(stats['SPECTRA_parameter'], stats['cross_split_overlap'])
+spectra.return_all_split_stats(show_progress = True, path_to_save = save_path)
 ```
 
 ## Spectra tutorials
@@ -163,7 +201,7 @@ If there are any other tutorials of interest feel free to raise an issue!
 
 ## Background
 
-SPECTRA is from a preprint, for more information on the preprint, the method behind SPECTRA, and the initials studies conducted with SPECTRA, check out the paper folder. 
+SPECTRA is [published](https://rdcu.be/d2D0z) in Nature Machine Intelligence. For code related to the method behind SPECTRA and the initial studies conducted with SPECTRA, check out the paper folder.
 
 ## Discussion and Development
 
@@ -185,15 +223,15 @@ All development discussions take place on GitHub in this repo in the issue track
 
 2. *I have a foundation model that is pre-trained on a large amount of data. It is not feasible to do pairwise calculations of SPECTRA properties. How can I use SPECTRA?*
 
-    It is still possible to run SPECTRA on the foundation model (FM) and the evaluation dataset. You would use SPECTRA on the evaluation dataset then train and evaluate the foundation model on each SPECTRA split (either through linear probing, fine-tuning, or any other strategy) to calculate the AUSPC. Then you would determine the cross-split overlap between the pre-training dataset and the evaluation dataset. You would repeat this for multiple evaluation datasets, until you could plot FM AUSPC versus cross-split overlap to the evaluation dataset. For more details on what this would look like check out the [publication](https://www.biorxiv.org/content/10.1101/2024.02.25.581982v1), specifically section 5 of the results section. If there is large interest in this FAQ I can release a tutorial on this, just raise an issue! 
+    It is still possible to run SPECTRA on the foundation model (FM) and the evaluation dataset. You would use SPECTRA on the evaluation dataset, then train and evaluate the foundation model on each SPECTRA split (through linear probing, fine-tuning, or any other strategy) to calculate the AUSPC. Then you would determine the cross-split overlap between the pre-training dataset and the evaluation dataset. You would repeat this for multiple evaluation datasets until you could plot FM AUSPC versus cross-split overlap to the evaluation dataset. For more details on what this would look like, check out the [publication](https://rdcu.be/d2D0z), specifically section 5 of the results section. If there is large interest in this FAQ, I can release a tutorial on it; just raise an issue!
 
 3. *I have a foundation model that is pre-trained on a large amount of data and **I do not have access to the pre-training data**. How can I use SPECTRA?*
 
     This is a bit more tricky but there are [recent publications](https://arxiv.org/abs/2402.03563) that show these foundation models can represent uncertainty in the hidden representations they produce and a model can be trained to predict uncertainty from these representations. This uncertainty could represent the spectral property comparison between the pre-training and evaluation datasets. Though more work needs to be done, porting this work over would allow the application of SPECTRA in these settings. Again if there is large interest in this FAQ I can release a tutorial on this, just raise an issue! 
 
 4. *SPECTRA takes a long time to run is it worth it?*
 
-    The pairwise spectral property comparison is computationally expensive, but only needs to be done once. Generated SPECTRA splits are important resources that should be released to the public so others can utlilize them without spending resources. For more details on the runtime of the method check out the [publication](https://www.biorxiv.org/content/10.1101/2024.02.25.581982v1), specifically section 6 of the results section. The computation can be sped up with cpu cores, which is a feature that will be released.
+    The pairwise spectral property comparison is computationally expensive, but it only needs to be done once. Generated SPECTRA splits are important resources that should be released to the public so others can utilize them without spending resources. For more details on the runtime of the method, check out the [publication](https://rdcu.be/d2D0z), specifically section 6 of the results section. The computation can be sped up with multiple CPU cores, a feature that will be released.
 
 If there are any other questions please raise them in the issues and I can address them. I'll keep adding to the FAQ as common questions begin to surface.
 
@@ -206,15 +244,20 @@ SPECTRA is under the MIT license found in the LICENSE file in this GitHub reposi
 Please cite this paper when referring to SPECTRA.
 
 ```
-@article {spectra,
-	author = {Yasha Ektefaie and Andrew Shen and Daria Bykova and Maximillian Marin and Marinka Zitnik and Maha R Farhat},
-	title = {Evaluating generalizability of artificial intelligence models for molecular datasets},
-	elocation-id = {2024.02.25.581982},
-	year = {2024},
-	doi = {10.1101/2024.02.25.581982},
-	URL = {https://www.biorxiv.org/content/early/2024/02/28/2024.02.25.581982},
-	eprint = {https://www.biorxiv.org/content/early/2024/02/28/2024.02.25.581982.full.pdf},
-	journal = {bioRxiv}
+@ARTICLE{Ektefaie2024,
+  title     = "Evaluating generalizability of artificial intelligence models
+               for molecular datasets",
+  author    = "Ektefaie, Yasha and Shen, Andrew and Bykova, Daria and Marin,
+               Maximillian G and Zitnik, Marinka and Farhat, Maha",
+  journal   = "Nat. Mach. Intell.",
+  publisher = "Springer Science and Business Media LLC",
+  volume    =  6,
+  number    =  12,
+  pages     = "1512--1524",
+  month     =  dec,
+  year      =  2024,
+  copyright = "https://www.springernature.com/gp/researchers/text-and-data-mining",
+  language  = "en"
 }
 ```
 

diff --git a/spectrae/__init__.py b/spectrae/__init__.py
@@ -1,2 +1,3 @@
-from .spectra import Spectra
-from .dataset import SpectraDataset
+from .spectra import Spectra, Spectra_Property_Graph_Constructor
+from .dataset import SpectraDataset
+from .utils import Spectral_Property_Graph, FlattenedAdjacency, plot_split_stats
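
The README changes above describe an SPG stored as a flattened adjacency matrix and a spectral parameter that acts as a sparsification probability. The sketch below illustrates those two ideas in plain Python. It is not the `spectrae` implementation: the upper-triangle layout, the helper names (`flat_index`, `build_flat_adjacency`, `sparsify`), the toy similarity rule, and the independent per-edge deletion are all assumptions made for illustration.

```python
import random
from itertools import combinations


def flat_index(i, j, n):
    """Position of the unordered pair (i, j) in a row-major upper-triangle list."""
    if i == j:
        raise ValueError("self-edges are not stored in this sketch")
    if i > j:
        i, j = j, i
    # Rows 0..i-1 hold (n-1) + (n-2) + ... + (n-i) entries in total.
    return i * n - i * (i + 1) // 2 + (j - i - 1)


def build_flat_adjacency(samples, similar):
    """Store one 0/1 entry per unordered pair instead of a full n x n matrix."""
    n = len(samples)
    flat = [0] * (n * (n - 1) // 2)
    for i, j in combinations(range(n), 2):
        flat[flat_index(i, j, n)] = 1 if similar(samples[i], samples[j]) else 0
    return flat


def sparsify(flat, probability, rng):
    """Delete each existing edge independently with the given probability.

    Assumption: a simplified stand-in for the edge deletion step described
    in the README, not SPECTRA's actual split algorithm.
    """
    return [0 if edge and rng.random() < probability else edge for edge in flat]


# Toy binary spectral property (hypothetical): two words share a first letter.
samples = ["apple", "avocado", "banana", "blueberry"]
flat = build_flat_adjacency(samples, lambda a, b: a[0] == b[0])
print(flat)

rng = random.Random(0)
for probability in (0.0, 0.5, 1.0):
    remaining = sum(sparsify(flat, probability, rng))
    print(probability, remaining)
```

For `n` samples the flattened form holds `n * (n - 1) / 2` entries rather than `n * n`, which is one way such a layout could save memory for a symmetric property.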
diff --git a/spectrae/dataset.py b/spectrae/dataset.py
@@ -1,38 +1,41 @@
 from abc import ABC, abstractmethod
+from typing import List, Dict
 
 class SpectraDataset(ABC):
 
     def __init__(self, input_file, name):
         self.input_file = input_file
         self.name = name
-        self.samples = self.parse(input_file)
-
-    @abstractmethod
-    def sample_to_index(self, idx):
-        """
-        Given a sample, return the data idx
-        """
-        pass
-
+        self.sample_to_index = self.parse(input_file)
+        self.samples = list(self.sample_to_index.keys())
+        self.samples.sort()
+        self.index_map = {value: idx for idx, value in enumerate(self.samples)}
 
     @abstractmethod
-    def parse(self, input_file):
+    def parse(self, input_file: str) -> Dict:
         """
-        Given a dataset file, parse the dataset file. 
-        Make sure there are only unique entries!
+        Given a dataset file, parse the dataset file to return a dictionary mapping a sample ID to the data
         """
-        pass
+        raise NotImplementedError("Must implement parse method to use SpectraDataset, see documentation for more information")
 
-    @abstractmethod
     def __len__(self):
         """
         Return the length of the dataset
         """
-        pass
+        return len(self.samples)
 
-    @abstractmethod
     def __getitem__(self, idx):
         """
         Given a dataset idx, return the element at that index
         """
-        pass
+        if isinstance(idx, int):
+            return self.sample_to_index[self.samples[idx]]
+        return self.sample_to_index[idx]
+
+    def index(self, value):
+        """
+        Given a value, return the index of that value
+        """
+        if value not in self.index_map:
+            raise ValueError(f"{value} not in the dataset")
+        return self.index_map[value]
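
A minimal sketch of how the reworked base class might be subclassed now that only `parse` is abstract. The base class is copied from the diff above so the example runs without `spectrae` installed. `ToyDataset`, the tab-separated file layout, and the sample IDs are hypothetical and chosen only for illustration.

```python
import os
import tempfile
from abc import ABC, abstractmethod
from typing import Dict


class SpectraDataset(ABC):
    # Copied from the new spectrae/dataset.py in this diff.
    def __init__(self, input_file, name):
        self.input_file = input_file
        self.name = name
        self.sample_to_index = self.parse(input_file)
        self.samples = list(self.sample_to_index.keys())
        self.samples.sort()
        self.index_map = {value: idx for idx, value in enumerate(self.samples)}

    @abstractmethod
    def parse(self, input_file: str) -> Dict:
        raise NotImplementedError("Must implement parse method")

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        if isinstance(idx, int):
            return self.sample_to_index[self.samples[idx]]
        return self.sample_to_index[idx]

    def index(self, value):
        if value not in self.index_map:
            raise ValueError(f"{value} not in the dataset")
        return self.index_map[value]


class ToyDataset(SpectraDataset):
    # Hypothetical subclass: one "sample_id<TAB>sequence" pair per line.
    def parse(self, input_file: str) -> Dict:
        data = {}
        with open(input_file) as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                sample_id, sequence = line.split("\t")
                data[sample_id] = sequence
        return data


# Write a tiny illustrative file, then load it through the subclass.
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as tmp:
    tmp.write("seq_b\tMKV\nseq_a\tMAL\nseq_c\tMTT\n")
    toy_path = tmp.name

dataset = ToyDataset(toy_path, "toy")
os.remove(toy_path)

print(len(dataset))            # number of samples
print(dataset[0])              # integer index into the sorted sample IDs
print(dataset["seq_b"])        # lookup by sample ID
print(dataset.index("seq_c"))  # position of an ID in the sorted order
```

Two behaviors of the new base class are visible here. Integer indices follow the sorted order of the sample IDs, not the order of lines in the file. Because `__getitem__` treats any `int` as a position, a dataset keyed by integer sample IDs could not be looked up by ID through `dataset[...]`.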