Skip to content

zmcgrath96/ms_map

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ms_map

A hash map used for the sole purpose of storing mass spectrometry data

Background

What is mass spectrometry data

If you hadn't heard of mass spectrometry, I doubt you will get much out of this datastructure, but feel free to stick around and learn!

Mass spectrometry data is data that is generated from a mass spectrometer (duh). More specifically, this data structure is meant for identify peptides (short sequences of amino acids) from the list of numbers (floats) from a mass spectrometer run. Amino acids are described as a sequence masses.

The primary use of this data structure, to be even MORE specific, is by using singly and doubly charged masses to identify a peptide. The output, at its most basic form, is a list of floating point numbers. It might look something like the following

[99.023, 140.743, 209.887, 288.115, 402.778]

This sequence of floating point numbers describe a particular sequence of amino acids. The way this works is as follows:

Say our sequence of amino acids is: MALW (don't shoot me I know the masses don't make sense). The above sequence of amino acids describe any arrangement of ion types and charges. For the sake of this example, we will limit our scope to what're called b and y ions with possible charges of 1 and 2. The b ions describe amino acids in a left-to-right fashion, and y right-to-left. So for our example sequence, we can break it down to something like:

  b1  b2  b3
M | A | L | W
  y3  y2  y1

So M is described by both a b ion and a y ion. The difference is that b is the mass of M and y is really the mass of ALW. But you can see how these ions complement eachother.

To dive a bit deeper, charges, as the name suggests, are the actual charges of the molecules and amino acid chains from the mass spectrometer. The most common of which are 1 and 2. So we could have up to (in this case) 4 different combinations of ions (b+, b++, y+, y++) that describe each junction.

Hash map with ms data

If you don't know anything about the hash map data structure, look through this wikipedia page.

The hash map works as follows:

Insertions

A pair of a mass (a floating point number) and a sequence (string of amino acids) are added.

insert(mass, spectrum)

The integer value of the mass is used as the key. Then the pair of the mass, sequence, is inserted into the list at the entry of the integer mass.

input: 123.456, ABC
key: 123
inserting value: 123.456, ABC

Any value with the SAME mass is attached to the same mass value to avoid redunancy

table: {123: [(123.456, [ABC])]}
insert: 123.456, XYZ
table: {123: [(123.456, [ABC, XYZ])]}

Values that do NOT have the same mass are not inserted in the same spot

table: {123: [(123.456, [ABC, XYZ])]}
insert: 567.890, LMNOP
table: {123: [(123.456, [ABC, XYZ])], 567: [(567.890, [LMNOP])]}

Lookups

A lookup on the table is a lookup on masses, not sequences. Much like an insertion, the lookup starts with taking the integer value of a mass. This is used as the search key.

search(mass, tolerance)

The tolerance parameter is used to determine how much leeway is allowed when searching for a mass. Once the key is extracted, we can look through a list for values that fall within the upper and lower bounds of the search mass.

search parameters: 123.455, 20
tolerance: 20 ppm of 123.455 -> 0.0025
upper bound: 123.455 + .0025 = 123.4575
lower bound: 123.455 - .0025 =  123.4525
key: 123

table entries at 123: [(123.456, [ABC, XYZ]), (123.999, [YYY])]
table entries at 123 with values between 123.4525 and 123.4575: [(123.456, [ABC, XYZ])]

Then the seqeunces of the values that fall within the mass range are returned

output: [ABC, XYZ]

Installation

To install the python bindings as a package, run

pip install ms_map

To install the source code, run

git clone https://github.com/zmcgrath96/ms_map.git

To build the C++ code and run tests, run

cd ms_map
$ms_map> chmod u+x build_and_test.sh
$ms_map> ./build_and_test.sh

Example

This example is taken from src/main.cpp

#include <iostream>
#include <vector>

#include "MSMap.hpp"

int main(){

    MSMap * map = new MSMap();

    float mass = 316.308;
    std::string sequence = "ZAM";

    std::cout << "Adding mass with sequence to the map\n";
    map->insert(mass, sequence);

    mass = 218.095;
    sequence = "PSM";

    std::cout << "Adding a different mass with a different sequence to the map\n";
    map->insert(mass, sequence);

    std::cout << "Searching for mass that we put in there\n";
    std::vector<std::string> results = map->search((float)316.302, 20);

    std::cout << "Results:\n";
    for (std::string result: results) std::cout << result + " ";
    std::cout << "\n";

    return 1;
}

API

  • void insert(float mass, std::string sequence): insert a mass with a sequence into the table
  • std::vector< std::string > search(float mass, int ppmTol): search for the mass with an upper and lower bound value calculated from the ppm tolerance parameter. Returns all sequences with a mass in this range.

About

No description or website provided.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages