Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bonds() showing empty #137

Open
David-OConnor opened this issue Jan 31, 2025 · 9 comments
Open

bonds() showing empty #137

David-OConnor opened this issue Jan 31, 2025 · 9 comments

Comments

@David-OConnor
Copy link

David-OConnor commented Jan 31, 2025

On all the molecules I've loaded, e.g. from the Protein Data Bank. mmCIF, and PDB format. Atoms load correctly. Any idea? Ty!

@David-OConnor David-OConnor changed the title bonds() showing empty` bonds() showing empty Jan 31, 2025
@David-OConnor
Copy link
Author

Update: Brief code review: bonds() requires bonds (private field) to already be populated; it's not getting populated is the root cause. So, the problem is upstream of bonds().

@David-OConnor
Copy link
Author

David-OConnor commented Feb 1, 2025

Hey - here's working code to get the bonds, using bond len. Standard practice from what I gather. I'm happy if you want to copy+paste this into pdbtbx. I think it may be faster that way than a PR for both of us. I'm happy with it in my application code, or integrated into PDBTBX.

This is rough and likely has low-hanging fruit to improve upon, but it gets the job done to first order.

//! This module creates bonds between protein components. Most macromolecule PDB/CIF files don't include
//! explicit bond information, and the `pdbtbx` library doesn't handle this. Infer bond lengths
//! by comparing each interactomic bond distance, and matching against known amino acid bond lengths.
//!
//! Some info here: https://www.ruppweb.org/Xray/tutorial/protein_structure.htm
//! https://itp.uni-frankfurt.de/~engel/amino.html
//! These lengths are in angstrom.

use std::{collections::HashMap, time::Instant};

use rayon::prelude::*;

use crate::{Atom, Bond, BondType};

// Peptide
// Double bond len of C' to N.
const LEN_CP_N: f64 = 1.33;
const LEN_N_CALPHA: f64 = 1.46;
const LEN_CALPHA_CP: f64 = 1.53;

// Single bond
const LEN_C_C: f64 = 1.54;
const LEN_C_N: f64 = 1.48;
const LEN_C_O: f64 = 1.43;

// Hydrogen
const LEN_OH_OH: f64 = 2.8;
const LEN_NH_OC: f64 = 2.9;
const LEN_OH_OC: f64 = 2.8;

// Bonds to H. Mostly ~1
const LEN_N_H: f64 = 1.00;
const LEN_C_H: f64 = 1.10;
const LEN_O_H: f64 = 1.0; // In water molecules. What is it in proteins?

// If interatomic distance is within this distance of one of our known bond lenghts, consider it to be a bond.
const BOND_LEN_THRESH: f64 = 0.04; // todo: Adjust A/R based on performance.
const GRID_SIZE: f64 = 3.0; // Slightly larger than the largest bond threshold

/// Infer bonds from atom distances. Uses spacial partitioning for efficiency.
pub fn create_bonds(atoms: &[Atom]) -> Vec<Bond> {
    let lens_covalent = vec![
        LEN_CP_N,
        LEN_N_CALPHA,
        LEN_CALPHA_CP,
        LEN_C_C,
        LEN_C_N,
        LEN_C_O,
        LEN_N_H,
        LEN_C_H,
        LEN_O_H,
    ];

    let lens_hydrogen = vec![LEN_OH_OH, LEN_NH_OC, LEN_OH_OC];

    // todo: Paralllize?
    let mut result = Vec::new();

    // We use spacial partitioning, so as not to copmare every pair of atoms.
    let mut grid: HashMap<(i32, i32, i32), Vec<usize>> = HashMap::new();

    // Insert atoms into grid
    for (i, atom) in atoms.iter().enumerate() {
        let grid_pos = (
            (atom.posit.x / GRID_SIZE).floor() as i32,
            (atom.posit.y / GRID_SIZE).floor() as i32,
            (atom.posit.z / GRID_SIZE).floor() as i32,
        );
        grid.entry(grid_pos).or_default().push(i);
    }

    // Check pairs only within nearby bins
    let neighbor_offsets = [
        (0, 0, 0),
        (1, 0, 0),
        (-1, 0, 0),
        (0, 1, 0),
        (0, -1, 0),
        (0, 0, 1),
        (0, 0, -1),
        (1, 1, 0),
        (-1, -1, 0),
        (1, 0, 1),
        (-1, 0, -1),
        (0, 1, 1),
        (0, -1, -1),
        (1, 1, 1),
        (-1, -1, -1),
    ];

    for (&cell, atom_indices) in &grid {
        for offset in &neighbor_offsets {
            let neighbor_cell = (cell.0 + offset.0, cell.1 + offset.1, cell.2 + offset.2);
            if let Some(neighbor_indices) = grid.get(&neighbor_cell) {
                for &i in atom_indices {
                    for &j in neighbor_indices {
                        if i >= j {
                            continue;
                        }

                        let atom_0 = &atoms[i];
                        let atom_1 = &atoms[j];
                        let dist = (atom_0.posit - atom_1.posit).magnitude();

                        for (lens, bond_type) in [
                            (&lens_covalent, BondType::Covalent),
                            (&lens_hydrogen, BondType::Hydrogen),
                        ] {
                            for &bond_len in lens {
                                if (dist - bond_len).abs() < BOND_LEN_THRESH {
                                    result.push(Bond {
                                        bond_type,
                                        posit_0: atom_0.posit,
                                        posit_1: atom_1.posit,
                                    });
                                    break;
                                }
                            }
                        }
                    }
                }
            }
        }
    }

    result
}

@douweschulte
Copy link
Owner

As you have seen I only populated the bonds if these are specifically specified in the file. That is the reason why they are not populated in many cases. The length based detection looks nice, I will look a bit closer when I add it in.

douweschulte added a commit that referenced this issue Feb 5, 2025
@douweschulte
Copy link
Owner

I copied the code in and edited it a bit to comply more with the used patterns. Mostly I removed the atom chunking in favour of usage of the build in rstar rtree which was intended for such work. Let me know if this captures what you needed or if you want to see some more changes.

Also I am not entirely sure how it handles multiple bonds. It does not seem to do any checks on if each atom does not overstate its correct number of bonds. Do you think some logic to enforce this could be needed? It might be a good thing if we could run this code for all test pdbs in a test somewhere, but then I might need a good test to check if all of the bonds are actually sensible. Do you have some knowledge to know how to do that properly?

@David-OConnor
Copy link
Author

David-OConnor commented Feb 5, 2025

Great stuff; thank you!

Some notes from my own experiments using slightly modified code above:

  • Some of the bonds don't work properly. (mostly missing; maybe some false bonds?) Likely due to needing to adjust the known-length library.
  • I also added some work on inferring bond character (double bonds etc). Currently not working correctly, but you get the point.
  • The chunking algo makes what would be straightforward code complicated and hard to understand, but it dramatically improves speed over the naive approach of comparing every atom pair. It sounds like you already have a system to handle that!

This solves the issue; ty! So, what I posted is "good enough to start", but definitely needs some tweaking to make it work in all cases.

Slightly updated example that includes multiple bonds. (But generally broken in that regard):

//! This module creates bonds between protein components. Most macromolecule PDB/CIF files don't include
//! explicit bond information, and the `pdbtbx` library doesn't handle this. Infer bond lengths
//! by comparing each interactomic bond distance, and matching against known amino acid bond lengths.
//!
//! Some info here: https://www.ruppweb.org/Xray/tutorial/protein_structure.htm
//! https://itp.uni-frankfurt.de/~engel/amino.html
//! These lengths are in angstrom.

use std::collections::HashMap;

use crate::molecule::{Atom, Bond, BondCount, BondType};

// Peptide
// Double bond len of C' to N.
const LEN_CP_N: f64 = 1.33;
const LEN_N_CALPHA: f64 = 1.46;
const LEN_CALPHA_CP: f64 = 1.53;

// Single bond
const LEN_C_C: f64 = 1.54;
const LEN_C_N: f64 = 1.48;
const LEN_C_O: f64 = 1.43;

// todo: Found this elsewhere. Likely conflict with C_O above?
const LEN_C_O_DOUBLE: f64 = 1.23;

// todo: MOre sidechain bonds, like carbon-carbon.

// Hydrogen
const LEN_OH_OH: f64 = 2.8;
const LEN_NH_OC: f64 = 2.9;
const LEN_OH_OC: f64 = 2.8;

// Bonds to H. Mostly ~1
const LEN_N_H: f64 = 1.00;
const LEN_C_H: f64 = 1.10;
const LEN_O_H: f64 = 1.0;

// If interatomic distance is within this distance of one of our known bond lenghts, consider it to be a bond.
const BOND_LEN_THRESH: f64 = 0.05; // todo: Adjust A/R based on performance.
const BOND_LEN_THRESH_FINE: f64 = 0.01; // todo: Adjust A/R based on performance.
const GRID_SIZE: f64 = 3.0; // Slightly larger than the largest bond threshold

/// Infer bonds from atom distances. Uses spacial partitioning for efficiency.
/// We Check pairs only within nearby bins
pub fn create_bonds(atoms: &[Atom]) -> Vec<Bond> {
    let lens_covalent = vec![
        LEN_CP_N,
        LEN_N_CALPHA,
        LEN_CALPHA_CP,
        LEN_C_C,
        LEN_C_N,
        LEN_C_O,
        LEN_N_H,
        LEN_C_H,
        LEN_O_H,
        LEN_C_O_DOUBLE,
    ];

    let lens_hydrogen = vec![LEN_OH_OH, LEN_NH_OC, LEN_OH_OC];

    // todo: Paralllize?
    let mut result = Vec::new();

    // We use spacial partitioning, so as not to copmare every pair of atoms.
    let mut grid: HashMap<(i32, i32, i32), Vec<usize>> = HashMap::new();

    for (i, atom) in atoms.iter().enumerate() {
        let grid_pos = (
            (atom.posit.x / GRID_SIZE).floor() as i32,
            (atom.posit.y / GRID_SIZE).floor() as i32,
            (atom.posit.z / GRID_SIZE).floor() as i32,
        );
        grid.entry(grid_pos).or_default().push(i);
    }

    let neighbor_offsets = [
        (0, 0, 0),
        (1, 0, 0),
        (-1, 0, 0),
        (0, 1, 0),
        (0, -1, 0),
        (0, 0, 1),
        (0, 0, -1),
        (1, 1, 0),
        (-1, -1, 0),
        (1, 0, 1),
        (-1, 0, -1),
        (0, 1, 1),
        (0, -1, -1),
        (1, 1, 1),
        (-1, -1, -1),
    ];

    for (&cell, atom_indices) in &grid {
        for offset in &neighbor_offsets {
            let neighbor_cell = (cell.0 + offset.0, cell.1 + offset.1, cell.2 + offset.2);
            if let Some(neighbor_indices) = grid.get(&neighbor_cell) {
                for &i in atom_indices {
                    for &j in neighbor_indices {
                        if i >= j {
                            continue;
                        }

                        let atom_0 = &atoms[i];
                        let atom_1 = &atoms[j];
                        let dist = (atom_0.posit - atom_1.posit).magnitude();

                        for (lens, bond_type) in [
                            (&lens_covalent, BondType::Covalent),
                            (&lens_hydrogen, BondType::Hydrogen),
                        ] {
                            for &bond_len in lens {
                                if (dist - bond_len).abs() < BOND_LEN_THRESH {
                                    // let bond_count = if (dist - LEN_C_O_DOUBLE).abs() < BOND_LEN_THRESH_FINE {
                                    let bond_count =
                                        if (dist - LEN_C_O_DOUBLE).abs() < BOND_LEN_THRESH {
                                            BondCount::Double
                                        } else {
                                            BondCount::Single
                                        };

                                    result.push(Bond {
                                        bond_type,
                                        bond_count,
                                        posit_0: atom_0.posit,
                                        posit_1: atom_1.posit,
                                        is_backbone: atom_0.is_backbone() && atom_1.is_backbone(),
                                    });
                                    break;
                                }
                            }
                        }
                    }
                }
            }
        }
    }

    result
}

Example render:

Image

douweschulte added a commit that referenced this issue Feb 6, 2025
@David-OConnor
Copy link
Author

Sorry to bomboard you on this one. After more thought, I wonder if maybe this is out of scope for this lib after all. I.e, this lib is maybe more tightly focused on translating info in scope of PDB and CIF files only? This sort of bond inference is required for any sort of viewer, but maybe it's too much extrapolation from the file contents to fit here?

Of note, I've made a number of updates to the algo since, including more precise bond lengths, using intermediate data structures etc, enforcing the correct atom type, single/hybrid/double bonds etc.

@douweschulte
Copy link
Owner

It might be somewhat out of the direct scope of the library, but I think it is still basal enough to be used by enough people that it might live in here. I would be happy to update the code based on your new implementation.

Defining scope is always really hard, in this case it might not be something that has to be done in the library because it is information inferred from the file more than information already present in the file. But in this case I would argue that it is a hard enough problem that it is useful to provide a single implementation that we all can build upon, and we all can test in wild. Hopefully this results in less bugs than everyone that needs it doing it on their own.

@David-OConnor
Copy link
Author

David-OConnor commented Feb 10, 2025

I'll post the new code once I've fixed some specific problems. I think the bond situation is interesting in that it's in the file spec, but in the wild, largely (AFAIK) unused.

@David-OConnor
Copy link
Author

Here's the latest. More robust, and works for most cases I've tested with:

//! This module creates bonds between protein components. Most macromolecule PDB/CIF files don't include
//! explicit bond information, and the `pdbtbx` library doesn't handle this. Infer bond lengths
//! by comparing each interactomic bond distance, and matching against known amino acid bond lengths.
//!
//! Some info here: https://www.ruppweb.org/Xray/tutorial/protein_structure.htm
//! https://itp.uni-frankfurt.de/~engel/amino.html
//!
//! All lengths are in angstrom.

use std::collections::{HashMap, HashSet};

use crate::{
    molecule::{
        Atom, Bond,
        BondCount::{self, *},
        BondType::{self, *},
    },
    Element,
    Element::{Carbon, Hydrogen, Nitrogen, Oxygen, Sulfur},
};

struct BondSpecs {
    len: f64,
    elements: (Element, Element),
    count: BondCount,
    bond_type: BondType,
}

impl BondSpecs {
    pub fn new(
        len: f64,
        elements: (Element, Element),
        count: BondCount,
        bond_type: BondType,
    ) -> Self {
        Self {
            len,
            elements,
            count,
            bond_type,
        }
    }
}

// If interatomic distance is within this distance of one of our known bond lenghts, consider it to be a bond.
const BOND_LEN_THRESH: f64 = 0.04; // todo: Adjust A/R based on performance.
const GRID_SIZE: f64 = 1.7; // Slightly larger than the largest bond distancen + thresh.

#[rustfmt::skip]
fn get_specs() -> Vec<BondSpecs> {
    vec![
        // --------------------
        // Carbon–Carbon Bonds
        // --------------------

        // C–C single bond
        // The most frequently encountered bond length for saturated, sp³-hybridized carbons (e.g., in alkanes).
        BondSpecs::new(1.54, (Carbon, Carbon), Single, Covalent),

        // Cα–C′: ~1.50 - 1.52 Å
        BondSpecs::new(1.51, (Carbon, Carbon), Single, Covalent),

        // C–C sp²–sp³ single bond, e.g. connecting Phe's ring to the rest of the atom.
        BondSpecs::new(1.50, (Carbon, Carbon), SingleDoubleHybrid, Covalent),

        // Workaround for Phe's ring in some cases.
        BondSpecs::new(1.47, (Carbon, Carbon), SingleDoubleHybrid, Covalent),
        BondSpecs::new(1.44, (Carbon, Carbon), SingleDoubleHybrid, Covalent),
        BondSpecs::new(1.41, (Carbon, Carbon), SingleDoubleHybrid, Covalent),

        // C-C phenyl (aromatic) ring bond, or benzene ring.
        // Found in alkynes, where carbons are sp-hybridized (linear). ~1.37-1.40 Å
        BondSpecs::new(1.39, (Carbon, Carbon), SingleDoubleHybrid, Covalent),

        // C-C Seems to be required for one fo the Trp rings?
        BondSpecs::new(1.36, (Carbon, Carbon), SingleDoubleHybrid, Covalent),

        // C=C double bond
        // Common in alkenes (sp²-hybridized). Range: ~1.33–1.34 Å
        BondSpecs::new(1.33, (Carbon, Carbon), Double, Covalent),

        // C≡C triple bond
        // Found in alkynes, where carbons are sp-hybridized (linear). ~1.20 Å
        BondSpecs::new(1.20, (Carbon, Carbon), Triple, Covalent),

        // --------------------
        // Carbon–Nitrogen Bonds
        // --------------------

        // C–N single bond
        // Typical for amines or alkyl–amine bonds. ~1.45-1.47 Å
        // Also covers Amide Nitrogen to C-alpha bond in protein backbones.
        BondSpecs::new(1.46, (Carbon, Nitrogen), Single, Covalent),

        // C-N Indole N in 5-member aromatic ring, e.g. Trp. 1.36-1.39
        BondSpecs::new(1.37, (Carbon, Nitrogen), SingleDoubleHybrid, Covalent),

        // C-N (amide). Partial double-bond character due to resonance in the amide.
        BondSpecs::new(1.33, (Carbon, Nitrogen), SingleDoubleHybrid, Covalent),

        // C=N double bond
        // Typical for imines (Schiff bases). ~1.28 Å
        BondSpecs::new(1.28, (Carbon, Nitrogen), Double, Covalent),

        // C≡N triple bond
        // Typical of nitriles (–C≡N). ~1.16 Å
        BondSpecs::new(1.16, (Carbon, Nitrogen), Triple, Covalent),
        // NOTE:
        // In proteins, the amide (peptide) bond between C=O and N has partial double-bond character,
        // and the C–N bond length in an amide is around 1.32–1.33 Å.

        // --------------------
        // Carbon–Oxygen Bonds
        // --------------------

        // C–O single bond
        // Found in alcohols, ethers (sp³–O). ~1.43 Å
        BondSpecs::new(1.43, (Carbon, Oxygen), Single, Covalent),

        // C(phenyl)–O. Phenolic C–O bond often shorter than a typical aliphatic C–O. 1.36-1.38 Å
        BondSpecs::new(1.37, (Carbon, Oxygen), Single, Covalent),

        // C′–O (in –COO⁻). 1.25-1.27 Å
        BondSpecs::new(1.26, (Carbon, Oxygen), Single, Covalent),

        // C=O double bond
        // Typical for carbonyl groups (aldehydes, ketones, carboxylic acids, amides). ~1.21–1.23 Å
        BondSpecs::new(1.22, (Carbon, Oxygen), Double, Covalent),

        // --------------------
        // Carbon–Hydrogen Bonds
        // --------------------
        // todo: Expand this section.

        BondSpecs::new(1.09, (Carbon, Hydrogen), Single, Covalent),

        // 1.01–1.02 Å
        BondSpecs::new(1.01, (Nitrogen, Hydrogen), Single, Covalent),

        // 0.96 – 0.98 Å
        BondSpecs::new(1.01, (Oxygen, Hydrogen), Single, Covalent),


        // Non-protein-backbond bond lengths.

        // 1.34 - 1.35. Example: Cys.
        BondSpecs::new(1.34, (Sulfur, Hydrogen), Single, Covalent),

        // 1.81 - 1.82. Example: Cys.
        BondSpecs::new(1.81, (Sulfur, Carbon), Single, Covalent),
    ]
}

/// This is the business logic of evaluating bond lengths. For a single atom pair.
fn eval_lens(bonds: &mut Vec<Bond>, atoms: &[Atom], i: usize, j: usize, specs: &[BondSpecs]) {
    let atom_0 = &atoms[i];
    let atom_1 = &atoms[j];

    let dist = (atom_0.posit - atom_1.posit).magnitude();

    for spec in specs {
        // This directionality ensures only one bond per atom pair. Otherwise, we'd add two identical
        // ones with swapped atom positions.
        // todo: This only prevents duplicate bonds if the elements are different.
        if !(atom_0.element == spec.elements.0 && atom_1.element == spec.elements.1) {
            continue;
        }

        if (dist - spec.len).abs() < BOND_LEN_THRESH {
            bonds.push(Bond {
                bond_type: spec.bond_type,
                bond_count: spec.count,
                atom_0: i,
                atom_1: j,
                is_backbone: atom_0.is_backbone() && atom_1.is_backbone(),
            });
            break;
        }
    }
}

/// Infer bonds from atom distances. Uses spacial partitioning for efficiency.
/// We Check pairs only within nearby bins.
pub fn create_bonds(atoms: &[Atom]) -> Vec<Bond> {
    // todo: Paralllize?

    let mut result = Vec::new();

    let specs = get_specs();

    // We use spacial partitioning, so as not to copmare every pair of atoms.
    let mut grid: HashMap<(i32, i32, i32), Vec<usize>> = HashMap::new();

    for (i, atom) in atoms.iter().enumerate() {
        let grid_pos = (
            (atom.posit.x / GRID_SIZE).floor() as i32,
            (atom.posit.y / GRID_SIZE).floor() as i32,
            (atom.posit.z / GRID_SIZE).floor() as i32,
        );
        grid.entry(grid_pos).or_default().push(i);
    }

    for (&cell, atom_indices) in &grid {
        for dx in -1..=1 {
            for dy in -1..=1 {
                for dz in -1..=1 {
                    let neighbor_cell = (cell.0 + dx, cell.1 + dy, cell.2 + dz);
                    if let Some(neighbor_indices) = grid.get(&neighbor_cell) {
                        for &i in atom_indices {
                            for &j in neighbor_indices {
                                if i == j {
                                    continue;
                                }
                                eval_lens(&mut result, atoms, i, j, &specs);
                            }
                        }
                    }
                }
            }
        }
    }

    // Remove duplicates, which will only occur in the case of same-element bonds.
    // Retain only the *first* occurrence of each unordered bond pair
    let mut seen = HashSet::new();
    result.retain(|bond| {
        // Sort the pair so that (atom_0, atom_1) and (atom_1, atom_0) are treated as the same key
        let canonical_pair = if bond.atom_0 <= bond.atom_1 {
            (bond.atom_0, bond.atom_1)
        } else {
            (bond.atom_1, bond.atom_0)
        };
        seen.insert(canonical_pair)
    });

    result
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants