-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bonds()
showing empty
#137
Comments
Update: Brief code review: |
Hey - here's working code to get the bonds, using bond len. Standard practice from what I gather. I'm happy if you want to copy+paste this into pdbtbx. I think it may be faster that way than a PR for both of us. I'm happy with it in my application code, or integrated into PDBTBX. This is rough and likely has low-hanging fruit to improve upon, but it gets the job done to first order. //! This module creates bonds between protein components. Most macromolecule PDB/CIF files don't include
//! explicit bond information, and the `pdbtbx` library doesn't handle this. Infer bond lengths
//! by comparing each interactomic bond distance, and matching against known amino acid bond lengths.
//!
//! Some info here: https://www.ruppweb.org/Xray/tutorial/protein_structure.htm
//! https://itp.uni-frankfurt.de/~engel/amino.html
//! These lengths are in angstrom.
use std::{collections::HashMap, time::Instant};
use rayon::prelude::*;
use crate::{Atom, Bond, BondType};
// Peptide
// Double bond len of C' to N.
const LEN_CP_N: f64 = 1.33;
const LEN_N_CALPHA: f64 = 1.46;
const LEN_CALPHA_CP: f64 = 1.53;
// Single bond
const LEN_C_C: f64 = 1.54;
const LEN_C_N: f64 = 1.48;
const LEN_C_O: f64 = 1.43;
// Hydrogen
const LEN_OH_OH: f64 = 2.8;
const LEN_NH_OC: f64 = 2.9;
const LEN_OH_OC: f64 = 2.8;
// Bonds to H. Mostly ~1
const LEN_N_H: f64 = 1.00;
const LEN_C_H: f64 = 1.10;
const LEN_O_H: f64 = 1.0; // In water molecules. What is it in proteins?
// If interatomic distance is within this distance of one of our known bond lenghts, consider it to be a bond.
const BOND_LEN_THRESH: f64 = 0.04; // todo: Adjust A/R based on performance.
const GRID_SIZE: f64 = 3.0; // Slightly larger than the largest bond threshold
/// Infer bonds from atom distances. Uses spacial partitioning for efficiency.
pub fn create_bonds(atoms: &[Atom]) -> Vec<Bond> {
let lens_covalent = vec![
LEN_CP_N,
LEN_N_CALPHA,
LEN_CALPHA_CP,
LEN_C_C,
LEN_C_N,
LEN_C_O,
LEN_N_H,
LEN_C_H,
LEN_O_H,
];
let lens_hydrogen = vec![LEN_OH_OH, LEN_NH_OC, LEN_OH_OC];
// todo: Paralllize?
let mut result = Vec::new();
// We use spacial partitioning, so as not to copmare every pair of atoms.
let mut grid: HashMap<(i32, i32, i32), Vec<usize>> = HashMap::new();
// Insert atoms into grid
for (i, atom) in atoms.iter().enumerate() {
let grid_pos = (
(atom.posit.x / GRID_SIZE).floor() as i32,
(atom.posit.y / GRID_SIZE).floor() as i32,
(atom.posit.z / GRID_SIZE).floor() as i32,
);
grid.entry(grid_pos).or_default().push(i);
}
// Check pairs only within nearby bins
let neighbor_offsets = [
(0, 0, 0),
(1, 0, 0),
(-1, 0, 0),
(0, 1, 0),
(0, -1, 0),
(0, 0, 1),
(0, 0, -1),
(1, 1, 0),
(-1, -1, 0),
(1, 0, 1),
(-1, 0, -1),
(0, 1, 1),
(0, -1, -1),
(1, 1, 1),
(-1, -1, -1),
];
for (&cell, atom_indices) in &grid {
for offset in &neighbor_offsets {
let neighbor_cell = (cell.0 + offset.0, cell.1 + offset.1, cell.2 + offset.2);
if let Some(neighbor_indices) = grid.get(&neighbor_cell) {
for &i in atom_indices {
for &j in neighbor_indices {
if i >= j {
continue;
}
let atom_0 = &atoms[i];
let atom_1 = &atoms[j];
let dist = (atom_0.posit - atom_1.posit).magnitude();
for (lens, bond_type) in [
(&lens_covalent, BondType::Covalent),
(&lens_hydrogen, BondType::Hydrogen),
] {
for &bond_len in lens {
if (dist - bond_len).abs() < BOND_LEN_THRESH {
result.push(Bond {
bond_type,
posit_0: atom_0.posit,
posit_1: atom_1.posit,
});
break;
}
}
}
}
}
}
}
}
result
} |
As you have seen I only populated the bonds if these are specifically specified in the file. That is the reason why they are not populated in many cases. The length based detection looks nice, I will look a bit closer when I add it in. |
I copied the code in and edited it a bit to comply more with the used patterns. Mostly I removed the atom chunking in favour of usage of the build in rstar rtree which was intended for such work. Let me know if this captures what you needed or if you want to see some more changes. Also I am not entirely sure how it handles multiple bonds. It does not seem to do any checks on if each atom does not overstate its correct number of bonds. Do you think some logic to enforce this could be needed? It might be a good thing if we could run this code for all test pdbs in a test somewhere, but then I might need a good test to check if all of the bonds are actually sensible. Do you have some knowledge to know how to do that properly? |
Great stuff; thank you! Some notes from my own experiments using slightly modified code above:
This solves the issue; ty! So, what I posted is "good enough to start", but definitely needs some tweaking to make it work in all cases. Slightly updated example that includes multiple bonds. (But generally broken in that regard): //! This module creates bonds between protein components. Most macromolecule PDB/CIF files don't include
//! explicit bond information, and the `pdbtbx` library doesn't handle this. Infer bond lengths
//! by comparing each interactomic bond distance, and matching against known amino acid bond lengths.
//!
//! Some info here: https://www.ruppweb.org/Xray/tutorial/protein_structure.htm
//! https://itp.uni-frankfurt.de/~engel/amino.html
//! These lengths are in angstrom.
use std::collections::HashMap;
use crate::molecule::{Atom, Bond, BondCount, BondType};
// Peptide
// Double bond len of C' to N.
const LEN_CP_N: f64 = 1.33;
const LEN_N_CALPHA: f64 = 1.46;
const LEN_CALPHA_CP: f64 = 1.53;
// Single bond
const LEN_C_C: f64 = 1.54;
const LEN_C_N: f64 = 1.48;
const LEN_C_O: f64 = 1.43;
// todo: Found this elsewhere. Likely conflict with C_O above?
const LEN_C_O_DOUBLE: f64 = 1.23;
// todo: MOre sidechain bonds, like carbon-carbon.
// Hydrogen
const LEN_OH_OH: f64 = 2.8;
const LEN_NH_OC: f64 = 2.9;
const LEN_OH_OC: f64 = 2.8;
// Bonds to H. Mostly ~1
const LEN_N_H: f64 = 1.00;
const LEN_C_H: f64 = 1.10;
const LEN_O_H: f64 = 1.0;
// If interatomic distance is within this distance of one of our known bond lenghts, consider it to be a bond.
const BOND_LEN_THRESH: f64 = 0.05; // todo: Adjust A/R based on performance.
const BOND_LEN_THRESH_FINE: f64 = 0.01; // todo: Adjust A/R based on performance.
const GRID_SIZE: f64 = 3.0; // Slightly larger than the largest bond threshold
/// Infer bonds from atom distances. Uses spacial partitioning for efficiency.
/// We Check pairs only within nearby bins
pub fn create_bonds(atoms: &[Atom]) -> Vec<Bond> {
let lens_covalent = vec![
LEN_CP_N,
LEN_N_CALPHA,
LEN_CALPHA_CP,
LEN_C_C,
LEN_C_N,
LEN_C_O,
LEN_N_H,
LEN_C_H,
LEN_O_H,
LEN_C_O_DOUBLE,
];
let lens_hydrogen = vec![LEN_OH_OH, LEN_NH_OC, LEN_OH_OC];
// todo: Paralllize?
let mut result = Vec::new();
// We use spacial partitioning, so as not to copmare every pair of atoms.
let mut grid: HashMap<(i32, i32, i32), Vec<usize>> = HashMap::new();
for (i, atom) in atoms.iter().enumerate() {
let grid_pos = (
(atom.posit.x / GRID_SIZE).floor() as i32,
(atom.posit.y / GRID_SIZE).floor() as i32,
(atom.posit.z / GRID_SIZE).floor() as i32,
);
grid.entry(grid_pos).or_default().push(i);
}
let neighbor_offsets = [
(0, 0, 0),
(1, 0, 0),
(-1, 0, 0),
(0, 1, 0),
(0, -1, 0),
(0, 0, 1),
(0, 0, -1),
(1, 1, 0),
(-1, -1, 0),
(1, 0, 1),
(-1, 0, -1),
(0, 1, 1),
(0, -1, -1),
(1, 1, 1),
(-1, -1, -1),
];
for (&cell, atom_indices) in &grid {
for offset in &neighbor_offsets {
let neighbor_cell = (cell.0 + offset.0, cell.1 + offset.1, cell.2 + offset.2);
if let Some(neighbor_indices) = grid.get(&neighbor_cell) {
for &i in atom_indices {
for &j in neighbor_indices {
if i >= j {
continue;
}
let atom_0 = &atoms[i];
let atom_1 = &atoms[j];
let dist = (atom_0.posit - atom_1.posit).magnitude();
for (lens, bond_type) in [
(&lens_covalent, BondType::Covalent),
(&lens_hydrogen, BondType::Hydrogen),
] {
for &bond_len in lens {
if (dist - bond_len).abs() < BOND_LEN_THRESH {
// let bond_count = if (dist - LEN_C_O_DOUBLE).abs() < BOND_LEN_THRESH_FINE {
let bond_count =
if (dist - LEN_C_O_DOUBLE).abs() < BOND_LEN_THRESH {
BondCount::Double
} else {
BondCount::Single
};
result.push(Bond {
bond_type,
bond_count,
posit_0: atom_0.posit,
posit_1: atom_1.posit,
is_backbone: atom_0.is_backbone() && atom_1.is_backbone(),
});
break;
}
}
}
}
}
}
}
}
result
} Example render: |
Sorry to bomboard you on this one. After more thought, I wonder if maybe this is out of scope for this lib after all. I.e, this lib is maybe more tightly focused on translating info in scope of PDB and CIF files only? This sort of bond inference is required for any sort of viewer, but maybe it's too much extrapolation from the file contents to fit here? Of note, I've made a number of updates to the algo since, including more precise bond lengths, using intermediate data structures etc, enforcing the correct atom type, single/hybrid/double bonds etc. |
It might be somewhat out of the direct scope of the library, but I think it is still basal enough to be used by enough people that it might live in here. I would be happy to update the code based on your new implementation. Defining scope is always really hard, in this case it might not be something that has to be done in the library because it is information inferred from the file more than information already present in the file. But in this case I would argue that it is a hard enough problem that it is useful to provide a single implementation that we all can build upon, and we all can test in wild. Hopefully this results in less bugs than everyone that needs it doing it on their own. |
I'll post the new code once I've fixed some specific problems. I think the bond situation is interesting in that it's in the file spec, but in the wild, largely (AFAIK) unused. |
Here's the latest. More robust, and works for most cases I've tested with: //! This module creates bonds between protein components. Most macromolecule PDB/CIF files don't include
//! explicit bond information, and the `pdbtbx` library doesn't handle this. Infer bond lengths
//! by comparing each interactomic bond distance, and matching against known amino acid bond lengths.
//!
//! Some info here: https://www.ruppweb.org/Xray/tutorial/protein_structure.htm
//! https://itp.uni-frankfurt.de/~engel/amino.html
//!
//! All lengths are in angstrom.
use std::collections::{HashMap, HashSet};
use crate::{
molecule::{
Atom, Bond,
BondCount::{self, *},
BondType::{self, *},
},
Element,
Element::{Carbon, Hydrogen, Nitrogen, Oxygen, Sulfur},
};
struct BondSpecs {
len: f64,
elements: (Element, Element),
count: BondCount,
bond_type: BondType,
}
impl BondSpecs {
pub fn new(
len: f64,
elements: (Element, Element),
count: BondCount,
bond_type: BondType,
) -> Self {
Self {
len,
elements,
count,
bond_type,
}
}
}
// If interatomic distance is within this distance of one of our known bond lenghts, consider it to be a bond.
const BOND_LEN_THRESH: f64 = 0.04; // todo: Adjust A/R based on performance.
const GRID_SIZE: f64 = 1.7; // Slightly larger than the largest bond distancen + thresh.
#[rustfmt::skip]
fn get_specs() -> Vec<BondSpecs> {
vec![
// --------------------
// Carbon–Carbon Bonds
// --------------------
// C–C single bond
// The most frequently encountered bond length for saturated, sp³-hybridized carbons (e.g., in alkanes).
BondSpecs::new(1.54, (Carbon, Carbon), Single, Covalent),
// Cα–C′: ~1.50 - 1.52 Å
BondSpecs::new(1.51, (Carbon, Carbon), Single, Covalent),
// C–C sp²–sp³ single bond, e.g. connecting Phe's ring to the rest of the atom.
BondSpecs::new(1.50, (Carbon, Carbon), SingleDoubleHybrid, Covalent),
// Workaround for Phe's ring in some cases.
BondSpecs::new(1.47, (Carbon, Carbon), SingleDoubleHybrid, Covalent),
BondSpecs::new(1.44, (Carbon, Carbon), SingleDoubleHybrid, Covalent),
BondSpecs::new(1.41, (Carbon, Carbon), SingleDoubleHybrid, Covalent),
// C-C phenyl (aromatic) ring bond, or benzene ring.
// Found in alkynes, where carbons are sp-hybridized (linear). ~1.37-1.40 Å
BondSpecs::new(1.39, (Carbon, Carbon), SingleDoubleHybrid, Covalent),
// C-C Seems to be required for one fo the Trp rings?
BondSpecs::new(1.36, (Carbon, Carbon), SingleDoubleHybrid, Covalent),
// C=C double bond
// Common in alkenes (sp²-hybridized). Range: ~1.33–1.34 Å
BondSpecs::new(1.33, (Carbon, Carbon), Double, Covalent),
// C≡C triple bond
// Found in alkynes, where carbons are sp-hybridized (linear). ~1.20 Å
BondSpecs::new(1.20, (Carbon, Carbon), Triple, Covalent),
// --------------------
// Carbon–Nitrogen Bonds
// --------------------
// C–N single bond
// Typical for amines or alkyl–amine bonds. ~1.45-1.47 Å
// Also covers Amide Nitrogen to C-alpha bond in protein backbones.
BondSpecs::new(1.46, (Carbon, Nitrogen), Single, Covalent),
// C-N Indole N in 5-member aromatic ring, e.g. Trp. 1.36-1.39
BondSpecs::new(1.37, (Carbon, Nitrogen), SingleDoubleHybrid, Covalent),
// C-N (amide). Partial double-bond character due to resonance in the amide.
BondSpecs::new(1.33, (Carbon, Nitrogen), SingleDoubleHybrid, Covalent),
// C=N double bond
// Typical for imines (Schiff bases). ~1.28 Å
BondSpecs::new(1.28, (Carbon, Nitrogen), Double, Covalent),
// C≡N triple bond
// Typical of nitriles (–C≡N). ~1.16 Å
BondSpecs::new(1.16, (Carbon, Nitrogen), Triple, Covalent),
// NOTE:
// In proteins, the amide (peptide) bond between C=O and N has partial double-bond character,
// and the C–N bond length in an amide is around 1.32–1.33 Å.
// --------------------
// Carbon–Oxygen Bonds
// --------------------
// C–O single bond
// Found in alcohols, ethers (sp³–O). ~1.43 Å
BondSpecs::new(1.43, (Carbon, Oxygen), Single, Covalent),
// C(phenyl)–O. Phenolic C–O bond often shorter than a typical aliphatic C–O. 1.36-1.38 Å
BondSpecs::new(1.37, (Carbon, Oxygen), Single, Covalent),
// C′–O (in –COO⁻). 1.25-1.27 Å
BondSpecs::new(1.26, (Carbon, Oxygen), Single, Covalent),
// C=O double bond
// Typical for carbonyl groups (aldehydes, ketones, carboxylic acids, amides). ~1.21–1.23 Å
BondSpecs::new(1.22, (Carbon, Oxygen), Double, Covalent),
// --------------------
// Carbon–Hydrogen Bonds
// --------------------
// todo: Expand this section.
BondSpecs::new(1.09, (Carbon, Hydrogen), Single, Covalent),
// 1.01–1.02 Å
BondSpecs::new(1.01, (Nitrogen, Hydrogen), Single, Covalent),
// 0.96 – 0.98 Å
BondSpecs::new(1.01, (Oxygen, Hydrogen), Single, Covalent),
// Non-protein-backbond bond lengths.
// 1.34 - 1.35. Example: Cys.
BondSpecs::new(1.34, (Sulfur, Hydrogen), Single, Covalent),
// 1.81 - 1.82. Example: Cys.
BondSpecs::new(1.81, (Sulfur, Carbon), Single, Covalent),
]
}
/// This is the business logic of evaluating bond lengths. For a single atom pair.
fn eval_lens(bonds: &mut Vec<Bond>, atoms: &[Atom], i: usize, j: usize, specs: &[BondSpecs]) {
let atom_0 = &atoms[i];
let atom_1 = &atoms[j];
let dist = (atom_0.posit - atom_1.posit).magnitude();
for spec in specs {
// This directionality ensures only one bond per atom pair. Otherwise, we'd add two identical
// ones with swapped atom positions.
// todo: This only prevents duplicate bonds if the elements are different.
if !(atom_0.element == spec.elements.0 && atom_1.element == spec.elements.1) {
continue;
}
if (dist - spec.len).abs() < BOND_LEN_THRESH {
bonds.push(Bond {
bond_type: spec.bond_type,
bond_count: spec.count,
atom_0: i,
atom_1: j,
is_backbone: atom_0.is_backbone() && atom_1.is_backbone(),
});
break;
}
}
}
/// Infer bonds from atom distances. Uses spacial partitioning for efficiency.
/// We Check pairs only within nearby bins.
pub fn create_bonds(atoms: &[Atom]) -> Vec<Bond> {
// todo: Paralllize?
let mut result = Vec::new();
let specs = get_specs();
// We use spacial partitioning, so as not to copmare every pair of atoms.
let mut grid: HashMap<(i32, i32, i32), Vec<usize>> = HashMap::new();
for (i, atom) in atoms.iter().enumerate() {
let grid_pos = (
(atom.posit.x / GRID_SIZE).floor() as i32,
(atom.posit.y / GRID_SIZE).floor() as i32,
(atom.posit.z / GRID_SIZE).floor() as i32,
);
grid.entry(grid_pos).or_default().push(i);
}
for (&cell, atom_indices) in &grid {
for dx in -1..=1 {
for dy in -1..=1 {
for dz in -1..=1 {
let neighbor_cell = (cell.0 + dx, cell.1 + dy, cell.2 + dz);
if let Some(neighbor_indices) = grid.get(&neighbor_cell) {
for &i in atom_indices {
for &j in neighbor_indices {
if i == j {
continue;
}
eval_lens(&mut result, atoms, i, j, &specs);
}
}
}
}
}
}
}
// Remove duplicates, which will only occur in the case of same-element bonds.
// Retain only the *first* occurrence of each unordered bond pair
let mut seen = HashSet::new();
result.retain(|bond| {
// Sort the pair so that (atom_0, atom_1) and (atom_1, atom_0) are treated as the same key
let canonical_pair = if bond.atom_0 <= bond.atom_1 {
(bond.atom_0, bond.atom_1)
} else {
(bond.atom_1, bond.atom_0)
};
seen.insert(canonical_pair)
});
result
} |
On all the molecules I've loaded, e.g. from the Protein Data Bank. mmCIF, and PDB format. Atoms load correctly. Any idea? Ty!
The text was updated successfully, but these errors were encountered: