Understanding Comparison Results - Difference Types

This document explains all possible types of differences that can appear in the comparison results CSV file generated by ccd_sync.py. Each row in the output represents a comparison between a CCD file from Set 1 (wwPDB CCD) and Set 2 (CCP4 Monomer Library).

Output Format

The comparison results CSV contains the following columns:

  • ccd_code: The CCD identifier (e.g., "000", "A1A15")
  • name_identical: Y/N - Whether the chemical component name matches
  • type_identical: Y/N - Whether the component type/group matches
  • atom_identical: Y/N - Whether all atoms match
  • bond_identical: Y/N - Whether all bonds match
  • descriptor_identical: Y/N - Whether all descriptors match
  • overall_identical: Y/N - Whether ALL fields match (all must be Y for overall to be Y)
  • wwpdb_modified_date: Last modification date from Set 1 (wwPDB CCD)
  • ccp4_modified_date: Last commit date from Set 2 (CCP4 Monomer Library)
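A minimal sketch of loading this file with Python's standard csv module, assuming the file is comma-delimited and has a header row using the column names listed above. The inline sample rows are illustrative only (their flags follow examples discussed in this document), and load_rows is a hypothetical helper, not part of ccd_sync.py.

```python
import csv
import io

# Illustrative sample; a real run would open the generated CSV file instead.
SAMPLE_CSV = """ccd_code,name_identical,type_identical,atom_identical,bond_identical,descriptor_identical,overall_identical
000,Y,Y,Y,Y,Y,Y
040,Y,Y,N,N,Y,N
GU0,N,N,Y,Y,Y,N
"""


def load_rows(handle):
    """Read comparison rows into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))


rows = load_rows(io.StringIO(SAMPLE_CSV))

# Codes of components where at least one field differs.
differing = [row["ccd_code"] for row in rows if row["overall_identical"] == "N"]
print(differing)
```

The csv module keeps every value as a string, so a code such as "000" keeps its leading zeros. Tools that infer column types may convert such codes to numbers unless told to read the column as text.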

Detailed Comparison CSV

The create_detailed_comparison.py script creates an enhanced version of the comparison CSV that includes the actual values for fields that differ. This detailed CSV includes additional columns:

  • set1__chem_comp.name, set2__chem_comp.name: The actual name values when names differ
  • set1__chem_comp.type, set2__chem_comp.group: The actual type/group values when types differ
  • set1_atoms, set2_atoms: Only the atoms that differ (formatted as "ATOM_ID(TYPE,CHARGE)")
  • set1_bonds, set2_bonds: Only the bonds that differ (formatted as "ATOM1-ATOM2(ORDER,AROMATIC)")
  • set1_descriptors, set2_descriptors: Only the descriptors that differ

Important features of the detailed comparison:

  • Only shows differences, not the entire set (makes it easier to see what actually differs)
  • Bond atom ordering is normalized (only true differences are shown)
  • Multi-line values are displayed on a single line
  • Bond orders are normalized for display (SINGLE/SING → SING, DOUBLE/DOUB → DOUB)
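The display formats quoted above can be sketched as follows. This is an illustration, not the code in create_detailed_comparison.py: the function names are hypothetical, and the "; " separator and the "aromatic"/"non-aromatic" labels are assumptions taken from the examples shown later in this document.

```python
def format_atom(atom_id, type_symbol, charge):
    """Render one atom as ATOM_ID(TYPE,CHARGE)."""
    return f"{atom_id}({type_symbol},{charge})"


def format_bond(atom1, atom2, order, aromatic):
    """Render one bond as ATOM1-ATOM2(ORDER,AROMATIC)."""
    label = "aromatic" if aromatic else "non-aromatic"
    return f"{atom1}-{atom2}({order},{label})"


def format_many(items):
    """Join already formatted items on a single line (separator assumed)."""
    return "; ".join(items)


atoms = format_many([format_atom("HO10", "H", "0"), format_atom("O10", "O", "0")])
bond = format_bond("HO10", "O10", "SING", False)
print(atoms)
print(bond)
```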

Understanding the Values

  • Y = Yes, the fields match (identical)
  • N = No, the fields differ (not identical)

The overall_identical column will be Y only if ALL of the following are Y: name, type, atom, bond, and descriptor. If any single field is N, then overall_identical will be N.
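The rule can be written as a short check. This is a sketch under the assumption that each row is a dict keyed by the column names above; overall_identical here is a hypothetical helper, not a function taken from ccd_sync.py.

```python
FIELD_COLUMNS = (
    "name_identical",
    "type_identical",
    "atom_identical",
    "bond_identical",
    "descriptor_identical",
)


def overall_identical(row):
    """Return "Y" only if every per-field flag is "Y", otherwise "N"."""
    return "Y" if all(row[column] == "Y" for column in FIELD_COLUMNS) else "N"


all_match = dict.fromkeys(FIELD_COLUMNS, "Y")
one_differs = dict(all_match, bond_identical="N")

print(overall_identical(all_match))    # Y
print(overall_identical(one_differs))  # N
```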

Types of Differences

This section explains each type of difference that can occur, regardless of what other differences might be present. A single component can have multiple types of differences simultaneously.

1. Name Differences

What this means: The chemical component name (_chem_comp.name) differs between the two sources.

How it's detected: The script normalizes multi-line values by removing newlines (they're formatting artifacts, not actual content) and normalizing case before comparison, so only real content differences are reported. Extra spaces within the name are preserved and will cause differences if present.
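A minimal sketch of this normalization. It assumes newlines are removed outright (not replaced by a space) and that case is folded to lowercase; normalize_name is a hypothetical name, and the script's actual implementation may differ in detail.

```python
def normalize_name(value):
    """Drop line breaks and fold case; spaces inside the name are kept."""
    return value.replace("\r", "").replace("\n", "").lower()


# A line break inside a name is a formatting artifact and is ignored.
wrapped = "2,3,6-tri-O-sulfonato-\nbeta-D-glucopyranose"
single_line = "2,3,6-TRI-O-SULFONATO-BETA-D-GLUCOPYRANOSE"
print(normalize_name(wrapped) == normalize_name(single_line))  # True

# Extra spaces are preserved, so they still count as a difference.
print(normalize_name("oxy-butanoic  acid") == normalize_name("oxy-butanoic acid"))  # False
```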

Example (CCD code: P00):

  • Set 1 (wwPDB): (2S)-2-azanyl-4-[(E)-[2-methyl-3-oxidanyl-5-(phosphonooxymethyl)pyridin-4-yl]methylideneamino]oxy-butanoic acid
  • Set 2 (CCP4): (2S)-2-azanyl-4-[(E)-[2-methyl-3-oxidanyl-5-(phosphonooxymethyl)pyridin-4-yl]methylideneamino]oxy-butanoic acid (Note: the CCP4 value contains extra spaces before "acid", which may not be visible as rendered here; this is a formatting difference only)

Another example (CCD code: GU0):

  • Set 1: "2,3,6-tri-O-sulfonato-beta-D-glucopyranose"
  • Set 2: "2,3,6-TRI-O-SULFONATO-ALPHA-L-GALACT" (Note: different stereochemistry and sugar type)

Common causes:

  • Different naming conventions
  • Spelling variations
  • Truncation or abbreviation differences
  • Different punctuation or formatting (if not normalized)
  • Updates to naming standards over time
  • One source uses a full name while the other uses an abbreviation

Note: Name differences often occur alongside other differences (type, atoms, bonds, etc.). Pure "name only" differences (where structure and all other fields match) are relatively rare in practice.


2. Type/Group Differences

What this means: The component type/group classification differs between the two sources.

How it's detected:

  • Set 1 uses _chem_comp.type
  • Set 2 uses _chem_comp.group

The two fields are meant to carry the same classification (e.g., "L-PEPTIDE LINKING", "D-PEPTIDE LINKING", "NON-POLYMER", etc.). A difference is reported when their values do not match.

Example (CCD code: 2A0):

  • Set 1 (wwPDB): _chem_comp.type = "peptide-like"
  • Set 2 (CCP4): _chem_comp.group = "NON-POLYMER"

Another example (CCD code: 060):

  • Set 1: "D-peptide linking"
  • Set 2: "peptide"

Common causes:

  • Different classification systems
  • Updates to type definitions
  • Missing or null values in one source
  • Reclassification without structural changes

Note: Type differences often occur alongside structural differences (atoms and bonds), as different classifications may reflect different structural representations.


3. Atom Differences

What this means: The atom definitions differ between the two sources.

How it's detected: Atoms are compared as complete sets, where each atom is defined by:

  • atom_id: The identifier for the atom (e.g., "C1", "N2")
  • type_symbol: The element symbol (e.g., "C", "N", "O")
  • charge: The formal charge on the atom

Since atoms are compared as sets, if one source has an extra atom or is missing an atom, this will show as N.
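A sketch of the set comparison, using only the atoms quoted for CCD code 040 in the example that follows. Charges are kept as strings because this document does not say how the script normalizes them; compare_atom_sets is a hypothetical helper.

```python
def compare_atom_sets(atoms1, atoms2):
    """Compare atoms as sets of (atom_id, type_symbol, charge) tuples.

    Returns the Y/N flag plus the atoms found only in each source.
    """
    set1, set2 = set(atoms1), set(atoms2)
    flag = "Y" if set1 == set2 else "N"
    return flag, set1 - set2, set2 - set1


set1_atoms = [("HO10", "H", "0"), ("O10", "O", "0")]
set2_atoms = [("O10", "O", "-1")]

flag, only_set1, only_set2 = compare_atom_sets(set1_atoms, set2_atoms)
print(flag)               # N
print(sorted(only_set1))  # [('HO10', 'H', '0'), ('O10', 'O', '0')]
print(sorted(only_set2))  # [('O10', 'O', '-1')]
```

Because the charge is part of each tuple, an atom with the same identifier but a different charge appears on both sides of the difference, which matches how O10 is reported for both sets.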

Example (CCD code: 040):

  • Set 1 (wwPDB): Has atoms HO10(H,0) and O10(O,0) (neutral hydroxyl group)
  • Set 2 (CCP4): Has atom O10(O,-1) (ionized, charge -1) and no hydrogen atom HO10

Another example (CCD code: 060):

  • Set 1: Has HXT(H,0), N(N,0), and OXT(O,0)
  • Set 2: Has H3(H,0), N(N,1), and OXT(O,-1) (Different atom identifiers and charges)

Another example (CCD code: 2J0):

  • Set 1: N1(N,0); N12(N,0); N2(N,0); N5(N,0); N8(N,0); N9(N,0); RU(RU,2)
  • Set 2: N1(N,1); N12(N,1); N2(N,1); N5(N,1); N8(N,1); N9(N,1); RU(RU,0.00) (All nitrogens have charge +1 in Set 2, ruthenium has charge 0.00 vs 2)

Common causes:

  • Missing atoms in one source
  • Extra atoms in one source
  • Different atom identifiers (e.g., HXT vs H3)
  • Different charge values
  • Different element types for the same position
  • Different protonation states (e.g., -OH vs -O⁻)

Note: Atom differences often occur alongside bond differences, as missing or extra atoms will also affect bond connectivity.


4. Bond Differences

What this means: The bond definitions differ between the two sources.

How it's detected: Bonds are compared as complete sets, where each bond is defined by:

  • atom_id_1: First atom in the bond
  • atom_id_2: Second atom in the bond
  • value_order/type: Bond order (SING/SINGLE, DOUB/DOUBLE, etc.)
  • pdbx_aromatic_flag/aromatic: Whether the bond is aromatic

Important normalization:

  • The script normalizes bond atom ordering before comparison. Bonds are treated as undirected, so "C-OXT" and "OXT-C" are considered the same bond.
  • Formatting differences like SING vs SINGLE or DOUB vs DOUBLE are automatically normalized and will not appear as differences.
  • Only bonds that are truly different (different atoms, different order, or different aromaticity) are reported as differences.
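The normalization rules above can be sketched like this. The alias table covers only the two mappings named in this document, other order values are passed through unchanged, and normalize_bond is a hypothetical name rather than the script's own function.

```python
# Only the aliases documented here; other values pass through unchanged.
ORDER_ALIASES = {"SINGLE": "SING", "DOUBLE": "DOUB"}


def normalize_bond(atom1, atom2, order, aromatic):
    """Return a canonical (atom, atom, order, aromatic) tuple for one bond."""
    first, second = sorted((atom1, atom2))  # bonds are undirected
    order = order.upper()
    order = ORDER_ALIASES.get(order, order)
    return (first, second, order, bool(aromatic))


# Same bond written two ways: no difference after normalization.
a = normalize_bond("C", "OXT", "SING", False)
b = normalize_bond("OXT", "C", "SINGLE", False)
print(a == b)  # True

# A different bond order is still a true difference.
c = normalize_bond("CAC", "CAD", "DOUB", True)
d = normalize_bond("CAC", "CAD", "SING", True)
print(c == d)  # False
```

Once every bond is in this canonical form, the two sources can be compared as plain sets of tuples.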

Example (CCD code: 040):

  • Set 1: Has bond HO10-O10(SING,non-aromatic) (hydrogen bonded to oxygen)
  • Set 2: Missing the HO10-O10 bond (because there's no HO10 atom)

Another example (CCD code: 090): Different bond orders for the same aromatic ring bonds:

  • Set 1: CAC-CAD(DOUB,aromatic); CAC-CAI(SING,aromatic); CAD-CAE(SING,aromatic); CAE-CAF(DOUB,aromatic); ...
  • Set 2: CAC-CAD(SING,aromatic); CAC-CAI(DOUB,aromatic); CAD-CAE(DOUB,aromatic); CAE-CAF(SING,aromatic); ...

This shows that the same aromatic ring bonds have different bond order assignments (single vs double) between the two sources, even though the atoms and aromaticity are the same. This is a common issue with aromatic ring representation where different sources assign different Kekulé structures to the same aromatic system.

Another example (CCD code: 2J0): Many bonds differ in aromaticity assignment:

  • Set 1: Bonds marked as aromatic
  • Set 2: Same bonds marked as non-aromatic

Another example (CCD code: 060):

  • Set 1: HXT-OXT(SING,non-aromatic) (bond between HXT and OXT)
  • Set 2: H3-N(SING,non-aromatic) (bond between H3 and N) (Different bonds due to different atom sets)

Common causes:

  • Different bond orders (single vs. double, etc.)
  • Missing bonds in one source
  • Extra bonds in one source
  • Different aromaticity assignments
  • Different Kekulé structure representations for aromatic rings

Note: Bond differences are the most common type of difference in the dataset (67.9% of all components). They often occur alongside atom differences.


5. Descriptor Differences

What this means: The descriptor definitions differ between the two sources.

How it's detected: Descriptors are compared as complete sets, where each descriptor is defined by:

  • type: Descriptor type (e.g., "SMILES", "InChI")
  • program: Program that generated the descriptor
  • program_version: Version of the program
  • descriptor: The actual descriptor string

Example (CCD code: 190):

  • Set 1: Has minimal descriptors: ( ): ; InChI(InChI 1.03): (empty InChI)
  • Set 2: Has full descriptors: InChI(InChI 1.03): InChI=1S/C31H37BrFNO5S/c1-20(2)15-27(40(37,38)2... (complete InChI string)

Another example (CCD code: 2J0):

  • Set 1: Has minimal descriptors
  • Set 2: Has extensive descriptors (InChI, SMILES, SMILES_CANONICAL, etc.)

Common causes:

  • Different descriptor programs or versions
  • Missing descriptors in one source
  • Extra descriptors in one source
  • Different descriptor formats (though content may be equivalent)
  • Different program versions generating the same descriptor type

Note: The script removes quotes from descriptor values before comparison, so quote differences should not cause issues. Descriptor differences are often just metadata differences and are generally less critical than structural differences.
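A sketch of descriptor comparison with quote removal. It assumes "removes quotes" means stripping one pair of matching quotes around the value; strip_quotes and descriptor_key are hypothetical helpers, and the InChI string is the standard InChI for water, used for illustration rather than taken from the dataset.

```python
def strip_quotes(value):
    """Remove one pair of matching surrounding quotes, if present."""
    value = value.strip()
    if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
        return value[1:-1]
    return value


def descriptor_key(dtype, program, program_version, descriptor):
    """Build the tuple under which one descriptor is compared."""
    return (dtype, program, program_version, strip_quotes(descriptor))


quoted = descriptor_key("InChI", "InChI", "1.03", '"InChI=1S/H2O/h1H2"')
bare = descriptor_key("InChI", "InChI", "1.03", "InChI=1S/H2O/h1H2")
other_version = descriptor_key("InChI", "InChI", "1.06", "InChI=1S/H2O/h1H2")

print(quoted == bare)         # True: quoting alone is not a difference
print(bare == other_version)  # False: program_version is part of the key
```

Since the program and its version are part of the key, two sources can hold the same descriptor string and still be reported as different, which is why many descriptor differences are metadata-only.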


Completely Identical Files

Pattern: Y,Y,Y,Y,Y,Y

All fields match perfectly between Set 1 and Set 2.

Example from your data:

ccd_code: 000
name_identical: Y
type_identical: Y
atom_identical: Y
bond_identical: Y
descriptor_identical: Y
overall_identical: Y

What this means: The chemical component definition is identical in both sources. The dates may differ (indicating when each source was last updated), but the actual chemical data matches.


Example Combinations

In practice, components often have multiple types of differences simultaneously. Here are some common examples:

Example 1: Atom and Bond Differences (Most Common)

CCD code: 040 - Pattern: Y,Y,N,N,Y,N

  • Name: Matches
  • Type: Matches
  • Atoms: Differ (Set 1 has HO10, Set 2 doesn't)
  • Bonds: Differ (Set 1 has HO10-O10 bond, Set 2 doesn't)
  • Descriptors: Match

This represents a difference in protonation state: Set 1 shows O10 as a neutral hydroxyl group (-OH), while Set 2 shows the same oxygen in its deprotonated, charged form (-O⁻).

Example 2: Type and Structural Differences

CCD code: 060 - Pattern: Y,N,N,N,Y,N

  • Name: Matches
  • Type: Differs (Set 1: "D-peptide linking", Set 2: "peptide")
  • Atoms: Differ (different atom identifiers and charges)
  • Bonds: Differ (different bonds due to different atom sets)
  • Descriptors: Match

Example 3: Name and Type Differences

CCD code: GU0 - Pattern: N,N,Y,Y,Y,N

  • Name: Differs (different stereochemistry and sugar type)
  • Type: Differs (Set 1: "D-saccharide, beta linking", Set 2: "pyranose")
  • Atoms: Match
  • Bonds: Match
  • Descriptors: Match

The naming and classification differ, but the actual chemical structure is the same.

Example 4: Multiple Field Differences

CCD code: 0H0 - Pattern: Y,N,N,N,N,N

  • Name: Matches
  • Type: Differs
  • Atoms: Differ
  • Bonds: Differ
  • Descriptors: Differ

In this case every field except the name differs, suggesting significant divergence between the two sources.

Example 5: Extensive Differences

CCD code: 2J0 - Pattern: N,Y,N,N,N,N

  • Name: Differs (Set 1 has name, Set 2 is empty)
  • Type: Matches
  • Atoms: Differ (different charges)
  • Bonds: Differ (different aromaticity assignments)
  • Descriptors: Differ (Set 2 has extensive descriptors, Set 1 has minimal)
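The pattern strings used in these examples can be decoded mechanically. The sketch below assumes the six flags appear in the order name, type, atom, bond, descriptor, overall, as in the examples above; differing_fields is a hypothetical helper.

```python
FIELDS = ("name", "type", "atom", "bond", "descriptor", "overall")


def differing_fields(pattern):
    """List the per-field names flagged "N" in a pattern such as "Y,N,N,N,Y,N"."""
    flags = pattern.split(",")
    if len(flags) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} flags, got {len(flags)}")
    # The last flag (overall) is derived from the others, so it is skipped.
    return [name for name, flag in zip(FIELDS[:-1], flags) if flag == "N"]


print(differing_fields("Y,Y,N,N,Y,N"))  # ['atom', 'bond']
print(differing_fields("N,N,Y,Y,Y,N"))  # ['name', 'type']
print(differing_fields("Y,Y,Y,Y,Y,Y"))  # []
```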

Statistics from Your Dataset

Based on comparison_results_20260108_121910.csv:

  • Total components compared: 1,000
  • Completely identical: Varies by dataset

Breakdown by difference type:

  • Name differences: ~4.8% of components
  • Type differences: ~10.6% of components
  • Atom differences: ~36.0% of components
  • Bond differences: ~67.9% of components (most common)
  • Descriptor differences: ~24.6% of components

Key observations:

  • Bond differences are the most common
  • Atom differences are the second most common
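  • Atom differences usually come with bond differences, because a missing or extra atom also changes connectivity. The reverse holds less often: with atom differences at ~36.0% and bond differences at ~67.9%, at most about half of the components with bond differences can also have atom differences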
  • Multiple types of differences often occur together
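Percentages of this kind can be recomputed from the CSV rows. The sketch below is not the code in analyze_comparison_results.py; it assumes rows are dicts keyed by the column names described earlier, and it runs on four illustrative rows (flags from the 000, 040, GU0 and 0H0 examples), so its output does not reproduce the dataset figures above.

```python
FIELD_COLUMNS = (
    "name_identical",
    "type_identical",
    "atom_identical",
    "bond_identical",
    "descriptor_identical",
)


def difference_rates(rows):
    """Percentage of rows flagged "N" for each per-field column."""
    total = len(rows)
    if total == 0:
        return {column: 0.0 for column in FIELD_COLUMNS}
    return {
        column: 100.0 * sum(1 for row in rows if row[column] == "N") / total
        for column in FIELD_COLUMNS
    }


def row_from_flags(flags):
    """Build a row dict from a comma-separated flag string (illustration only)."""
    return dict(zip(FIELD_COLUMNS, flags.split(",")))


rows = [row_from_flags(f) for f in ("Y,Y,Y,Y,Y", "Y,Y,N,N,Y", "N,N,Y,Y,Y", "Y,N,N,N,N")]
rates = difference_rates(rows)
print(rates["bond_identical"])  # 50.0
```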

Date Information

The dates in the output provide context for when differences might have been introduced:

  • wwpdb_modified_date: When the component was last modified in the wwPDB CCD source
  • ccp4_modified_date: When the component was last committed to the CCP4 Monomer Library on GitHub

Interpreting dates:

  • If wwpdb_modified_date is more recent than ccp4_modified_date, the wwPDB source may have more recent updates
  • If ccp4_modified_date is more recent, the CCP4 source may have more recent updates
  • Large date differences may indicate one source is outdated

Example:

ccd_code: 000
wwpdb_modified_date: 2025-08-29
ccp4_modified_date: 2022-04-05

This shows the wwPDB source was updated in 2025, while the CCP4 source hasn't been updated since 2022. Despite this, the component is identical (all Y values), suggesting the 2025 update didn't change the chemical data.
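The date comparison can be sketched with the standard datetime module, assuming both columns hold ISO dates (YYYY-MM-DD) as in the example above. newer_source is a hypothetical helper, and what counts as a "large" gap is left to the reader.

```python
from datetime import date


def newer_source(wwpdb_modified_date, ccp4_modified_date):
    """Return which source is newer and the gap between the dates in days."""
    wwpdb = date.fromisoformat(wwpdb_modified_date)
    ccp4 = date.fromisoformat(ccp4_modified_date)
    gap_days = abs((wwpdb - ccp4).days)
    if wwpdb > ccp4:
        newer = "wwPDB"
    elif ccp4 > wwpdb:
        newer = "CCP4"
    else:
        newer = "same"
    return newer, gap_days


newer, gap_days = newer_source("2025-08-29", "2022-04-05")
print(newer, gap_days)
```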


What to Do with Differences

  1. Review bond differences first - These are the most common and may indicate structural representation issues.

  2. Check atom differences - Often accompany bond differences and may indicate missing or extra atoms.

  3. Investigate type differences - May indicate classification system updates or errors.

  4. Examine name differences - Usually less critical but may indicate naming standard updates.

  5. Review descriptor differences - Often just metadata differences (program versions, etc.) and less critical.

  6. Use dates to prioritize - Components with large date differences may need more urgent attention.
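One way to turn the review order above into a work queue is to rank each row by the first field that differs. This is a sketch of one possible approach, not part of the scripts: review_rank is a hypothetical helper, rows are assumed to be dicts keyed by the CSV column names, and the sample flags follow the 000, GU0 and 040 examples.

```python
# Columns in the review order suggested above (steps 1 to 5).
REVIEW_ORDER = (
    "bond_identical",
    "atom_identical",
    "type_identical",
    "name_identical",
    "descriptor_identical",
)


def review_rank(row):
    """Position of the first differing field in REVIEW_ORDER (lower is reviewed first)."""
    for rank, column in enumerate(REVIEW_ORDER):
        if row[column] == "N":
            return rank
    return len(REVIEW_ORDER)  # nothing differs: review last


rows = [
    {"ccd_code": "000", "name_identical": "Y", "type_identical": "Y",
     "atom_identical": "Y", "bond_identical": "Y", "descriptor_identical": "Y"},
    {"ccd_code": "GU0", "name_identical": "N", "type_identical": "N",
     "atom_identical": "Y", "bond_identical": "Y", "descriptor_identical": "Y"},
    {"ccd_code": "040", "name_identical": "Y", "type_identical": "Y",
     "atom_identical": "N", "bond_identical": "N", "descriptor_identical": "Y"},
]

queue = [row["ccd_code"] for row in sorted(rows, key=review_rank)]
print(queue)  # ['040', 'GU0', '000']
```

The date gap between the two sources could be added as a secondary sort key to cover step 6.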


Notes

  • The comparison is case-insensitive (values are normalized to lowercase)
  • Bond order values are automatically mapped (SING ↔ SINGLE, DOUB ↔ DOUBLE)
  • Bond atom ordering is normalized: "C-OXT" and "OXT-C" are treated as the same bond
  • Only true bond differences are reported (bonds that differ only in atom ordering are not shown)
  • Multi-line values (like names) are normalized to single lines for display
  • Quotes are removed from descriptor values before comparison
  • Sets are compared, so order doesn't matter for atoms, bonds, or descriptors
  • Missing fields in one source will cause differences (N values)
  • Set 2 bond types are extracted from either the _chem_comp_bond.type or the _chem_comp_bond.value_order field

Related Documentation

  • See README_ccd_sync.md for information about running comparisons with ccd_sync.py
  • See README_analyze_comparison_results.md for analyzing these results statistically with analyze_comparison_results.py
  • See create_detailed_comparison.py for creating detailed comparison CSVs with actual difference values