This document explains all possible types of differences that can appear in the comparison results CSV file generated by ccd_sync.py. Each row in the output represents a comparison between a CCD file from Set 1 (wwPDB CCD) and Set 2 (CCP4 Monomer Library).
The comparison results CSV contains the following columns:
- ccd_code: The CCD identifier (e.g., "000", "A1A15")
- name_identical: Y/N - Whether the chemical component name matches
- type_identical: Y/N - Whether the component type/group matches
- atom_identical: Y/N - Whether all atoms match
- bond_identical: Y/N - Whether all bonds match
- descriptor_identical: Y/N - Whether all descriptors match
- overall_identical: Y/N - Whether ALL fields match (all must be Y for overall to be Y)
- wwpdb_modified_date: Last modification date from Set 1 (wwPDB CCD)
- ccp4_modified_date: Last commit date from Set 2 (CCP4 Monomer Library)
The create_detailed_comparison.py script creates an enhanced version of the comparison CSV that includes the actual values for fields that differ. This detailed CSV includes additional columns:
- set1__chem_comp.name, set2__chem_comp.name: The actual name values when names differ
- set1__chem_comp.type, set2__chem_comp.group: The actual type/group values when types differ
- set1_atoms, set2_atoms: Only the atoms that differ (formatted as "ATOM_ID(TYPE,CHARGE)")
- set1_bonds, set2_bonds: Only the bonds that differ (formatted as "ATOM1-ATOM2(ORDER,AROMATIC)")
- set1_descriptors, set2_descriptors: Only the descriptors that differ
Important features of the detailed comparison:
- Only shows differences, not the entire set (makes it easier to see what actually differs)
- Bond atom ordering is normalized (only true differences are shown)
- Multi-line values are displayed on a single line
- Bond orders are normalized for display (SINGLE/SING → SING, DOUBLE/DOUB → DOUB)
In each Y/N column:
- Y = Yes, the fields match (identical)
- N = No, the fields differ (not identical)
The overall_identical column will be Y only if ALL of the following are Y: name, type, atom, bond, and descriptor. If any single field is N, then overall_identical will be N.
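The rule above can be sketched in a few lines of Python. This is only an illustration of the logic, not the code used by ccd_sync.py; the helper name `overall_identical` and the `FLAG_COLUMNS` tuple are assumptions introduced here, while the column names come from the CSV description above.

```python
# Per-field flag columns from the comparison CSV.
FLAG_COLUMNS = (
    "name_identical",
    "type_identical",
    "atom_identical",
    "bond_identical",
    "descriptor_identical",
)


def overall_identical(row: dict) -> str:
    """Return "Y" only if every per-field flag in the row is "Y" (hypothetical helper)."""
    return "Y" if all(row.get(col) == "Y" for col in FLAG_COLUMNS) else "N"


# Flags reported for CCD code 040 later in this document: atoms and bonds differ.
row_040 = {
    "name_identical": "Y",
    "type_identical": "Y",
    "atom_identical": "N",
    "bond_identical": "N",
    "descriptor_identical": "Y",
}
print(overall_identical(row_040))  # N
```

A missing flag column is treated as a difference in this sketch, which mirrors the note that missing fields cause N values.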
This section explains each type of difference that can occur, regardless of what other differences might be present. A single component can have multiple types of differences simultaneously.
What this means: The chemical component name (_chem_comp.name) differs between the two sources.
How it's detected: The script normalizes multi-line values by removing newlines (they're formatting artifacts, not actual content) and normalizing case before comparison, so only real content differences are reported. Extra spaces within the name are preserved and will cause differences if present.
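A minimal sketch of this normalization, assuming that line breaks are dropped without inserting a replacement character and that case is folded to lowercase. The function name `normalize_name` is hypothetical; the actual implementation in ccd_sync.py may differ in detail.

```python
def normalize_name(value: str) -> str:
    """Normalize a _chem_comp.name value for comparison (illustrative only)."""
    # Line breaks are formatting artifacts, so drop them entirely.
    joined = value.replace("\r", "").replace("\n", "")
    # Fold case so "ALPHA" and "alpha" compare equal.
    # Spaces inside the name are deliberately left untouched.
    return joined.lower()


# A wrapped, upper-case name matches its single-line, mixed-case form.
print(normalize_name("ALPHA-\nD-GLUCOSE") == normalize_name("alpha-D-glucose"))  # True

# An extra internal space is preserved and therefore reported as a difference.
print(normalize_name("butanoic  acid") == normalize_name("butanoic acid"))  # False
```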
Example (CCD code: P00):
- Set 1 (wwPDB): (2S)-2-azanyl-4-[(E)-[2-methyl-3-oxidanyl-5-(phosphonooxymethyl)pyridin-4-yl]methylideneamino]oxy-butanoic acid
- Set 2 (CCP4): (2S)-2-azanyl-4-[(E)-[2-methyl-3-oxidanyl-5-(phosphonooxymethyl)pyridin-4-yl]methylideneamino]oxy-butanoic acid (Note: has extra spaces before "acid" - this is a formatting difference)
Another example (CCD code: GU0):
- Set 1: "2,3,6-tri-O-sulfonato-beta-D-glucopyranose"
- Set 2: "2,3,6-TRI-O-SULFONATO-ALPHA-L-GALACT"

(Note: different stereochemistry and sugar type)
Common causes:
- Different naming conventions
- Spelling variations
- Truncation or abbreviation differences
- Different punctuation or formatting (if not normalized)
- Updates to naming standards over time
- One source uses a full name while the other uses an abbreviation
Note: Name differences often occur alongside other differences (type, atoms, bonds, etc.). Pure "name only" differences (where structure and all other fields match) are relatively rare in practice.
What this means: The component type/group classification differs between the two sources.
How it's detected:
- Set 1 uses _chem_comp.type
- Set 2 uses _chem_comp.group
These should represent the same classification (e.g., "L-PEPTIDE LINKING", "D-PEPTIDE LINKING", "NON-POLYMER", etc.), but the values don't match.
Example (CCD code: 2A0):
- Set 1 (wwPDB): _chem_comp.type = "peptide-like"
- Set 2 (CCP4): _chem_comp.group = "NON-POLYMER"
Another example (CCD code: 060):
- Set 1: "D-peptide linking"
- Set 2: "peptide"
Common causes:
- Different classification systems
- Updates to type definitions
- Missing or null values in one source
- Reclassification without structural changes
Note: Type differences often occur alongside structural differences (atoms and bonds), as different classifications may reflect different structural representations.
What this means: The atom definitions differ between the two sources.
How it's detected: Atoms are compared as complete sets, where each atom is defined by:
- atom_id: The identifier for the atom (e.g., "C1", "N2")
- type_symbol: The element symbol (e.g., "C", "N", "O")
- charge: The formal charge on the atom
Since atoms are compared as sets, if one source has an extra atom or is missing an atom, this will show as N.
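The set-based comparison can be illustrated as follows. This is a sketch under the assumption that each atom is reduced to an (atom_id, type_symbol, charge) tuple; the `Atom` class and the `diff_atoms` and `format_atom` helpers are hypothetical names, not functions from ccd_sync.py. The sample atoms are the ones quoted for CCD code 040 in this document.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    atom_id: str
    type_symbol: str
    charge: int


def diff_atoms(set1, set2):
    """Return (atoms only in set1, atoms only in set2), each sorted by atom_id."""
    s1, s2 = set(set1), set(set2)
    return sorted(s1 - s2), sorted(s2 - s1)


def format_atom(atom: Atom) -> str:
    """Format an atom as ATOM_ID(TYPE,CHARGE), as in the detailed CSV."""
    return f"{atom.atom_id}({atom.type_symbol},{atom.charge})"


wwpdb_atoms = [Atom("HO10", "H", 0), Atom("O10", "O", 0)]
ccp4_atoms = [Atom("O10", "O", -1)]

only_wwpdb, only_ccp4 = diff_atoms(wwpdb_atoms, ccp4_atoms)
print("; ".join(format_atom(a) for a in only_wwpdb))  # HO10(H,0); O10(O,0)
print("; ".join(format_atom(a) for a in only_ccp4))   # O10(O,-1)

# atom_identical would be Y only when both difference lists are empty.
atom_identical = "Y" if not only_wwpdb and not only_ccp4 else "N"
```

Because whole tuples are compared, an atom whose charge changes shows up on both sides, once with each charge.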
Example (CCD code: 040):
- Set 1 (wwPDB): Has atoms HO10(H,0) and O10(O,0) (neutral hydroxyl group)
- Set 2 (CCP4): Has atom O10(O,-1) (ionized, charge -1) and no hydrogen atom HO10
Another example (CCD code: 060):
- Set 1: Has HXT(H,0), N(N,0), and OXT(O,0)
- Set 2: Has H3(H,0), N(N,1), and OXT(O,-1)

(Different atom identifiers and charges)
Another example (CCD code: 2J0):
- Set 1: N1(N,0); N12(N,0); N2(N,0); N5(N,0); N8(N,0); N9(N,0); RU(RU,2)
- Set 2: N1(N,1); N12(N,1); N2(N,1); N5(N,1); N8(N,1); N9(N,1); RU(RU,0.00)

(All nitrogens have charge +1 in Set 2, ruthenium has charge 0.00 vs 2)
Common causes:
- Missing atoms in one source
- Extra atoms in one source
- Different atom identifiers (e.g., HXT vs H3)
- Different charge values
- Different element types for the same position
- Different protonation states (e.g., -OH vs -O⁻)
Note: Atom differences often occur alongside bond differences, as missing or extra atoms will also affect bond connectivity.
What this means: The bond definitions differ between the two sources.
How it's detected: Bonds are compared as complete sets, where each bond is defined by:
- atom_id_1: First atom in the bond
- atom_id_2: Second atom in the bond
- value_order/type: Bond order (SING/SINGLE, DOUB/DOUBLE, etc.)
- pdbx_aromatic_flag/aromatic: Whether the bond is aromatic
Important normalization:
- The script normalizes bond atom ordering before comparison. Bonds are treated as undirected, so "C-OXT" and "OXT-C" are considered the same bond.
- Formatting differences like SING vs SINGLE or DOUB vs DOUBLE are automatically normalized and will not appear as differences.
- Only bonds that are truly different (different atoms, different order, or different aromaticity) are reported as differences.
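The normalization rules above can be sketched as follows. This is an illustration only: `normalize_bond`, `bond_set`, and `ORDER_ALIASES` are hypothetical names, the aromatic flag is simplified to a boolean, and only the two order aliases named in this document are mapped. The sample bonds reuse atom names from the examples in this document and are not a complete component.

```python
# Long-form bond orders mapped to the short form used for display.
ORDER_ALIASES = {"SINGLE": "SING", "DOUBLE": "DOUB"}


def normalize_bond(atom1: str, atom2: str, order: str, aromatic: bool):
    """Return a comparison key that ignores atom ordering and order spelling."""
    # Bonds are undirected, so store the two atom IDs in sorted order.
    first, second = sorted((atom1, atom2))
    order_key = order.strip().upper()
    return (first, second, ORDER_ALIASES.get(order_key, order_key), aromatic)


def bond_set(raw_bonds):
    """Build a set of normalized bond keys from (atom1, atom2, order, aromatic) tuples."""
    return {normalize_bond(*bond) for bond in raw_bonds}


set1 = bond_set([("C", "OXT", "SING", False), ("HXT", "OXT", "SING", False)])
set2 = bond_set([("OXT", "C", "SINGLE", False), ("H3", "N", "SING", False)])

# C-OXT and OXT-C collapse to one key, so only the real differences remain.
print(sorted(set1 - set2))  # [('HXT', 'OXT', 'SING', False)]
print(sorted(set2 - set1))  # [('H3', 'N', 'SING', False)]
```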
Example (CCD code: 040):
- Set 1: Has bond HO10-O10(SING,non-aromatic) (hydrogen bonded to oxygen)
- Set 2: Missing the HO10-O10 bond (because there's no HO10 atom)
Another example (CCD code: 090): Different bond orders for the same aromatic ring bonds:
- Set 1: CAC-CAD(DOUB,aromatic); CAC-CAI(SING,aromatic); CAD-CAE(SING,aromatic); CAE-CAF(DOUB,aromatic); ...
- Set 2: CAC-CAD(SING,aromatic); CAC-CAI(DOUB,aromatic); CAD-CAE(DOUB,aromatic); CAE-CAF(SING,aromatic); ...
This shows that the same aromatic ring bonds have different bond order assignments (single vs double) between the two sources, even though the atoms and aromaticity are the same. This is a common issue with aromatic ring representation where different sources assign different Kekulé structures to the same aromatic system.
Another example (CCD code: 2J0): Many bonds differ in aromaticity assignment:
- Set 1: Bonds marked as aromatic
- Set 2: Same bonds marked as non-aromatic
Another example (CCD code: 060):
- Set 1: HXT-OXT(SING,non-aromatic) (bond between HXT and OXT)
- Set 2: H3-N(SING,non-aromatic) (bond between H3 and N)

(Different bonds due to different atom sets)
Common causes:
- Different bond orders (single vs. double, etc.)
- Missing bonds in one source
- Extra bonds in one source
- Different aromaticity assignments
- Different Kekulé structure representations for aromatic rings
Note: Bond differences are the most common type of difference in the dataset (67.9% of all components). They often occur alongside atom differences.
What this means: The descriptor definitions differ between the two sources.
How it's detected: Descriptors are compared as complete sets, where each descriptor is defined by:
- type: Descriptor type (e.g., "SMILES", "InChI")
- program: Program that generated the descriptor
- program_version: Version of the program
- descriptor: The actual descriptor string
Example (CCD code: 190):
- Set 1: Has minimal descriptors: ( ): ; InChI(InChI 1.03): (empty InChI)
- Set 2: Has full descriptors: InChI(InChI 1.03): InChI=1S/C31H37BrFNO5S/c1-20(2)15-27(40(37,38)2... (complete InChI string)
Another example (CCD code: 2J0):
- Set 2: Has extensive descriptors (InChI, SMILES, SMILES_CANONICAL, etc.)
- Set 1: Has minimal descriptors
Common causes:
- Different descriptor programs or versions
- Missing descriptors in one source
- Extra descriptors in one source
- Different descriptor formats (though content may be equivalent)
- Different program versions generating the same descriptor type
Note: The script removes quotes from descriptor values before comparison, so quote differences should not cause issues. Descriptor differences are often just metadata differences and are generally less critical than structural differences.
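A rough sketch of the quote handling and set comparison, assuming that only a matching pair of surrounding quotes is stripped. The helpers `strip_quotes` and `descriptor_key` are hypothetical and the real script may remove quotes differently. The InChI string is that of methane, used purely as an illustration and not taken from the dataset.

```python
def strip_quotes(value: str) -> str:
    """Remove one pair of matching surrounding quotes, if present (assumed behaviour)."""
    text = value.strip()
    if len(text) >= 2 and text[0] == text[-1] and text[0] in ("'", '"'):
        return text[1:-1]
    return text


def descriptor_key(type_: str, program: str, program_version: str, descriptor: str):
    """Build the tuple used to compare descriptors as a set."""
    return (type_, program, program_version, strip_quotes(descriptor))


set1 = {descriptor_key("InChI", "InChI", "1.03", '"InChI=1S/CH4/h1H4"')}
set2 = {descriptor_key("InChI", "InChI", "1.03", "InChI=1S/CH4/h1H4")}

# The quoted and unquoted forms compare equal, so no difference is reported.
print(set1 == set2)  # True
```

Note that program and program_version are part of the key, so a descriptor regenerated by a newer program version counts as a difference even when the descriptor string itself is unchanged.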
Pattern: Y,Y,Y,Y,Y,Y
All fields match perfectly between Set 1 and Set 2.
Example from your data:
ccd_code: 000
name_identical: Y
type_identical: Y
atom_identical: Y
bond_identical: Y
descriptor_identical: Y
overall_identical: Y
What this means: The chemical component definition is identical in both sources. The dates may differ (indicating when each source was last updated), but the actual chemical data matches.
In practice, components often have multiple types of differences simultaneously. Here are some common examples:
CCD code: 040 - Pattern: Y,Y,N,N,Y,N
- Name: Matches
- Type: Matches
- Atoms: Differ (Set 1 has HO10, Set 2 doesn't)
- Bonds: Differ (Set 1 has HO10-O10 bond, Set 2 doesn't)
- Descriptors: Match
This represents a significant structural difference: Set 1 shows O10 as a neutral hydroxyl group (-OH), while Set 2 shows it as a deprotonated, negatively charged oxygen (-O⁻) with no attached hydrogen.
CCD code: 060 - Pattern: Y,N,N,N,Y,N
- Name: Matches
- Type: Differs (Set 1: "D-peptide linking", Set 2: "peptide")
- Atoms: Differ (different atom identifiers and charges)
- Bonds: Differ (different bonds due to different atom sets)
- Descriptors: Match
CCD code: GU0 - Pattern: N,N,Y,Y,Y,N
- Name: Differs (different stereochemistry and sugar type)
- Type: Differs (Set 1: "D-saccharide, beta linking", Set 2: "pyranose")
- Atoms: Match
- Bonds: Match
- Descriptors: Match
The naming and classification differ, but the actual chemical structure is the same.
CCD code: 0H0 - Pattern: Y,N,N,N,N,N
- Name: Matches
- Type: Differs
- Atoms: Differ
- Bonds: Differ
- Descriptors: Differ
This represents a case where almost everything differs except the name, suggesting significant divergence between the two sources.
CCD code: 2J0 - Pattern: N,Y,N,N,N,N
- Name: Differs (Set 1 has name, Set 2 is empty)
- Type: Matches
- Atoms: Differ (different charges)
- Bonds: Differ (different aromaticity assignments)
- Descriptors: Differ (Set 2 has extensive descriptors, Set 1 has minimal)
Based on comparison_results_20260108_121910.csv:
- Total components compared: 1,000
- Completely identical: Varies by dataset
Breakdown by difference type:
- Name differences: ~4.8% of components
- Type differences: ~10.6% of components
- Atom differences: ~36.0% of components
- Bond differences: ~67.9% of components (most common)
- Descriptor differences: ~24.6% of components
Key observations:
- Bond differences are the most common
- Atom differences are the second most common
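- Components with atom differences often also have bond differences; because bond differences (~67.9%) are nearly twice as common as atom differences (~36.0%), at most about 53% of the components with bond differences can also have atom differences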
- Multiple types of differences often occur together
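A breakdown like the one above can be recomputed from any comparison CSV with the standard csv module. The sketch below is an assumption about how such a tally might look, not the code in analyze_comparison_results.py; `difference_rates` is a hypothetical helper, and the inline sample contains only the four flag patterns quoted in this document, with the date columns omitted for brevity.

```python
import csv
import io

FIELDS = (
    "name_identical",
    "type_identical",
    "atom_identical",
    "bond_identical",
    "descriptor_identical",
)


def difference_rates(csv_file) -> dict:
    """Return the percentage of rows flagged N for each per-field column."""
    counts = {field: 0 for field in FIELDS}
    total = 0
    for row in csv.DictReader(csv_file):
        total += 1
        for field in FIELDS:
            if row[field] == "N":
                counts[field] += 1
    return {f: (100.0 * counts[f] / total if total else 0.0) for f in FIELDS}


SAMPLE = """ccd_code,name_identical,type_identical,atom_identical,bond_identical,descriptor_identical,overall_identical
000,Y,Y,Y,Y,Y,Y
040,Y,Y,N,N,Y,N
060,Y,N,N,N,Y,N
GU0,N,N,Y,Y,Y,N
"""

rates = difference_rates(io.StringIO(SAMPLE))
for field, pct in rates.items():
    print(f"{field}: {pct:.1f}%")
```

To run it against a real results file, pass an open file object instead of the `io.StringIO` sample, for example `open(path, newline="")`.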
The dates in the output provide context for when differences might have been introduced:
- wwpdb_modified_date: When the component was last modified in the wwPDB CCD source
- ccp4_modified_date: When the component was last committed to the CCP4 Monomer Library on GitHub
Interpreting dates:
- If wwpdb_modified_date is more recent than ccp4_modified_date, the wwPDB source may have more recent updates
- If ccp4_modified_date is more recent, the CCP4 source may have more recent updates
- Large date differences may indicate one source is outdated
Example:
ccd_code: 000
wwpdb_modified_date: 2025-08-29
ccp4_modified_date: 2022-04-05
This shows the wwPDB source was updated in 2025, while the CCP4 source hasn't been updated since 2022. Despite this, the component is identical (all Y values), suggesting the 2025 update didn't change the chemical data.
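The date interpretation above can be automated with a small helper. This is a sketch that assumes both columns hold ISO-formatted dates (YYYY-MM-DD), as in the example; `newer_source` is a hypothetical name, not part of ccd_sync.py.

```python
from datetime import date


def newer_source(wwpdb_modified_date: str, ccp4_modified_date: str):
    """Return (which source is newer, gap in days) for two ISO date strings."""
    wwpdb = date.fromisoformat(wwpdb_modified_date)
    ccp4 = date.fromisoformat(ccp4_modified_date)
    gap_days = abs((wwpdb - ccp4).days)
    if wwpdb > ccp4:
        newer = "wwPDB"
    elif ccp4 > wwpdb:
        newer = "CCP4"
    else:
        newer = "same"
    return newer, gap_days


# Dates from the CCD code 000 example above.
newer, gap = newer_source("2025-08-29", "2022-04-05")
print(f"Newer source: {newer}, gap: {gap} days")
```

The gap in days can then be used to sort components when prioritizing review, although a large gap alone does not imply that the chemical data differs.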
- Review bond differences first - These are the most common and may indicate structural representation issues.
- Check atom differences - Often accompany bond differences and may indicate missing or extra atoms.
- Investigate type differences - May indicate classification system updates or errors.
- Examine name differences - Usually less critical but may indicate naming standard updates.
- Review descriptor differences - Often just metadata differences (program versions, etc.) and less critical.
- Use dates to prioritize - Components with large date differences may need more urgent attention.
- The comparison is case-insensitive (values are normalized to lowercase)
- Bond order values are automatically mapped (SING ↔ SINGLE, DOUB ↔ DOUBLE)
- Bond atom ordering is normalized: "C-OXT" and "OXT-C" are treated as the same bond
- Only true bond differences are reported (bonds that differ only in atom ordering are not shown)
- Multi-line values (like names) are normalized to single lines for display
- Quotes are removed from descriptor values before comparison
- Sets are compared, so order doesn't matter for atoms, bonds, or descriptors
- Missing fields in one source will cause differences (N values)
- Set 2 bond types are extracted from either the _chem_comp_bond.type or the _chem_comp_bond.value_order field
- See README_ccd_sync.md for information about running comparisons with ccd_sync.py
- See README_analyze_comparison_results.md for analyzing these results statistically with analyze_comparison_results.py
- See create_detailed_comparison.py for creating detailed comparison CSVs with actual difference values