Understanding Comparison Results - Difference Types

This document explains all possible types of differences that can appear in the comparison results CSV file generated by ccd_sync.py. Each row in the output represents a comparison between a CCD file from Set 1 (wwPDB CCD) and Set 2 (CCP4 Monomer Library).

Output Format

The comparison results CSV contains the following columns:

ccd_code: The CCD identifier (e.g., "000", "A1A15")
name_identical: Y/N - Whether the chemical component name matches
type_identical: Y/N - Whether the component type/group matches
atom_identical: Y/N - Whether all atoms match
bond_identical: Y/N - Whether all bonds match
descriptor_identical: Y/N - Whether all descriptors match
overall_identical: Y/N - Whether ALL fields match (all must be Y for overall to be Y)
wwpdb_modified_date: Last modification date from Set 1 (wwPDB CCD)
ccp4_modified_date: Last commit date from Set 2 (CCP4 Monomer Library)
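
As an illustration of the column layout above, the following is a minimal sketch of loading the results CSV with Python's standard csv module and listing the components that differ. The sample text, the column order, and the helper name load_results are assumptions made for this example (the dates for 040 are left blank because they are not given here); a real run would open the file written by ccd_sync.py.

```python
import csv
import io

# Hypothetical sample in the column layout described above; a real run would
# open the CSV written by ccd_sync.py instead of this in-memory string.
SAMPLE = (
    "ccd_code,name_identical,type_identical,atom_identical,bond_identical,"
    "descriptor_identical,overall_identical,wwpdb_modified_date,ccp4_modified_date\n"
    "000,Y,Y,Y,Y,Y,Y,2025-08-29,2022-04-05\n"
    "040,Y,Y,N,N,Y,N,,\n"
)


def load_results(handle):
    """Return one dict per component, keyed by column name."""
    return list(csv.DictReader(handle))


rows = load_results(io.StringIO(SAMPLE))

# Codes of every component with at least one differing field.
differing = [row["ccd_code"] for row in rows if row["overall_identical"] == "N"]
print(differing)
```

Reading the file with csv.DictReader keeps every value as a string, so codes such as "000" keep their leading zeros.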

Detailed Comparison CSV

The create_detailed_comparison.py script creates an enhanced version of the comparison CSV that includes the actual values for fields that differ. This detailed CSV includes additional columns:

set1__chem_comp.name, set2__chem_comp.name: The actual name values when names differ
set1__chem_comp.type, set2__chem_comp.group: The actual type/group values when types differ
set1_atoms, set2_atoms: Only the atoms that differ (formatted as "ATOM_ID(TYPE,CHARGE)")
set1_bonds, set2_bonds: Only the bonds that differ (formatted as "ATOM1-ATOM2(ORDER,AROMATIC)")
set1_descriptors, set2_descriptors: Only the descriptors that differ

Important features of the detailed comparison:

Only shows differences, not the entire set (makes it easier to see what actually differs)
Bond atom ordering is normalized (only true differences are shown)
Multi-line values are displayed on a single line
Bond orders are normalized for display (SINGLE/SING → SING, DOUBLE/DOUB → DOUB)
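
The "only shows differences" behaviour can be sketched as two set differences followed by formatting. This is a hedged illustration, not the code of create_detailed_comparison.py: the function names, the "; " separator (taken from the examples later in this document), the sort order, and the shared atom C1 are assumptions.

```python
def format_atom(atom):
    """Render an (atom_id, type_symbol, charge) tuple as ATOM_ID(TYPE,CHARGE)."""
    atom_id, type_symbol, charge = atom
    return f"{atom_id}({type_symbol},{charge})"


def difference_columns(set1_atoms, set2_atoms):
    """Return the set1_atoms and set2_atoms cell values: differing atoms only."""
    only_in_set1 = sorted(set1_atoms - set2_atoms)
    only_in_set2 = sorted(set2_atoms - set1_atoms)
    return (
        "; ".join(format_atom(a) for a in only_in_set1),
        "; ".join(format_atom(a) for a in only_in_set2),
    )


# Atoms from the 040 example in this document, plus a hypothetical shared atom C1.
set1 = {("C1", "C", "0"), ("O10", "O", "0"), ("HO10", "H", "0")}
set2 = {("C1", "C", "0"), ("O10", "O", "-1")}

set1_cell, set2_cell = difference_columns(set1, set2)
print(set1_cell)
print(set2_cell)
```

The shared atom C1 is absent from both cells, while O10 appears in both because its charge differs.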

Understanding the Values

Y = Yes, the fields match (identical)
N = No, the fields differ (not identical)

The overall_identical column will be Y only if ALL of the following are Y: name, type, atom, bond, and descriptor. If any single field is N, then overall_identical will be N.
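
The rule above amounts to a logical AND over the five field columns. A minimal sketch, assuming each row is a dict keyed by the column names listed under Output Format (the function name is hypothetical):

```python
FIELD_COLUMNS = (
    "name_identical",
    "type_identical",
    "atom_identical",
    "bond_identical",
    "descriptor_identical",
)


def overall_identical(row):
    """Return "Y" only when every individual field column is "Y"."""
    return "Y" if all(row[col] == "Y" for col in FIELD_COLUMNS) else "N"


all_match = dict.fromkeys(FIELD_COLUMNS, "Y")
one_differs = {**all_match, "bond_identical": "N"}

print(overall_identical(all_match))
print(overall_identical(one_differs))
```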

Types of Differences

This section explains each type of difference that can occur, regardless of what other differences might be present. A single component can have multiple types of differences simultaneously.

1. Name Differences

What this means: The chemical component name (_chem_comp.name) differs between the two sources.

How it's detected: The script normalizes multi-line values by removing newlines (they're formatting artifacts, not actual content) and normalizing case before comparison, so only real content differences are reported. Extra spaces within the name are preserved and will cause differences if present.
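
The normalization described above can be sketched as follows. This is an assumption-based illustration rather than the script's actual code: it assumes line breaks are dropped without inserting a space, and the function names are hypothetical.

```python
def normalize_name(value):
    """Drop line breaks and fold case; spaces inside the name are kept."""
    return value.replace("\r", "").replace("\n", "").lower()


def names_identical(name1, name2):
    """Return "Y" when the normalized names are equal, otherwise "N"."""
    return "Y" if normalize_name(name1) == normalize_name(name2) else "N"


# Case and line breaks alone do not produce a difference.
print(names_identical("Butanoic\nAcid", "BUTANOIC ACID".replace(" ", "") ))
# An extra internal space does produce a difference.
print(names_identical("butanoic acid", "butanoic  acid"))
```

In the first call the line break is removed from "Butanoic\nAcid", giving "butanoicacid" on both sides, so the result is Y. In the second call the doubled space is preserved, so the result is N.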

Example (CCD code: P00):

Set 1 (wwPDB): (2S)-2-azanyl-4-[(E)-[2-methyl-3-oxidanyl-5-(phosphonooxymethyl)pyridin-4-yl]methylideneamino]oxy-butanoic acid
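Set 2 (CCP4): (2S)-2-azanyl-4-[(E)-[2-methyl-3-oxidanyl-5-(phosphonooxymethyl)pyridin-4-yl]methylideneamino]oxy-butanoic acid (Note: Set 2 has extra spaces before "acid", which may be collapsed in rendered text. Because spaces inside a name are preserved during comparison, this formatting difference is reported as N.)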

Another example (CCD code: GU0):

Set 1: "2,3,6-tri-O-sulfonato-beta-D-glucopyranose"
Set 2: "2,3,6-TRI-O-SULFONATO-ALPHA-L-GALACT" (Note: different stereochemistry and sugar type)

Common causes:

Different naming conventions
Spelling variations
Truncation or abbreviation differences
Different punctuation or formatting (if not normalized)
Updates to naming standards over time
One source uses a full name while the other uses an abbreviation

Note: Name differences often occur alongside other differences (type, atoms, bonds, etc.). Pure "name only" differences (where structure and all other fields match) are relatively rare in practice.

2. Type/Group Differences

What this means: The component type/group classification differs between the two sources.

How it's detected:

Set 1 uses _chem_comp.type
Set 2 uses _chem_comp.group

These two fields are expected to carry the same classification (e.g., "L-PEPTIDE LINKING", "D-PEPTIDE LINKING", "NON-POLYMER", etc.). A type difference is reported when their values don't match.

Example (CCD code: 2A0):

Set 1 (wwPDB): _chem_comp.type = "peptide-like"
Set 2 (CCP4): _chem_comp.group = "NON-POLYMER"

Another example (CCD code: 060):

Set 1: "D-peptide linking"
Set 2: "peptide"

Common causes:

Different classification systems
Updates to type definitions
Missing or null values in one source
Reclassification without structural changes

Note: Type differences often occur alongside structural differences (atoms and bonds), as different classifications may reflect different structural representations.

3. Atom Differences

What this means: The atom definitions differ between the two sources.

How it's detected: Atoms are compared as complete sets, where each atom is defined by:

atom_id: The identifier for the atom (e.g., "C1", "N2")
type_symbol: The element symbol (e.g., "C", "N", "O")
charge: The formal charge on the atom

Since atoms are compared as sets, if one source has an extra atom or is missing an atom, this will show as N.
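
A minimal sketch of the set comparison described above, assuming each atom is represented as an (atom_id, type_symbol, charge) tuple of strings. The function name and the shared atom CA are hypothetical; the other atoms come from the 060 example below.

```python
def atoms_identical(set1_atoms, set2_atoms):
    """Return "Y" when both sources define exactly the same set of atoms."""
    return "Y" if set(set1_atoms) == set(set2_atoms) else "N"


set1 = [("CA", "C", "0"), ("N", "N", "0"), ("OXT", "O", "0"), ("HXT", "H", "0")]
set2 = [("CA", "C", "0"), ("N", "N", "1"), ("OXT", "O", "-1"), ("H3", "H", "0")]

# Row order in the source file does not matter, only set membership.
print(atoms_identical(set1, list(reversed(set1))))
# A changed charge or a renamed atom makes the sets unequal.
print(atoms_identical(set1, set2))
```

Because the whole tuple is the set element, an atom whose charge changes counts as a different atom on both sides, not as a modified one.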

Example (CCD code: 040):

Set 1 (wwPDB): Has atoms HO10(H,0) and O10(O,0) (neutral hydroxyl group)
Set 2 (CCP4): Has atom O10(O,-1) (ionized, charge -1) and no hydrogen atom HO10

Another example (CCD code: 060):

Set 1: Has HXT(H,0), N(N,0), and OXT(O,0)
Set 2: Has H3(H,0), N(N,1), and OXT(O,-1) (Different atom identifiers and charges)

Another example (CCD code: 2J0):

Set 1: N1(N,0); N12(N,0); N2(N,0); N5(N,0); N8(N,0); N9(N,0); RU(RU,2)
Set 2: N1(N,1); N12(N,1); N2(N,1); N5(N,1); N8(N,1); N9(N,1); RU(RU,0.00) (All nitrogens have charge +1 in Set 2, ruthenium has charge 0.00 vs 2)

Common causes:

Missing atoms in one source
Extra atoms in one source
Different atom identifiers (e.g., HXT vs H3)
Different charge values
Different element types for the same position
Different protonation states (e.g., -OH vs -O⁻)

Note: Atom differences often occur alongside bond differences, as missing or extra atoms will also affect bond connectivity.

4. Bond Differences

What this means: The bond definitions differ between the two sources.

How it's detected: Bonds are compared as complete sets, where each bond is defined by:

atom_id_1: First atom in the bond
atom_id_2: Second atom in the bond
value_order/type: Bond order (SING/SINGLE, DOUB/DOUBLE, etc.)
pdbx_aromatic_flag/aromatic: Whether the bond is aromatic

Important normalization:

The script normalizes bond atom ordering before comparison. Bonds are treated as undirected, so "C-OXT" and "OXT-C" are considered the same bond.
Formatting differences like SING vs SINGLE or DOUB vs DOUBLE are automatically normalized and will not appear as differences.
Only bonds that are truly different (different atoms, different order, or different aromaticity) are reported as differences.
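
The normalization steps above can be sketched as a function that maps each bond to a canonical key before the sets are compared. This is an illustration under stated assumptions, not the script's code: the names bond_key and ORDER_ALIASES are hypothetical, and the aromatic flag is assumed to be a "Y"/"N" style string in either case.

```python
# Long and short spellings of the same bond order map to one canonical form.
ORDER_ALIASES = {"SINGLE": "SING", "DOUBLE": "DOUB"}


def bond_key(atom1, atom2, order, aromatic):
    """Build an order-independent, spelling-independent key for one bond."""
    pair = tuple(sorted((atom1, atom2)))        # undirected: C-OXT == OXT-C
    order = order.strip().upper()
    order = ORDER_ALIASES.get(order, order)     # SINGLE -> SING, DOUBLE -> DOUB
    is_aromatic = str(aromatic).strip().upper() == "Y"
    return (pair, order, is_aromatic)


same = bond_key("C", "OXT", "SING", "N") == bond_key("OXT", "C", "single", "n")
different = bond_key("C", "OXT", "SING", "N") == bond_key("C", "OXT", "DOUB", "N")
print(same, different)
```

Comparing sets of such keys reports a bond only when the atoms, the order, or the aromaticity actually differ.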

Example (CCD code: 040):

Set 1: Has bond HO10-O10(SING,non-aromatic) (hydrogen bonded to oxygen)
Set 2: Missing the HO10-O10 bond (because there's no HO10 atom)

Another example (CCD code: 090): Different bond orders for the same aromatic ring bonds:

Set 1: CAC-CAD(DOUB,aromatic); CAC-CAI(SING,aromatic); CAD-CAE(SING,aromatic); CAE-CAF(DOUB,aromatic); ...
Set 2: CAC-CAD(SING,aromatic); CAC-CAI(DOUB,aromatic); CAD-CAE(DOUB,aromatic); CAE-CAF(SING,aromatic); ...

This shows that the same aromatic ring bonds have different bond order assignments (single vs double) between the two sources, even though the atoms and aromaticity are the same. This is a common issue with aromatic ring representation where different sources assign different Kekulé structures to the same aromatic system.
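
When triaging such cases, it can help to flag bonds that are aromatic in both sources and differ only in bond order. The following is a hypothetical triage helper, not part of ccd_sync.py; it assumes each source's bonds are held in a dict that maps a sorted atom pair to an (order, is_aromatic) tuple. The data is taken from the 090 example above.

```python
def kekule_only_pairs(bonds1, bonds2):
    """Return atom pairs that are aromatic in both sources but differ in order."""
    flagged = []
    for pair in bonds1.keys() & bonds2.keys():
        order1, aromatic1 = bonds1[pair]
        order2, aromatic2 = bonds2[pair]
        if aromatic1 and aromatic2 and order1 != order2:
            flagged.append(pair)
    return sorted(flagged)


set1 = {
    ("CAC", "CAD"): ("DOUB", True),
    ("CAC", "CAI"): ("SING", True),
    ("CAD", "CAE"): ("SING", True),
    ("CAE", "CAF"): ("DOUB", True),
}
set2 = {
    ("CAC", "CAD"): ("SING", True),
    ("CAC", "CAI"): ("DOUB", True),
    ("CAD", "CAE"): ("DOUB", True),
    ("CAE", "CAF"): ("SING", True),
}

print(kekule_only_pairs(set1, set2))
```

A component whose bond differences all fall into this list likely differs only in Kekulé representation, though that conclusion still needs to be checked by hand.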

Another example (CCD code: 2J0): Many bonds differ in aromaticity assignment:

Set 1: Bonds marked as aromatic
Set 2: Same bonds marked as non-aromatic

Another example (CCD code: 060):

Set 1: HXT-OXT(SING,non-aromatic) (bond between HXT and OXT)
Set 2: H3-N(SING,non-aromatic) (bond between H3 and N) (Different bonds due to different atom sets)

Common causes:

Different bond orders (single vs. double, etc.)
Missing bonds in one source
Extra bonds in one source
Different aromaticity assignments
Different Kekulé structure representations for aromatic rings

Note: Bond differences are the most common type of difference in the dataset (67.9% of all components). They often occur alongside atom differences.

5. Descriptor Differences

What this means: The descriptor definitions differ between the two sources.

How it's detected: Descriptors are compared as complete sets, where each descriptor is defined by:

type: Descriptor type (e.g., "SMILES", "InChI")
program: Program that generated the descriptor
program_version: Version of the program
descriptor: The actual descriptor string

Example (CCD code: 190):

Set 1: Has minimal descriptors: ( ): ; InChI(InChI 1.03): (empty InChI)
Set 2: Has full descriptors: InChI(InChI 1.03): InChI=1S/C31H37BrFNO5S/c1-20(2)15-27(40(37,38)2... (complete InChI string)

Another example (CCD code: 2J0):

Set 1: Has minimal descriptors
Set 2: Has extensive descriptors (InChI, SMILES, SMILES_CANONICAL, etc.)

Common causes:

Different descriptor programs or versions
Missing descriptors in one source
Extra descriptors in one source
Different descriptor formats (though content may be equivalent)
Different program versions generating the same descriptor type

Note: The script removes quotes from descriptor values before comparison, so quote differences should not cause issues. Descriptor differences are often just metadata differences and are generally less critical than structural differences.
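
A hedged sketch of the descriptor comparison: each descriptor becomes a four-part key, with surrounding quotes stripped from the descriptor string. The function names are hypothetical, only surrounding quotes are assumed to be removed, and no case folding is applied here (that choice is an assumption of this sketch). The version strings in the example are illustrative.

```python
def descriptor_key(type_, program, program_version, descriptor):
    """Build the comparison key for one descriptor row."""
    cleaned = descriptor.strip().strip("\"'")   # drop surrounding quotes only
    return (type_, program, program_version, cleaned)


def descriptors_identical(set1_rows, set2_rows):
    """Return "Y" when both sources hold the same set of descriptor keys."""
    keys1 = {descriptor_key(*row) for row in set1_rows}
    keys2 = {descriptor_key(*row) for row in set2_rows}
    return "Y" if keys1 == keys2 else "N"


quoted = [("InChI", "InChI", "1.03", '"InChI=1S/CH4/h1H4"')]
unquoted = [("InChI", "InChI", "1.03", "InChI=1S/CH4/h1H4")]
newer = [("InChI", "InChI", "1.06", "InChI=1S/CH4/h1H4")]

print(descriptors_identical(quoted, unquoted))
print(descriptors_identical(unquoted, newer))
```

The second comparison differs even though the descriptor string is the same, because the program version is part of the key. This is why descriptor differences are often metadata differences.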

Completely Identical Files

Pattern: Y,Y,Y,Y,Y,Y (in column order: name, type, atom, bond, descriptor, overall)

All fields match perfectly between Set 1 and Set 2.

Example from your data:

ccd_code: 000
name_identical: Y
type_identical: Y
atom_identical: Y
bond_identical: Y
descriptor_identical: Y
overall_identical: Y

What this means: The chemical component definition is identical in both sources. The dates may differ (indicating when each source was last updated), but the actual chemical data matches.

Example Combinations

In practice, components often have multiple types of differences simultaneously. Here are some common examples:
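
To see which combinations occur in a results file, the six Y/N columns of each row can be joined into a pattern string and grouped. A minimal sketch, assuming rows are dicts keyed by the column names from Output Format; the helper names are hypothetical and the three rows reuse patterns quoted in this document.

```python
from collections import defaultdict

COLUMNS = (
    "name_identical",
    "type_identical",
    "atom_identical",
    "bond_identical",
    "descriptor_identical",
    "overall_identical",
)


def pattern(row):
    """Join the six Y/N flags into a pattern string such as "Y,Y,N,N,Y,N"."""
    return ",".join(row[col] for col in COLUMNS)


def group_by_pattern(rows):
    """Map each pattern to the list of CCD codes that show it."""
    groups = defaultdict(list)
    for row in rows:
        groups[pattern(row)].append(row["ccd_code"])
    return dict(groups)


def make_row(code, flags):
    """Build a row dict from a code and a six-letter flag string (test helper)."""
    return {"ccd_code": code, **dict(zip(COLUMNS, flags))}


rows = [make_row("000", "YYYYYY"), make_row("040", "YYNNYN"), make_row("060", "YNNNYN")]
groups = group_by_pattern(rows)
print(groups)
```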

Example 1: Atom and Bond Differences (Most Common)

CCD code: 040 - Pattern: Y,Y,N,N,Y,N

Name: Matches
Type: Matches
Atoms: Differ (Set 1 has HO10, Set 2 doesn't)
Bonds: Differ (Set 1 has HO10-O10 bond, Set 2 doesn't)
Descriptors: Match

This represents a structural difference in protonation state: Set 1 shows O10 as a neutral hydroxyl group (-OH), while Set 2 shows it as a deprotonated, negatively charged oxygen (-O⁻) with no HO10 hydrogen.

Example 2: Type and Structural Differences

CCD code: 060 - Pattern: Y,N,N,N,Y,N

Name: Matches
Type: Differs (Set 1: "D-peptide linking", Set 2: "peptide")
Atoms: Differ (different atom identifiers and charges)
Bonds: Differ (different bonds due to different atom sets)
Descriptors: Match

Example 3: Name and Type Differences

CCD code: GU0 - Pattern: N,N,Y,Y,Y,N

Name: Differs (different stereochemistry and sugar type)
Type: Differs (Set 1: "D-saccharide, beta linking", Set 2: "pyranose")
Atoms: Match
Bonds: Match
Descriptors: Match

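The naming and classification differ, but the compared atom and bond records match. Because atoms are compared by identifier, element, and charge, and bonds by atom pair, order, and aromaticity, this result does not by itself confirm that the stereochemistry implied by the two names is the same.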

Example 4: Multiple Field Differences

CCD code: 0H0 - Pattern: Y,N,N,N,N,N

Name: Matches
Type: Differs
Atoms: Differ
Bonds: Differ
Descriptors: Differ

In this case every compared field except the name differs, suggesting significant divergence between the two sources.

Example 5: Extensive Differences

CCD code: 2J0 - Pattern: N,Y,N,N,N,N

Name: Differs (Set 1 has a name, Set 2's name is empty)
Type: Matches
Atoms: Differ (different charges)
Bonds: Differ (different aromaticity assignments)
Descriptors: Differ (Set 2 has extensive descriptors, Set 1 has minimal)

Statistics from Your Dataset

Based on comparison_results_20260108_121910.csv:

Total components compared: 1,000
Completely identical: Varies by dataset

Breakdown by difference type:

Name differences: ~4.8% of components
Type differences: ~10.6% of components
Atom differences: ~36.0% of components
Bond differences: ~67.9% of components (most common)
Descriptor differences: ~24.6% of components
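
Percentages like those above can be recomputed from any results file by counting the N values in each column. A minimal sketch with four hypothetical rows (the function and helper names are assumptions, and the sample values are not taken from the real dataset):

```python
FIELD_COLUMNS = (
    "name_identical",
    "type_identical",
    "atom_identical",
    "bond_identical",
    "descriptor_identical",
)


def difference_rates(rows):
    """Return the percentage of rows marked "N" for each field column."""
    total = len(rows)
    if total == 0:
        return {col: 0.0 for col in FIELD_COLUMNS}
    return {
        col: 100.0 * sum(row[col] == "N" for row in rows) / total
        for col in FIELD_COLUMNS
    }


def make_row(flags):
    """Build a row dict from a five-letter flag string (hypothetical helper)."""
    return dict(zip(FIELD_COLUMNS, flags))


sample_rows = [make_row("YYYYY"), make_row("YYNNY"), make_row("YYYNY"), make_row("YNNNN")]
rates = difference_rates(sample_rows)
print(rates)
```

The percentages do not sum to 100 because one component can differ in several fields at once.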

Key observations:

Bond differences are the most common
Atom differences are the second most common
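Atom differences often come with bond differences, since a missing or extra atom also changes the bond set. The reverse holds less often: with bond differences at ~67.9% and atom differences at ~36.0%, at least ~31.9% of components have bond differences without atom differences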
Multiple types of differences often occur together

Date Information

The dates in the output provide context for when differences might have been introduced:

wwpdb_modified_date: When the component was last modified in the wwPDB CCD source
ccp4_modified_date: When the component was last committed to the CCP4 Monomer Library on GitHub

Interpreting dates:

If wwpdb_modified_date is more recent than ccp4_modified_date, the wwPDB source may have more recent updates
If ccp4_modified_date is more recent, the CCP4 source may have more recent updates
Large date differences may indicate one source is outdated

Example:

ccd_code: 000
wwpdb_modified_date: 2025-08-29
ccp4_modified_date: 2022-04-05

This shows the wwPDB source was updated in 2025, while the CCP4 source hasn't been updated since 2022. Despite this, the component is identical (all Y values), suggesting the 2025 update didn't change the chemical data.
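
The gap between the two dates can be computed directly. A minimal sketch, assuming both columns hold ISO-formatted dates (YYYY-MM-DD) as in the example above; the function name is hypothetical.

```python
from datetime import date


def date_gap_days(wwpdb_modified_date, ccp4_modified_date):
    """Return days by which the wwPDB date is later than the CCP4 date.

    A positive result means the wwPDB entry was modified more recently;
    a negative result means the CCP4 entry was.
    """
    wwpdb = date.fromisoformat(wwpdb_modified_date)
    ccp4 = date.fromisoformat(ccp4_modified_date)
    return (wwpdb - ccp4).days


gap = date_gap_days("2025-08-29", "2022-04-05")
print(gap, "days; wwPDB is newer" if gap > 0 else "days; CCP4 is newer or equal")
```

Sorting rows by the absolute value of this gap is one way to surface components where one source may be outdated.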

What to Do with Differences

Review bond differences first - These are the most common and may indicate structural representation issues.
Check atom differences - Often accompany bond differences and may indicate missing or extra atoms.
Investigate type differences - May indicate classification system updates or errors.
Examine name differences - Usually less critical but may indicate naming standard updates.
Review descriptor differences - Often just metadata differences (program versions, etc.) and less critical.
Use dates to prioritize - Components with large date differences may need more urgent attention.

Notes

The comparison is case-insensitive (values are normalized to lowercase)
Bond order values are automatically mapped (SING ↔ SINGLE, DOUB ↔ DOUBLE)
Bond atom ordering is normalized: "C-OXT" and "OXT-C" are treated as the same bond
Only true bond differences are reported (bonds that differ only in atom ordering are not shown)
Multi-line values (like names) are normalized to single lines for display
Quotes are removed from descriptor values before comparison
Sets are compared, so order doesn't matter for atoms, bonds, or descriptors
Missing fields in one source will cause differences (N values)
Set2 bond types are extracted from either _chem_comp_bond.type or _chem_comp_bond.value_order fields
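
The last note can be sketched as a lookup with a fallback. This is an assumption-based illustration: the function name is hypothetical, the bond row is assumed to be a dict keyed by mmCIF item names, and the order in which the two fields are tried is a guess. The values "." and "?" are the standard mmCIF placeholders for inapplicable and unknown values.

```python
# Fields that may hold the Set 2 bond order; the precedence is an assumption.
BOND_ORDER_FIELDS = ("_chem_comp_bond.type", "_chem_comp_bond.value_order")

# mmCIF placeholders that mean "no value here".
EMPTY_VALUES = (None, "", ".", "?")


def set2_bond_order(bond_row):
    """Return the first usable bond order found in a Set 2 bond row, else None."""
    for field in BOND_ORDER_FIELDS:
        value = bond_row.get(field)
        if value not in EMPTY_VALUES:
            return value
    return None


print(set2_bond_order({"_chem_comp_bond.type": "SINGLE"}))
print(set2_bond_order({"_chem_comp_bond.type": "?", "_chem_comp_bond.value_order": "DOUB"}))
print(set2_bond_order({}))
```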
