This tool imports metadata from mmCIF files into new metadata-only files or into existing models. It uses the gemmi library, with automatic method detection and method-specific CSV specification files.
Current version: 0.4.0 — see CHANGELOG for release notes.
Protein Data Bank in Europe (PDBe) · pdbe.org
The same content lives in the repository as docs/user-tutorial.html. The site root (docs/index.html) redirects there so https://pdbeurope.github.io/mmcif-metadata-import/ serves the tutorial.
From PyPI (recommended): installs the mmcif-metadata-import command-line tool.
pip install mmcif-metadata-import
From source (clone the repository): install dependencies, then run with python import_metadata.py using the same arguments as below.
pip install -r requirements.txt
A Jupyter notebook provides an interactive form (file upload, checkboxes, run button), so no command line or web hosting is needed.
Run in browser (no install):
Click the badge to open the notebook on mybinder.org (repo). The first launch may take a few minutes while the environment builds (gemmi install). Download any output files from the notebook links before closing the tab—Binder sessions are temporary.
Run locally:
- Install notebook dependencies:
pip install -r requirements-notebook.txt
- Start Jupyter (jupyter notebook or jupyter lab), open metadata_import.ipynb, and run all cells. Use the widgets to upload mmCIF files, select specifications, and run the import. Outputs are saved in notebook_output/.
mmcif-metadata-import <input_file> [--xray] [--xray_serial] [--em] [--nmr] [--macromolecules] [--citation] [--authors] [--funding] [--keywords] [-o output_file] [--merge_to_file target_file] [--overwrite-existing] [--log] [--no-macromolecule-safeguards]
- input_file: Input mmCIF file (supports .cif and .cif.V[ordinal] extensions)
- --xray: Optional flag to include X-ray specific categories from specs/XRAY.csv
- --xray_serial: Optional flag to include X-ray serial specific categories from specs/XRAY_SERIAL.csv
- --em: Optional flag to include electron microscopy specific categories from specs/EM.csv
- --nmr: Optional flag to include NMR specific categories from specs/NMR.csv
- --macromolecules: Optional flag to include macromolecules categories from specs/MACROMOLECULES.csv. When used together with --merge_to_file, the tool runs reference-vs-target safeguards so macromolecule metadata is only merged if polymer chains align (see Macromolecule merge safeguards).
- --no-macromolecule-safeguards: When merging with --macromolecules, skip reference-vs-target polymer checks (use only if you accept inconsistent macromolecule metadata).
- --citation: Optional flag to include citation categories from specs/CITATION.csv
- --authors: Optional flag to include author categories; the tool selects specs/AUTHORS*.csv from the merge target (if --merge_to_file) or the input file (details under --authors in Optional specification files). Falls back to specs/AUTHORS.csv if the method cannot be inferred.
- --funding: Optional flag to include funding categories from specs/FUNDING.csv
- --keywords: Optional flag to include keyword categories from specs/KEYWORDS.csv
- -o, --output: Optional output file name (default: [input_name]_metadata.cif)
- --merge_to_file: Optional file path to merge imported metadata into (instead of creating a new file). Metadata will be added to the first data block of the target file. The output file will be named <originalname>_merged_with_<inputfilename> in the same directory as the target file.
- --overwrite-existing: Optional; only valid with --merge_to_file. Remove conflicting pairs and loops from the merge target’s first data block, then insert imported metadata (default without this flag: skip tags that already exist in the target).
- --log: Optional flag to generate a log file with detailed information about the import process. The log file is automatically named based on the output file (same name with .log extension) and placed in the same directory as the output file.
Note: At least one specification file must be provided.
Exit codes (CLI):
- 0: Success; all requested categories that passed other rules were merged (including macromolecules when applicable).
- 1: Failure (e.g. unreadable file, no items to import, merge write error).
- 2: Merge succeeded for non-macromolecule categories, but macromolecule categories from MACROMOLECULES.csv were omitted because safeguards failed. Use --log to see details.
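A wrapper script can branch on these exit codes. The sketch below is one possible way to do that, not part of the tool: describe_exit_code and run_import are hypothetical helper names, the messages paraphrase the list above, and run_import assumes mmcif-metadata-import is installed and on PATH.

```python
import subprocess
from typing import List

# Meanings paraphrased from the exit-code list above.
EXIT_MEANINGS = {
    0: "success",
    1: "failure",
    2: "merged, but macromolecule categories were omitted by safeguards",
}


def describe_exit_code(code: int) -> str:
    """Map a CLI exit code to a short description (hypothetical helper)."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_import(args: List[str]) -> int:
    """Run the CLI with the given arguments and return its exit code.

    Assumes the mmcif-metadata-import command is available on PATH.
    """
    completed = subprocess.run(["mmcif-metadata-import", *args])
    return completed.returncode


# Example use (requires the tool and the two files to exist):
# code = run_import(["reference.cif", "--macromolecules",
#                    "--merge_to_file", "target.cif", "--log"])
# print(describe_exit_code(code))
```

Treating code 2 separately from code 1 lets a pipeline keep the partially merged file while flagging it for review.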
Python API: import_metadata() returns an ImportMetadataOutcome named tuple: ok, exit_status (same semantics as above), and optional safeguard_result. Do not rely on a plain boolean return value. Use optional keyword overwrite_existing (default False) to match --overwrite-existing. merge_metadata_to_file() returns MergeMetadataResult (success, skipped and overwritten category/item sets).
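The warning about boolean return values matters because any non-empty tuple is truthy in Python, so `if outcome:` would pass even for a failed import. The sketch below illustrates this with a stand-in class that only mirrors the three field names given above; OutcomeStandIn and summarize are hypothetical, and the real ImportMetadataOutcome may differ in defaults and types.

```python
from typing import Any, NamedTuple, Optional


class OutcomeStandIn(NamedTuple):
    """Stand-in with the three documented fields; NOT the real class."""
    ok: bool
    exit_status: int
    safeguard_result: Optional[Any] = None


def summarize(outcome) -> str:
    """Summarize an outcome by its fields rather than by truthiness."""
    if outcome.exit_status == 2:
        return "partial: macromolecule categories omitted"
    return "ok" if outcome.ok else "failed"


failed = OutcomeStandIn(ok=False, exit_status=1)
# A non-empty named tuple is truthy even when ok is False.
tuple_is_truthy = bool(failed)
```

Checking exit_status first sidesteps the question of how ok is set for a partial merge, which the text above does not specify.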
Merge Mode: When --merge_to_file is provided, the imported metadata will be merged into the first data block of the specified file. A new file will be created with the name pattern <originalname>_merged_with_<inputfilename> (e.g., if merging target.cif with metadata from input.cif, the output will be target_merged_with_input.cif). The original target file is not modified.
- Default: Metadata items are appended at the end of the first data block (text splice). Categories and items that already exist in the target are not copied; the log lists them as Categories not imported / Items not imported.
- With --overwrite-existing: Conflicting pairs and any loop that shares a column with the import are removed from the target’s first block, then all imported metadata for the request is added; the merged file is written with gemmi (first block rebuilt; further data_ blocks are copied). The log lists CATEGORIES OVERWRITTEN / ITEMS OVERWRITTEN instead of “not imported” for those tags.
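The difference between the two merge behaviours can be sketched at the level of individual tags. This is a simplification under stated assumptions: plan_merge is a hypothetical helper, tags are plain strings such as "_citation.title", and the loop-level rule (removing a whole loop that shares a column with the import) is not modelled.

```python
from typing import Set, Tuple


def plan_merge(target_tags: Set[str], import_tags: Set[str],
               overwrite: bool = False) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return (tags_to_add, tags_skipped, tags_overwritten) for a merge."""
    conflicts = target_tags & import_tags
    if overwrite:
        # Overwrite mode: every imported tag is written; conflicts are replaced.
        return set(import_tags), set(), conflicts
    # Default mode: tags already present in the target are left untouched.
    return import_tags - target_tags, conflicts, set()


target = {"_citation.title", "_exptl.method"}
incoming = {"_citation.title", "_citation.year"}
default_plan = plan_merge(target, incoming)
overwrite_plan = plan_merge(target, incoming, overwrite=True)
```

In this sketch the "skipped" set corresponds to what the log reports as not imported, and the "overwritten" set to what it reports as overwritten.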
If --merge_to_file is not provided, a new metadata file will be created as specified by -o/--output.
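The two naming patterns can be written down as a small helper. This is a sketch only: output_path is a hypothetical name, it assumes plain .cif file names (the .cif.V[ordinal] case is not handled), and it returns just a file name in default mode because the output directory is not stated here.

```python
from pathlib import Path
from typing import Optional


def output_path(input_file: str, merge_target: Optional[str] = None) -> Path:
    """Return the expected output path for default or merge mode."""
    input_stem = Path(input_file).stem
    if merge_target is None:
        # Default mode: [input_name]_metadata.cif
        return Path(f"{input_stem}_metadata.cif")
    target = Path(merge_target)
    # Merge mode: <originalname>_merged_with_<inputfilename>, next to the target
    return target.with_name(f"{target.stem}_merged_with_{input_stem}.cif")
```

For example, output_path("input.cif", "data/target.cif") gives data/target_merged_with_input.cif, matching the pattern described for merge mode.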
Method Validation: The script automatically detects the input file's method and validates method-specific flags. If you try to use --xray on an EM file, the script will warn you and skip the X-ray specification to prevent importing incompatible metadata.
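The validation step can be pictured as a filter over the requested flags. The sketch below is illustrative only: filter_method_flags and FLAG_METHODS are hypothetical names, the pairing of --xray_serial with the XRAY method is an assumption, and flags without a method restriction (such as --macromolecules) are always kept.

```python
from typing import Dict, List, Optional, Set, Tuple

# Assumed compatibility map; pairing xray_serial with XRAY is a guess.
FLAG_METHODS: Dict[str, Set[str]] = {
    "xray": {"XRAY"},
    "xray_serial": {"XRAY"},
    "em": {"EM_MAP_ONLY", "EM_MODEL_ONLY", "EM_MAP_MODEL"},
    "nmr": {"NMR"},
}


def filter_method_flags(requested: List[str],
                        detected_method: Optional[str]) -> Tuple[List[str], List[str]]:
    """Split requested flags into (kept, skipped) for the detected method."""
    kept, skipped = [], []
    for flag in requested:
        allowed = FLAG_METHODS.get(flag)
        if allowed is None or detected_method in allowed:
            kept.append(flag)  # optional flag, or method matches
        else:
            skipped.append(flag)  # method-specific flag that does not match
    return kept, skipped


kept, skipped = filter_method_flags(["em", "xray", "macromolecules"], "EM_MAP_ONLY")
```

Here the X-ray flag ends up in the skipped list while the EM and macromolecule flags are kept, which mirrors the EM-file example described above.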
# Basic usage with method-specific files
mmcif-metadata-import input.cif --xray
mmcif-metadata-import input.cif --xray_serial
mmcif-metadata-import input.cif --em
mmcif-metadata-import input.cif --nmr
# With custom output name
mmcif-metadata-import input.cif --xray -o custom_output.cif
# Using only optional specification files
mmcif-metadata-import input.cif --macromolecules
mmcif-metadata-import input.cif --citation --authors
mmcif-metadata-import input.cif --funding --keywords
# Combine method-specific with optional files
mmcif-metadata-import input.cif --em --macromolecules
mmcif-metadata-import input.cif --xray --citation --authors
mmcif-metadata-import input.cif --nmr --funding --keywords
# Multiple method-specific files
mmcif-metadata-import input.cif --xray --xray_serial --em --nmr
# All optional categories
mmcif-metadata-import input.cif --macromolecules --citation --authors --funding --keywords
# Everything together
mmcif-metadata-import input.cif --xray --em --nmr --macromolecules --citation --authors --funding --keywords
# Method validation example (EM file with X-ray flag - X-ray will be skipped)
mmcif-metadata-import em_file.cif --em --xray --macromolecules
# Output: "Warning: Skipping X-ray specification - input file method (EM_MAP_ONLY) doesn't match X-ray method"
# Merge metadata into an existing file (single data block)
mmcif-metadata-import input.cif --xray --merge_to_file target.cif
# Merge metadata into an existing file with multiple data blocks
mmcif-metadata-import input.cif --xray --merge_to_file target_multiple_datablocks.cif
# Metadata will be added to the first data block, before the second data block
# Generate a log file with detailed import information (automatically named input_metadata.log, after the output file)
mmcif-metadata-import input.cif --xray --log
# Combine merge with log file (log file automatically named based on merge output)
mmcif-metadata-import input.cif --xray --merge_to_file target.cif --log
# Log file will be: target_merged_with_input.log (same directory as target)
# Merge and replace metadata that already exists in the target (overwrite mode)
mmcif-metadata-import input.cif --xray --merge_to_file target.cif --overwrite-existing --log
# Macromolecules + merge: safeguards may skip only macromolecule categories (exit code 2)
mmcif-metadata-import reference.cif --macromolecules --merge_to_file target.cif --log
# Disable macromolecule safeguards when merging (not recommended for production)
mmcif-metadata-import reference.cif --macromolecules --merge_to_file target.cif --no-macromolecule-safeguards
When --macromolecules and --merge_to_file are both set, the reference file is the positional input_file and the target is --merge_to_file. Before copying categories from specs/MACROMOLECULES.csv, the tool checks that polymer chains in the two files match (same label_asym_id set, compatible residue counts and sequences). If the check fails, only those macromolecule categories are left out of the merge; other requested categories (e.g. the X-ray categories when --xray is combined with --macromolecules) are still merged, and the CLI exits with code 2.
- Rule codes in logs (ALIGN-1-ASYMM-SET, etc.) and what each check does: docs/macromolecule-safeguards.md.
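The kind of comparison described above can be sketched as follows. This is a rough illustration under loud assumptions: check_polymer_chains is a hypothetical helper, each file is reduced to a mapping from label_asym_id to a list of residue names, and "compatible" is treated as exact equality. The actual rules are the ones documented in docs/macromolecule-safeguards.md.

```python
from typing import Dict, List


def check_polymer_chains(reference: Dict[str, List[str]],
                         target: Dict[str, List[str]]) -> List[str]:
    """Return failure messages; an empty list means the check passed."""
    failures = []
    if set(reference) != set(target):
        # Chain identifiers must match before sequences can be compared.
        failures.append("label_asym_id sets differ")
        return failures
    for asym_id in sorted(reference):
        ref_seq, tgt_seq = reference[asym_id], target[asym_id]
        if len(ref_seq) != len(tgt_seq):
            failures.append(f"{asym_id}: residue count {len(ref_seq)} != {len(tgt_seq)}")
        elif ref_seq != tgt_seq:
            failures.append(f"{asym_id}: sequences differ")
    return failures


ref = {"A": ["MET", "ALA", "GLY"], "B": ["SER", "THR"]}
same = check_polymer_chains(ref, dict(ref))
shorter = check_polymer_chains(ref, {"A": ["MET", "ALA"], "B": ["SER", "THR"]})
```

A caller would merge macromolecule categories only when the returned list is empty and otherwise skip them while still merging the rest.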
The script automatically detects the source method (FROM) from the input mmCIF file based on:
- XRAY: exptl.method = "X-RAY DIFFRACTION"
- NMR: exptl.method = "SOLUTION NMR"
- EM_MAP_ONLY: exptl.method = "ELECTRON MICROSCOPY" + database_2.database_id contains "WWPDB" and "EMDB"
- EM_MODEL_ONLY: exptl.method = "ELECTRON MICROSCOPY" + database_2.database_id contains "WWPDB" and "PDB"
- EM_MAP_MODEL: exptl.method = "ELECTRON MICROSCOPY" + database_2.database_id contains "WWPDB", "PDB", and "EMDB"
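The rules above can be expressed as a small decision function. The sketch takes the method string and the database identifiers as plain values (in practice they would be read from the first data block), and detect_method is a hypothetical name. Checking the three-identifier case first is an assumption made here so that the map-only and model-only rules do not shadow it.

```python
from typing import Iterable, Optional


def detect_method(exptl_method: str, database_ids: Iterable[str]) -> Optional[str]:
    """Return a method label from the rules above, or None if nothing matches."""
    method = exptl_method.strip().upper()
    ids = {d.strip().upper() for d in database_ids}
    if method == "X-RAY DIFFRACTION":
        return "XRAY"
    if method == "SOLUTION NMR":
        return "NMR"
    if method == "ELECTRON MICROSCOPY" and "WWPDB" in ids:
        has_pdb, has_emdb = "PDB" in ids, "EMDB" in ids
        if has_pdb and has_emdb:
            return "EM_MAP_MODEL"
        if has_emdb:
            return "EM_MAP_ONLY"
        if has_pdb:
            return "EM_MODEL_ONLY"
    return None
```

Returning None for an unrecognised combination leaves it to the caller to decide on a fallback.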
All specification CSV files are located in the specs/ subdirectory.
The script uses simplified method-specific CSV files:
- specs/XRAY.csv - X-ray crystallography specific categories
- specs/XRAY_SERIAL.csv - X-ray serial specific categories
- specs/EM.csv - Electron microscopy specific categories
- specs/NMR.csv - Nuclear magnetic resonance specific categories
The script supports several optional flags that add additional categories from separate CSV files. These are merged with the method-specific specification file to provide comprehensive metadata import.
Contains macromolecule-related categories:
_entity, _entity_name_com, _entity_poly, _entity_poly_seq, _entity_src_nat, _entity_src_gen, _pdbx_entity_src_syn, _struct_ref, _struct_ref_seq, _struct_ref_seq_dif
With --merge_to_file, see Macromolecule merge safeguards. Use --log to record a MACROMOLECULE SAFEGUARDS section when checks run.
Contains citation-related categories:
_citation, _citation_author
Author categories are chosen from a profile mmCIF: the merge target when --merge_to_file is set, otherwise the input file. The profile must allow method detection via _exptl.method (and for EM, _database_2 as in detect_method_from_input). If that fails, the tool falls back to specs/AUTHORS.csv (all categories below).
| Profile | Author categories imported |
|---|---|
| Electron Microscopy, no _atom_site loop | _pdbx_contact_author, _em_author_list (AUTHORS_EM_MAP_ONLY.csv) |
| Electron Microscopy, with _atom_site | _audit_author, _pdbx_contact_author, _em_author_list (AUTHORS_EM_WITH_ATOM_SITE.csv) |
| All other methods (X-ray, NMR, etc.) | _audit_author, _pdbx_contact_author (AUTHORS_DEFAULT.csv) |
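The table above amounts to a three-way choice plus a fallback. The sketch below encodes it with plain values: authors_spec_name and EM_METHODS are hypothetical names, and the EM labels are assumed to be the three EM method codes used for method detection.

```python
from typing import Optional

# Assumed EM labels; the tool's own constant may hold different values.
EM_METHODS = {"EM_MAP_ONLY", "EM_MODEL_ONLY", "EM_MAP_MODEL"}


def authors_spec_name(method: Optional[str], has_atom_site: bool) -> str:
    """Pick the authors CSV file name for a profile file."""
    if method is None:
        # Method could not be detected: fall back to the full authors spec.
        return "AUTHORS.csv"
    if method in EM_METHODS:
        if has_atom_site:
            return "AUTHORS_EM_WITH_ATOM_SITE.csv"
        return "AUTHORS_EM_MAP_ONLY.csv"
    return "AUTHORS_DEFAULT.csv"
```

Note that the choice depends on whether an _atom_site loop is present, not on the EM sub-label itself.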
Python API: resolve_authors_spec_path(profile_mmCIF_path, spec_dir=None) returns the chosen Path to an authors CSV (same rules as the CLI). block_has_atom_site(block) and EM_METHOD_CODES are also defined in import_metadata.py for reuse.
Trying it locally: python dev/temp_test/author_demo/run_author_demos.py runs sample imports (including merge) and writes logs under dev/temp_test/author_demo/output/.
Contains funding-related categories:
_pdbx_audit_support
Contains keyword-related items:
_struct_keywords.text, _struct_keywords.pdbx_keywords, _struct_keywords.pdbx_details
All optional categories are merged with the method-specific specification file to provide comprehensive metadata information in the output.
Each CSV specification file should contain the following columns:
- category: The mmCIF category name (e.g., _pdbx_contact_author)
- item: The specific item name within the category (e.g., id, name_first). Leave empty for category-level specifications.
- should_import: Whether to include this category/item (Y for yes, N for no)
- type: Either category (for entire category) or item (for specific items)
category,item,should_import,type
_pdbx_contact_author,,Y,category
_citation,,Y,category
_struct_keywords,text,Y,item
_struct_keywords,pdbx_keywords,Y,item
_database_2,,N,category
_struct_keywords,entry_id,N,item
# Header row
category,item,should_import,type
# Include entire _pdbx_contact_author category (all items)
_pdbx_contact_author,,Y,category
# Include entire _citation category (all items)
_citation,,Y,category
# Include only specific items from _struct_keywords category
_struct_keywords,text,Y,item # Include _struct_keywords.text
_struct_keywords,pdbx_keywords,Y,item # Include _struct_keywords.pdbx_keywords
_struct_keywords,entry_id,N,item # Exclude _struct_keywords.entry_id
# Exclude entire _database_2 category (no items)
_database_2,,N,category
Key Points:
- Empty item column = entire category (use type=category)
- Filled item column = specific item (use type=item)
- Y = include this category/item
- N = exclude this category/item
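To make the column semantics concrete, here is a minimal sketch that reads a specification with Python's csv module and collects what is marked for import. load_spec is a hypothetical helper, not the tool's own parser; it assumes the file has no comment lines and simply ignores rows marked N.

```python
import csv
import io
from typing import Set, Tuple


def load_spec(csv_text: str) -> Tuple[Set[str], Set[Tuple[str, str]]]:
    """Return (whole categories, (category, item) pairs) marked Y."""
    categories, items = set(), set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["should_import"].strip().upper() != "Y":
            continue  # rows marked N are not imported
        if row["type"].strip() == "category":
            categories.add(row["category"])
        else:
            items.add((row["category"], row["item"]))
    return categories, items


EXAMPLE = """category,item,should_import,type
_pdbx_contact_author,,Y,category
_citation,,Y,category
_struct_keywords,text,Y,item
_struct_keywords,pdbx_keywords,Y,item
_database_2,,N,category
_struct_keywords,entry_id,N,item
"""

spec_categories, spec_items = load_spec(EXAMPLE)
```

With the example rows, two whole categories and two individual _struct_keywords items are selected, while _database_2 and _struct_keywords.entry_id are left out.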
The script creates a new mmCIF file containing only the specified categories and items from the input file. The output filename follows the pattern [input_name]_metadata.cif.
Output Format: The output file does not include a data_ block declaration line at the beginning. This allows the metadata content to be easily appended to the first data block of an existing mmCIF file. The file starts directly with the metadata categories and items.
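Because the metadata file has no data_ line, appending it to a first data block is a text operation. The sketch below shows one naive way to do it by hand: splice_into_first_block is a hypothetical helper, and it assumes that every line starting with data_ is a block header, which would be wrong for such a line inside a multi-line text field.

```python
def splice_into_first_block(target_text: str, metadata_text: str) -> str:
    """Insert metadata at the end of the first data block of target_text."""
    lines = target_text.splitlines(keepends=True)
    block_starts = [i for i, line in enumerate(lines)
                    if line.lower().startswith("data_")]
    # Insert before the second block header, or at the end if there is none.
    insert_at = block_starts[1] if len(block_starts) > 1 else len(lines)
    if insert_at > 0 and not lines[insert_at - 1].endswith("\n"):
        lines[insert_at - 1] += "\n"
    if metadata_text and not metadata_text.endswith("\n"):
        metadata_text += "\n"
    return "".join(lines[:insert_at]) + metadata_text + "".join(lines[insert_at:])


two_blocks = "data_A\n_a.x 1\ndata_B\n_b.y 2\n"
merged = splice_into_first_block(two_blocks, "_m.z 3\n")
```

For a target with two blocks, the metadata lands after the last line of the first block and before the second block header.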
When using the --log flag, a detailed log file is automatically generated with the same name as the output file but with a .log extension, placed in the same directory as the output file. For example:
- If output file is input_metadata.cif, the log file will be input_metadata.log
- If merge output is target_merged_with_input.cif, the log file will be target_merged_with_input.log (same directory as the merge output)
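The naming rule in the two examples above is a suffix swap on the output path. A one-function sketch, with log_path_for as a hypothetical name and assuming the output file ends in a single .cif extension:

```python
from pathlib import Path


def log_path_for(output_file: str) -> Path:
    """Return the log path next to the output file, with a .log extension."""
    return Path(output_file).with_suffix(".log")
```

A script that post-processes the log can use this to locate it from the output path alone.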
The log file contains:
- Requested Categories and Items: Lists all categories and items that were requested to be imported based on the specification files
- Skipped Specifications: Lists any specification files that were skipped (e.g., due to method mismatch) with the reason
- Imported Categories and Items: Lists all categories and items that were successfully imported
- Categories Not Found: Lists categories that were requested but not found in the input file
- Items Not Found: Lists items that were requested but not found in the input file
- Categories Not Imported (merge mode only, default behavior): Categories skipped because they already exist in the target file
- Items Not Imported (merge mode only, default behavior): Items skipped for the same reason
- Categories Overwritten / Items Overwritten (merge mode with --overwrite-existing): Tags removed from the target and replaced by the import
- Summary: Provides counts of requested vs imported categories/items, skipped specifications, categories/items not found, and (for merge mode) categories/items not imported or overwritten
- MACROMOLECULE SAFEGUARDS (when macromolecule checks ran): Pass/fail summary and structured failure details if macromolecule categories were skipped
This log file is useful for debugging and understanding what metadata was imported and what was skipped.
- Optional --overwrite-existing merge mode to replace conflicting metadata in the target file
- --authors picks AUTHORS_EM_MAP_ONLY, AUTHORS_EM_WITH_ATOM_SITE, AUTHORS_DEFAULT, or AUTHORS (fallback) from the profile mmCIF’s method and _atom_site presence
- Macromolecule merge safeguards when merging with --macromolecules; CLI exit code 2 when only macromolecule categories are skipped; import_metadata() returns ImportMetadataOutcome
- Supports both .cif and .cif.V[ordinal] input file extensions
- Processes only the first data block in multi-block mmCIF files
- Handles both single items and loop structures in mmCIF files
- Uses CSV format for easy specification management
- Provides detailed error messages for file reading/writing issues
- Optional log file generation for detailed import tracking
Deborah Harrus — Protein Data Bank in Europe (PDBe)
This project is licensed under the Apache License 2.0. See LICENSE for the full text.