Skip to content
/ mokume Public

Integrated library to perform feature normalization and multiple abundance computing based on qpx

License

Notifications You must be signed in to change notification settings

bigbio/mokume

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

24 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

mokume

Python application Upload Python Package PyPI version PyPI - Downloads

Why "mokume"?

The name comes from mokume-gane (木目金), a Japanese metalworking technique that fuses multiple metal layers into distinctive patterns - similar to how this library melds peptide intensities into unified protein expression profiles.

Overview

mokume is a comprehensive proteomics quantification library that supports multiple protein quantification methods including iBAQ, Top3, TopN, MaxLFQ, and DirectLFQ. It provides feature/peptide normalization, batch correction, and various summarization strategies for the quantms ecosystem. This library is an evolution of ibaqpy, extended to support a broader range of protein quantification methods beyond iBAQ.

Installation

pip install mokume

With optional extras:

# DirectLFQ support
pip install mokume[directlfq]

# Plotting support (for QC reports and visualizations)
pip install mokume[plotting]

# All optional dependencies
pip install mokume[all]

Or install from source:

git clone https://github.com/bigbio/mokume
cd mokume
pip install .

Using conda:

mamba env create -f environment.yaml
conda activate mokume
pip install .

Library Structure

mokume/
├── core/                    # Core utilities and constants
│   ├── constants.py         # Column names, mappings, utility functions
│   ├── logger.py            # Logging utilities
│   └── write_queue.py       # Async file writing
│
├── model/                   # Data models and enums
│   ├── labeling.py          # TMT/iTRAQ/LFQ labeling types
│   ├── normalization.py     # Normalization method enums
│   ├── organism.py          # Organism metadata (histones, genome size)
│   ├── quantification.py    # Quantification method enums
│   ├── summarization.py     # Summarization strategy enums
│   └── filters.py           # Filter configuration dataclasses
│
├── normalization/           # Normalization implementations
│   ├── feature.py           # Feature-level normalization
│   ├── peptide.py           # Peptide-level normalization pipeline
│   └── protein.py           # Protein-level normalization
│
├── preprocessing/           # Preprocessing filters
│   └── filters/             # Quality control filters
│       ├── base.py          # Base filter class
│       ├── intensity.py     # Intensity-based filters
│       ├── peptide.py       # Peptide-level filters
│       ├── protein.py       # Protein-level filters
│       ├── run_qc.py        # Run/sample QC filters
│       ├── pipeline.py      # Filter pipeline orchestration
│       ├── factory.py       # Filter factory functions
│       └── io.py            # YAML/JSON config loading
│
├── quantification/          # Protein quantification methods
│   ├── base.py              # Abstract base class
│   ├── ibaq.py              # iBAQ implementation
│   ├── top3.py              # Top3 quantification
│   ├── topn.py              # TopN quantification
│   ├── maxlfq.py            # MaxLFQ algorithm (parallelized)
│   ├── directlfq.py         # DirectLFQ wrapper (optional)
│   └── all_peptides.py      # Sum of all peptides
│
├── summarization/           # Intensity summarization strategies
│   ├── base.py              # Abstract base class
│   ├── median.py            # Median summarization
│   ├── mean.py              # Mean summarization
│   └── sum.py               # Sum summarization
│
├── imputation/              # Missing value handling
│   └── methods.py           # Imputation implementations
│
├── postprocessing/          # Data reshaping and correction
│   ├── reshape.py           # Pivot operations (wide/long format)
│   ├── batch_correction.py  # ComBat batch correction
│   └── combiner.py          # Multi-file combining
│
├── plotting/                # Visualization (optional: pip install mokume[plotting])
│   ├── distributions.py     # Distribution and box plots
│   └── pca.py               # PCA and t-SNE plots
│
├── io/                      # Input/Output utilities
│   ├── parquet.py           # Parquet/TSV reading, AnnData creation
│   └── fasta.py             # FASTA file handling
│
├── commands/                # CLI commands
│   ├── features2peptides.py # Feature to peptide conversion
│   ├── peptides2protein.py  # Protein quantification
│   ├── batch_correct.py     # Batch correction
│   └── visualize.py         # t-SNE visualization
│
└── data/                    # Static data resources
    ├── organisms.py         # Organism histone data
    └── organisms.json       # Organism registry

Quantification Methods

Method Description Requires FASTA Class Optional
iBAQ Intensity-Based Absolute Quantification Yes peptides_to_protein() No
Top3 Average of 3 most intense peptides No Top3Quantification No
TopN Average of N most intense peptides No TopNQuantification No
MaxLFQ Delayed normalization with parallelization No MaxLFQQuantification No*
DirectLFQ Intensity traces with hierarchical alignment No DirectLFQQuantification Yes**
Sum Sum of all peptide intensities No AllPeptidesQuantification No

*MaxLFQ automatically uses DirectLFQ when installed for best accuracy, falling back to built-in implementation otherwise.

**DirectLFQ requires optional install: pip install mokume[directlfq]

MaxLFQ Algorithm Details

The MaxLFQQuantification class provides two implementations:

  1. DirectLFQ backend (default when installed): Uses the DirectLFQ package for maximum accuracy with variance-guided pairwise alignment.

  2. Built-in fallback: A parallelized implementation using peptide trace alignment:

    • Aligns peptide intensity traces within each protein using median shifts
    • Aggregates aligned traces using median per sample
    • Scales results to preserve total peptide intensity
    • Achieves ~0.95 Spearman correlation with DIA-NN's MaxLFQ values

Use force_builtin=True to always use the built-in implementation, or check maxlfq.using_directlfq to see which backend is active.

CLI Usage

Peptides to Protein Quantification

# Using iBAQ (default) - requires FASTA
mokume peptides2protein --method ibaq \
    -f proteome.fasta \
    -p peptides.csv \
    -o proteins-ibaq.tsv

# Using Top3 - no FASTA required
mokume peptides2protein --method top3 \
    -p peptides.csv \
    -o proteins-top3.tsv

# Using TopN with N=5
mokume peptides2protein --method topn --topn_n 5 \
    -p peptides.csv \
    -o proteins-top5.tsv

# Using MaxLFQ with parallelization
# Automatically uses DirectLFQ backend if installed, otherwise built-in
mokume peptides2protein --method maxlfq \
    --threads 4 \
    -p peptides.csv \
    -o proteins-maxlfq.tsv

# Using DirectLFQ (requires: pip install mokume[directlfq])
mokume peptides2protein --method directlfq \
    -p peptides.csv \
    -o proteins-directlfq.tsv

# Using Sum of all peptides
mokume peptides2protein --method sum \
    -p peptides.csv \
    -o proteins-sum.tsv

Full iBAQ with TPA and ProteomicRuler

mokume peptides2protein \
    -f proteome.fasta \
    -p peptides.csv \
    -e Trypsin \
    --normalize \
    --tpa \
    --ruler \
    --ploidy 2 \
    --cpc 200 \
    --organism human \
    --output proteins-ibaq.tsv \
    --verbose \
    --qc_report QC.pdf

Features to Peptides

mokume features2peptides \
    -p features.parquet \
    -s experiment.sdrf.tsv \
    --remove_decoy_contaminants \
    --remove_low_frequency_peptides \
    --nmethod median \
    --pnmethod globalMedian \
    --output peptides-norm.csv

Preprocessing Filters

mokume supports comprehensive preprocessing filters via YAML/JSON configuration files or CLI options.

Generate example filter configuration:

mokume features2peptides --generate-filter-config filters.yaml

Use filter configuration file:

mokume features2peptides \
    -p features.parquet \
    -s experiment.sdrf.tsv \
    --filter-config filters.yaml \
    --output peptides-filtered.csv

CLI filter overrides (take precedence over config file):

mokume features2peptides \
    -p features.parquet \
    -s experiment.sdrf.tsv \
    --filter-config filters.yaml \
    --filter-min-intensity 1000 \
    --filter-cv-threshold 0.3 \
    --filter-charge-states "2,3,4" \
    --filter-max-missed-cleavages 2 \
    --output peptides-filtered.csv

CLI-only filtering (no config file):

mokume features2peptides \
    -p features.parquet \
    -s experiment.sdrf.tsv \
    --filter-min-intensity 500 \
    --filter-min-unique-peptides 2 \
    --filter-max-missing-rate 0.5 \
    --output peptides-filtered.csv

Batch Correction

mokume correct-batches \
    -f ibaq_folder/ \
    -p "*ibaq.tsv" \
    -o corrected_ibaq.tsv \
    --export_anndata

t-SNE Visualization

mokume tsne-visualization \
    -f protein_folder/ \
    -o proteins.tsv

Python API

Quantification Methods

import pandas as pd
from mokume.quantification import (
    Top3Quantification,
    TopNQuantification,
    MaxLFQQuantification,
    AllPeptidesQuantification,
    get_quantification_method,
    is_directlfq_available,
    peptides_to_protein,  # iBAQ function
)

# Load peptide data
peptides = pd.read_csv("peptides.csv")

# --- Top3 Quantification ---
top3 = Top3Quantification()
result = top3.quantify(
    peptides,
    protein_column="ProteinName",
    peptide_column="PeptideSequence",
    intensity_column="NormIntensity",
    sample_column="SampleID",
)

# --- TopN Quantification (configurable N) ---
topn = TopNQuantification(n=5)
result = topn.quantify(peptides, protein_column="ProteinName", ...)

# --- MaxLFQ Quantification ---
# Automatically uses DirectLFQ if installed, otherwise falls back to built-in
maxlfq = MaxLFQQuantification(
    min_peptides=2,         # Minimum peptides required for MaxLFQ (uses median for fewer)
    threads=4,              # Use 4 parallel cores (-1 for all cores)
)
result = maxlfq.quantify(peptides, protein_column="ProteinName", ...)

# Check which implementation is being used
print(f"Using DirectLFQ: {maxlfq.using_directlfq}")
print(f"Implementation: {maxlfq.name}")  # "MaxLFQ (DirectLFQ)" or "MaxLFQ (built-in)"

# For best accuracy, install DirectLFQ: pip install mokume[directlfq]

# Force built-in implementation (for testing/comparison)
maxlfq_builtin = MaxLFQQuantification(min_peptides=2, force_builtin=True)

# Run-level quantification (uses built-in implementation)
result = maxlfq.quantify(
    peptides,
    protein_column="ProteinName",
    sample_column="SampleID",
    run_column="Run",  # Optional: quantify at run level instead of sample level
)

# --- DirectLFQ Quantification (standalone, optional dependency) ---
if is_directlfq_available():
    from mokume.quantification import DirectLFQQuantification
    directlfq = DirectLFQQuantification(min_nonan=2)
    result = directlfq.quantify(peptides, protein_column="ProteinName", ...)

# --- Sum of All Peptides ---
sum_quant = AllPeptidesQuantification()
result = sum_quant.quantify(peptides, protein_column="ProteinName", ...)

# --- Factory Function ---
method = get_quantification_method("maxlfq", min_peptides=2, threads=-1)
result = method.quantify(peptides, ...)

# --- Check available methods ---
from mokume.quantification import list_quantification_methods
print(list_quantification_methods())
# {'top3': True, 'topn': True, 'maxlfq': True, 'directlfq': False, 'sum': True}

# --- iBAQ with Full Pipeline ---
peptides_to_protein(
    fasta="proteome.fasta",
    peptides="peptides.csv",
    enzyme="Trypsin",
    normalize=True,
    tpa=True,
    ruler=True,
    ploidy=2,
    cpc=200,
    organism="human",
    output="proteins-ibaq.tsv",
    min_aa=7,
    max_aa=30,
    verbose=True,
    qc_report="QC.pdf",
)

Normalization

from mokume.normalization.peptide import peptide_normalization

# Full peptide normalization pipeline
peptide_normalization(
    parquet="features.parquet",
    sdrf="experiment.sdrf.tsv",
    min_aa=7,
    min_unique=2,
    remove_ids=None,
    remove_decoy_contaminants=True,
    remove_low_frequency_peptides=True,
    output="peptides-norm.csv",
    skip_normalization=False,
    nmethod="median",      # Feature normalization: mean, median, iqr, none
    pnmethod="globalMedian",  # Peptide normalization: globalMedian, conditionMedian, none
    log2=True,
    save_parquet=False,
)

Preprocessing Filters

from mokume.preprocessing.filters import (
    load_filter_config,
    save_filter_config,
    generate_example_config,
    get_filter_pipeline,
    FilterPipeline,
)
from mokume.model.filters import PreprocessingFilterConfig

# Generate example configuration file
generate_example_config("filters.yaml")

# Load configuration from file
config = load_filter_config("filters.yaml")

# Create configuration programmatically
config = PreprocessingFilterConfig(
    name="custom_filters",
    enabled=True,
    log_filtered_counts=True,
)
config.intensity.min_intensity = 1000.0
config.intensity.cv_threshold = 0.3
config.peptide.allowed_charge_states = [2, 3, 4]
config.peptide.exclude_modifications = ["Oxidation"]
config.protein.min_unique_peptides = 2
config.run_qc.max_missing_rate = 0.5

# Apply CLI-style overrides
config.apply_overrides({
    "min_intensity": 500,
    "charge_states": [2, 3],
    "max_missing_rate": 0.3,
})

# Save configuration
save_filter_config(config, "my_filters.yaml")

# Create and use filter pipeline directly
pipeline = get_filter_pipeline(config)

import pandas as pd
df = pd.read_csv("peptides.csv")
filtered_df, results = pipeline.apply(df)

# Check filter results
for result in results:
    print(f"{result.filter_name}: removed {result.removed_count} ({result.removal_rate:.1%})")

# Get pipeline summary
summary = pipeline.summary(results)
print(f"Total removed: {summary['total_removed']} / {summary['total_input']}")

# Use filters with peptide_normalization
from mokume.normalization.peptide import peptide_normalization

peptide_normalization(
    parquet="features.parquet",
    sdrf="experiment.sdrf.tsv",
    output="peptides-filtered.csv",
    nmethod="median",
    pnmethod="globalMedian",
    filter_config=config,  # Pass filter configuration
)

Batch Correction

from mokume.io.parquet import combine_ibaq_tsv_files
from mokume.postprocessing.reshape import pivot_wider, pivot_longer
from mokume.postprocessing.batch_correction import apply_batch_correction

# Load and combine multiple TSV files
df = combine_ibaq_tsv_files("data/", pattern="*ibaq.tsv", sep="\t")

# Reshape to wide format (proteins x samples)
df_wide = pivot_wider(
    df,
    row_name="ProteinName",
    col_name="SampleID",
    values="Ibaq",
    fillna=True
)

# Extract batch IDs from sample names
import pandas as pd
batch_ids = [name.split("-")[0] for name in df_wide.columns]
batch_ids = pd.factorize(batch_ids)[0]

# Apply ComBat batch correction
df_corrected = apply_batch_correction(df_wide, list(batch_ids), kwargs={})

# Reshape back to long format
df_long = pivot_longer(
    df_corrected,
    row_name="ProteinName",
    col_name="SampleID",
    values="IbaqCorrected"
)

Data Reshaping

from mokume.postprocessing.reshape import (
    pivot_wider,
    pivot_longer,
    remove_samples_low_protein_number,
    remove_missing_values,
    describe_expression_metrics,
)

# Long to wide format
df_wide = pivot_wider(df, row_name="ProteinName", col_name="SampleID", values="Ibaq")

# Wide to long format
df_long = pivot_longer(df_wide, row_name="ProteinName", col_name="SampleID", values="Ibaq")

# Quality filtering
df_filtered = remove_samples_low_protein_number(df, min_protein_num=100)
df_filtered = remove_missing_values(df, missingness_percentage=20, expression_column="Ibaq")

# Get expression statistics
metrics = describe_expression_metrics(df)

Creating AnnData Objects

from mokume.io.parquet import create_anndata

# Create AnnData from long-format DataFrame
adata = create_anndata(
    df,
    obs_col="SampleID",           # Observation (sample) column
    var_col="ProteinName",        # Variable (protein) column
    value_col="Ibaq",             # Main data values
    layer_cols=["IbaqNorm", "IbaqLog", "IbaqBec"],  # Additional layers
    obs_metadata_cols=["Condition"],  # Sample metadata
    var_metadata_cols=["GeneName"],   # Protein metadata
)

# Save to h5ad
adata.write("proteins.h5ad")

FASTA Handling

from mokume.io.fasta import (
    load_fasta,
    digest_protein,
    extract_fasta,
    get_protein_molecular_weights,
)

# Load FASTA file
proteins = load_fasta("proteome.fasta")

# Digest a single protein sequence
peptides = digest_protein(
    sequence="MKWVTFISLLFLFSSAYS...",
    enzyme="Trypsin",
    min_aa=7,
    max_aa=30,
)

# Extract info for specific proteins
unique_peptide_counts, mw_dict, found_proteins = extract_fasta(
    fasta="proteome.fasta",
    enzyme="Trypsin",
    proteins=["P12345", "P67890"],
    min_aa=7,
    max_aa=30,
    tpa=True,
)

# Get molecular weights
mw_dict = get_protein_molecular_weights("proteome.fasta", ["P12345", "P67890"])

Organism Metadata

from mokume.model.organism import OrganismDescription

# Get available organisms
organisms = OrganismDescription.registered_organisms()
# ['human', 'mouse', 'yeast', 'drome', 'caeel', 'schpo']

# Get organism description
human = OrganismDescription.get("human")
print(human.genome_size)      # Genome size in base pairs
print(human.histone_entries)  # List of histone protein accessions

Computed Values Reference

Column Description Method
Ibaq Total intensity / theoretical peptides iBAQ
IbaqNorm ibaq / sum(ibaq) per sample iBAQ
IbaqLog 10 + log10(IbaqNorm) iBAQ
IbaqPpb IbaqNorm * 100,000,000 iBAQ
IbaqBec Batch effect corrected iBAQ + ComBat
TPA NormIntensity / MolecularWeight iBAQ
CopyNumber Protein copies per cell ProteomicRuler
Concentration[nM] Protein concentration ProteomicRuler
Top3Intensity Average of top 3 peptides Top3
Top{N}Intensity Average of top N peptides TopN
MaxLFQIntensity MaxLFQ algorithm result MaxLFQ
DirectLFQIntensity DirectLFQ intensity traces DirectLFQ
SumIntensity Sum of all peptides Sum

Normalization Methods

Feature-Level Normalization (--nmethod)

Method Description
median Normalize by median across MS runs
mean Normalize by mean across MS runs
iqr Normalize by interquartile range
none Skip feature normalization

Peptide-Level Normalization (--pnmethod)

Method Description
globalMedian Adjust all samples to global median
conditionMedian Adjust samples within each condition to median
none Skip peptide normalization

Preprocessing Filters

mokume provides a comprehensive filter system for quality control. Filters can be configured via YAML/JSON files or CLI options.

Filter Categories

Intensity Filters

Filter Parameter Default Description
MinIntensityFilter min_intensity 0.0 Remove features below threshold
CVThresholdFilter cv_threshold null Max CV across replicates
ReplicateAgreementFilter min_replicate_agreement 1 Min replicates with detection
QuantileFilter quantile_lower/upper 0.0/1.0 Remove intensity outliers

Peptide Filters

Filter Parameter Default Description
PeptideLengthFilter min/max_peptide_length 7/50 Peptide length range
ChargeStateFilter allowed_charge_states null Allowed charges (e.g., [2,3,4])
ModificationFilter exclude_modifications [] Remove specific modifications
MissedCleavageFilter max_missed_cleavages null Max missed cleavages
SearchScoreFilter min_search_score null Min search engine score
SequencePatternFilter exclude_sequence_patterns [] Regex patterns to exclude

Protein Filters

Filter Parameter Default Description
ContaminantFilter remove_contaminants/decoys true Remove contaminants/decoys
MinPeptideFilter min_unique_peptides 2 Min unique peptides per protein
ProteinFDRFilter fdr_threshold 0.01 Protein-level FDR
CoverageFilter min_coverage 0.0 Min sequence coverage
RazorPeptideFilter razor_peptide_handling "keep" Handle shared peptides

Run/Sample QC Filters

Filter Parameter Default Description
RunIntensityFilter min_total_intensity 0.0 Min total intensity per run
MinFeaturesFilter min_identified_features 0 Min features per run
MissingRateFilter max_missing_rate 1.0 Max missing value rate
SampleCorrelationFilter min_sample_correlation null Min replicate correlation

Example Filter Configurations

We provide several pre-configured filter templates for common use cases in tests/example/filters/:

Configuration Use Case Description
basic_qc.yaml General QC Minimal filtering for standard experiments
stringent_filtering.yaml Publication High-confidence results with strict thresholds
tmt_labeling.yaml TMT/iTRAQ Optimized for multiplexed labeling experiments
dia_analysis.yaml DIA Optimized for DIA-NN, Spectronaut analysis
exploratory_analysis.yaml Exploration Minimal filtering for data exploration

Example: Basic QC Configuration

# basic_qc.yaml - Minimal filtering for standard experiments
name: basic_qc
enabled: true

intensity:
  remove_zero_intensity: true

peptide:
  min_peptide_length: 7
  max_peptide_length: 50

protein:
  min_unique_peptides: 2
  remove_contaminants: true
  remove_decoys: true
  contaminant_patterns:
    - CONTAMINANT
    - ENTRAP
    - DECOY

Use these configurations directly:

mokume features2peptides \
    -p features.parquet \
    --filter-config tests/example/filters/basic_qc.yaml \
    -o peptides.csv

Data Processing Pipeline

Feature Processing (features2peptides)

  1. Parse protein identifiers and retain unique peptides
  2. Remove entries with empty intensity or condition
  3. Filter peptides by minimum amino acids
  4. Remove low-confidence proteins (< min_unique peptides)
  5. Optionally remove decoys, contaminants, and specified proteins
  6. Normalize at feature level between MS runs
  7. Merge peptidoforms across fractions and technical replicates
  8. Normalize at sample level
  9. Remove low-frequency peptides
  10. Assemble peptidoforms to peptides
  11. Optional log2 transformation

iBAQ Calculation

  1. Load peptide intensity data
  2. Extract protein info from FASTA (theoretical peptide counts, MW)
  3. Group peptide intensities by protein, sample, and condition
  4. Sum protein intensities within each group
  5. Normalize by detected peptide count
  6. Divide by theoretical peptide count
  7. Optional: Calculate TPA, copy number, concentration

Citation

Zheng P, Audain E, Webel H, Dai C, Klein J, Hitz MP, Sachsenberg T, Bai M, Perez-Riverol Y. Ibaqpy: A scalable Python package for baseline quantification in proteomics leveraging SDRF metadata. J Proteomics. 2025;317:105440. doi: 10.1016/j.jprot.2025.105440.

Wang H, Dai C, Pfeuffer J, Sachsenberg T, Sanchez A, Bai M, Perez-Riverol Y. Tissue-based absolute quantification using large-scale TMT and LFQ experiments. Proteomics. 2023;23(20):e2300188. doi: 10.1002/pmic.202300188.

License

MIT License - see LICENSE for details.

Credits

About

Integrated library to perform feature normalization and multiple abundance computing based on qpx

Resources

License

Stars

Watchers

Forks

Packages

No packages published