The name comes from mokume-gane (木目金), a Japanese metalworking technique that fuses multiple metal layers into distinctive patterns - similar to how this library melds peptide intensities into unified protein expression profiles.
mokume is a comprehensive proteomics quantification library that supports multiple protein quantification methods including iBAQ, Top3, TopN, MaxLFQ, and DirectLFQ. It provides feature/peptide normalization, batch correction, and various summarization strategies for the quantms ecosystem. This library is an evolution of ibaqpy, extended to support a broader range of protein quantification methods beyond iBAQ.
pip install mokume

With optional extras:
# DirectLFQ support
pip install mokume[directlfq]
# Plotting support (for QC reports and visualizations)
pip install mokume[plotting]
# All optional dependencies
pip install mokume[all]

Or install from source:
git clone https://github.com/bigbio/mokume
cd mokume
pip install .

Using conda:
mamba env create -f environment.yaml
conda activate mokume
pip install .

mokume/
├── core/ # Core utilities and constants
│ ├── constants.py # Column names, mappings, utility functions
│ ├── logger.py # Logging utilities
│ └── write_queue.py # Async file writing
│
├── model/ # Data models and enums
│ ├── labeling.py # TMT/iTRAQ/LFQ labeling types
│ ├── normalization.py # Normalization method enums
│ ├── organism.py # Organism metadata (histones, genome size)
│ ├── quantification.py # Quantification method enums
│ ├── summarization.py # Summarization strategy enums
│ └── filters.py # Filter configuration dataclasses
│
├── normalization/ # Normalization implementations
│ ├── feature.py # Feature-level normalization
│ ├── peptide.py # Peptide-level normalization pipeline
│ └── protein.py # Protein-level normalization
│
├── preprocessing/ # Preprocessing filters
│ └── filters/ # Quality control filters
│ ├── base.py # Base filter class
│ ├── intensity.py # Intensity-based filters
│ ├── peptide.py # Peptide-level filters
│ ├── protein.py # Protein-level filters
│ ├── run_qc.py # Run/sample QC filters
│ ├── pipeline.py # Filter pipeline orchestration
│ ├── factory.py # Filter factory functions
│ └── io.py # YAML/JSON config loading
│
├── quantification/ # Protein quantification methods
│ ├── base.py # Abstract base class
│ ├── ibaq.py # iBAQ implementation
│ ├── top3.py # Top3 quantification
│ ├── topn.py # TopN quantification
│ ├── maxlfq.py # MaxLFQ algorithm (parallelized)
│ ├── directlfq.py # DirectLFQ wrapper (optional)
│ └── all_peptides.py # Sum of all peptides
│
├── summarization/ # Intensity summarization strategies
│ ├── base.py # Abstract base class
│ ├── median.py # Median summarization
│ ├── mean.py # Mean summarization
│ └── sum.py # Sum summarization
│
├── imputation/ # Missing value handling
│ └── methods.py # Imputation implementations
│
├── postprocessing/ # Data reshaping and correction
│ ├── reshape.py # Pivot operations (wide/long format)
│ ├── batch_correction.py # ComBat batch correction
│ └── combiner.py # Multi-file combining
│
├── plotting/ # Visualization (optional: pip install mokume[plotting])
│ ├── distributions.py # Distribution and box plots
│ └── pca.py # PCA and t-SNE plots
│
├── io/ # Input/Output utilities
│ ├── parquet.py # Parquet/TSV reading, AnnData creation
│ └── fasta.py # FASTA file handling
│
├── commands/ # CLI commands
│ ├── features2peptides.py # Feature to peptide conversion
│ ├── peptides2protein.py # Protein quantification
│ ├── batch_correct.py # Batch correction
│ └── visualize.py # t-SNE visualization
│
└── data/ # Static data resources
├── organisms.py # Organism histone data
└── organisms.json # Organism registry

| Method | Description | Requires FASTA | Class | Optional |
|---|---|---|---|---|
| iBAQ | Intensity-Based Absolute Quantification | Yes | `peptides_to_protein()` | No |
| Top3 | Average of 3 most intense peptides | No | `Top3Quantification` | No |
| TopN | Average of N most intense peptides | No | `TopNQuantification` | No |
| MaxLFQ | Delayed normalization with parallelization | No | `MaxLFQQuantification` | No* |
| DirectLFQ | Intensity traces with hierarchical alignment | No | `DirectLFQQuantification` | Yes** |
| Sum | Sum of all peptide intensities | No | `AllPeptidesQuantification` | No |
*MaxLFQ automatically uses DirectLFQ when installed for best accuracy, falling back to the built-in implementation otherwise.
**DirectLFQ requires optional install: pip install mokume[directlfq]
The MaxLFQQuantification class provides two implementations:

- DirectLFQ backend (default when installed): uses the DirectLFQ package for maximum accuracy with variance-guided pairwise alignment.
- Built-in fallback: a parallelized implementation using peptide trace alignment that:
  - Aligns peptide intensity traces within each protein using median shifts
  - Aggregates aligned traces using the median per sample
  - Scales results to preserve total peptide intensity
  - Achieves ~0.95 Spearman correlation with DIA-NN's MaxLFQ values

Use force_builtin=True to always use the built-in implementation, or check maxlfq.using_directlfq to see which backend is active.
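For intuition, the built-in fallback can be sketched with pandas as below. This is a simplified illustration of the alignment idea, not mokume's actual MaxLFQQuantification code; the column names follow the Python examples later in this README.

```python
# Rough sketch of the median-shift trace alignment described above
# (illustrative only; not the library's implementation).
import numpy as np
import pandas as pd

def align_and_summarize(protein_df: pd.DataFrame) -> pd.Series:
    # peptide x sample matrix of log2 intensities for one protein
    mat = np.log2(
        protein_df.pivot_table(
            index="PeptideSequence",
            columns="SampleID",
            values="NormIntensity",
            aggfunc="median",
        )
    )
    # shift each peptide trace so its median matches the protein-wide median
    shifts = mat.median(axis=1) - mat.stack().median()
    aligned = mat.sub(shifts, axis=0)
    # aggregate aligned traces per sample, back on the linear scale
    profile = np.power(2.0, aligned.median(axis=0))
    # rescale so the protein profile preserves the total peptide intensity
    return profile * (protein_df["NormIntensity"].sum() / profile.sum())

# applied per protein, e.g.:
# profiles = peptides.groupby("ProteinName").apply(align_and_summarize)
```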
# Using iBAQ (default) - requires FASTA
mokume peptides2protein --method ibaq \
-f proteome.fasta \
-p peptides.csv \
-o proteins-ibaq.tsv
# Using Top3 - no FASTA required
mokume peptides2protein --method top3 \
-p peptides.csv \
-o proteins-top3.tsv
# Using TopN with N=5
mokume peptides2protein --method topn --topn_n 5 \
-p peptides.csv \
-o proteins-top5.tsv
# Using MaxLFQ with parallelization
# Automatically uses DirectLFQ backend if installed, otherwise built-in
mokume peptides2protein --method maxlfq \
--threads 4 \
-p peptides.csv \
-o proteins-maxlfq.tsv
# Using DirectLFQ (requires: pip install mokume[directlfq])
mokume peptides2protein --method directlfq \
-p peptides.csv \
-o proteins-directlfq.tsv
# Using Sum of all peptides
mokume peptides2protein --method sum \
-p peptides.csv \
-o proteins-sum.tsv

mokume peptides2protein \
-f proteome.fasta \
-p peptides.csv \
-e Trypsin \
--normalize \
--tpa \
--ruler \
--ploidy 2 \
--cpc 200 \
--organism human \
--output proteins-ibaq.tsv \
--verbose \
--qc_report QC.pdf

mokume features2peptides \
-p features.parquet \
-s experiment.sdrf.tsv \
--remove_decoy_contaminants \
--remove_low_frequency_peptides \
--nmethod median \
--pnmethod globalMedian \
--output peptides-norm.csv

mokume supports comprehensive preprocessing filters via YAML/JSON configuration files or CLI options.
Generate example filter configuration:
mokume features2peptides --generate-filter-config filters.yaml

Use a filter configuration file:
mokume features2peptides \
-p features.parquet \
-s experiment.sdrf.tsv \
--filter-config filters.yaml \
--output peptides-filtered.csv

CLI filter overrides (take precedence over config file):
mokume features2peptides \
-p features.parquet \
-s experiment.sdrf.tsv \
--filter-config filters.yaml \
--filter-min-intensity 1000 \
--filter-cv-threshold 0.3 \
--filter-charge-states "2,3,4" \
--filter-max-missed-cleavages 2 \
--output peptides-filtered.csv

CLI-only filtering (no config file):
mokume features2peptides \
-p features.parquet \
-s experiment.sdrf.tsv \
--filter-min-intensity 500 \
--filter-min-unique-peptides 2 \
--filter-max-missing-rate 0.5 \
--output peptides-filtered.csv

mokume correct-batches \
-f ibaq_folder/ \
-p "*ibaq.tsv" \
-o corrected_ibaq.tsv \
--export_anndata

mokume tsne-visualization \
-f protein_folder/ \
-o proteins.tsv

import pandas as pd
from mokume.quantification import (
Top3Quantification,
TopNQuantification,
MaxLFQQuantification,
AllPeptidesQuantification,
get_quantification_method,
is_directlfq_available,
peptides_to_protein, # iBAQ function
)
# Load peptide data
peptides = pd.read_csv("peptides.csv")
# --- Top3 Quantification ---
top3 = Top3Quantification()
result = top3.quantify(
peptides,
protein_column="ProteinName",
peptide_column="PeptideSequence",
intensity_column="NormIntensity",
sample_column="SampleID",
)
# --- TopN Quantification (configurable N) ---
topn = TopNQuantification(n=5)
result = topn.quantify(peptides, protein_column="ProteinName", ...)
# --- MaxLFQ Quantification ---
# Automatically uses DirectLFQ if installed, otherwise falls back to built-in
maxlfq = MaxLFQQuantification(
min_peptides=2, # Minimum peptides required for MaxLFQ (uses median for fewer)
threads=4, # Use 4 parallel cores (-1 for all cores)
)
result = maxlfq.quantify(peptides, protein_column="ProteinName", ...)
# Check which implementation is being used
print(f"Using DirectLFQ: {maxlfq.using_directlfq}")
print(f"Implementation: {maxlfq.name}") # "MaxLFQ (DirectLFQ)" or "MaxLFQ (built-in)"
# For best accuracy, install DirectLFQ: pip install mokume[directlfq]
# Force built-in implementation (for testing/comparison)
maxlfq_builtin = MaxLFQQuantification(min_peptides=2, force_builtin=True)
# Run-level quantification (uses built-in implementation)
result = maxlfq.quantify(
peptides,
protein_column="ProteinName",
sample_column="SampleID",
run_column="Run", # Optional: quantify at run level instead of sample level
)
# --- DirectLFQ Quantification (standalone, optional dependency) ---
if is_directlfq_available():
from mokume.quantification import DirectLFQQuantification
directlfq = DirectLFQQuantification(min_nonan=2)
result = directlfq.quantify(peptides, protein_column="ProteinName", ...)
# --- Sum of All Peptides ---
sum_quant = AllPeptidesQuantification()
result = sum_quant.quantify(peptides, protein_column="ProteinName", ...)
# --- Factory Function ---
method = get_quantification_method("maxlfq", min_peptides=2, threads=-1)
result = method.quantify(peptides, ...)
# --- Check available methods ---
from mokume.quantification import list_quantification_methods
print(list_quantification_methods())
# {'top3': True, 'topn': True, 'maxlfq': True, 'directlfq': False, 'sum': True}
# --- iBAQ with Full Pipeline ---
peptides_to_protein(
fasta="proteome.fasta",
peptides="peptides.csv",
enzyme="Trypsin",
normalize=True,
tpa=True,
ruler=True,
ploidy=2,
cpc=200,
organism="human",
output="proteins-ibaq.tsv",
min_aa=7,
max_aa=30,
verbose=True,
qc_report="QC.pdf",
)

from mokume.normalization.peptide import peptide_normalization
# Full peptide normalization pipeline
peptide_normalization(
parquet="features.parquet",
sdrf="experiment.sdrf.tsv",
min_aa=7,
min_unique=2,
remove_ids=None,
remove_decoy_contaminants=True,
remove_low_frequency_peptides=True,
output="peptides-norm.csv",
skip_normalization=False,
nmethod="median", # Feature normalization: mean, median, iqr, none
pnmethod="globalMedian", # Peptide normalization: globalMedian, conditionMedian, none
log2=True,
save_parquet=False,
)

from mokume.preprocessing.filters import (
load_filter_config,
save_filter_config,
generate_example_config,
get_filter_pipeline,
FilterPipeline,
)
from mokume.model.filters import PreprocessingFilterConfig
# Generate example configuration file
generate_example_config("filters.yaml")
# Load configuration from file
config = load_filter_config("filters.yaml")
# Create configuration programmatically
config = PreprocessingFilterConfig(
name="custom_filters",
enabled=True,
log_filtered_counts=True,
)
config.intensity.min_intensity = 1000.0
config.intensity.cv_threshold = 0.3
config.peptide.allowed_charge_states = [2, 3, 4]
config.peptide.exclude_modifications = ["Oxidation"]
config.protein.min_unique_peptides = 2
config.run_qc.max_missing_rate = 0.5
# Apply CLI-style overrides
config.apply_overrides({
"min_intensity": 500,
"charge_states": [2, 3],
"max_missing_rate": 0.3,
})
# Save configuration
save_filter_config(config, "my_filters.yaml")
# Create and use filter pipeline directly
pipeline = get_filter_pipeline(config)
import pandas as pd
df = pd.read_csv("peptides.csv")
filtered_df, results = pipeline.apply(df)
# Check filter results
for result in results:
print(f"{result.filter_name}: removed {result.removed_count} ({result.removal_rate:.1%})")
# Get pipeline summary
summary = pipeline.summary(results)
print(f"Total removed: {summary['total_removed']} / {summary['total_input']}")
# Use filters with peptide_normalization
from mokume.normalization.peptide import peptide_normalization
peptide_normalization(
parquet="features.parquet",
sdrf="experiment.sdrf.tsv",
output="peptides-filtered.csv",
nmethod="median",
pnmethod="globalMedian",
filter_config=config, # Pass filter configuration
)

from mokume.io.parquet import combine_ibaq_tsv_files
from mokume.postprocessing.reshape import pivot_wider, pivot_longer
from mokume.postprocessing.batch_correction import apply_batch_correction
# Load and combine multiple TSV files
df = combine_ibaq_tsv_files("data/", pattern="*ibaq.tsv", sep="\t")
# Reshape to wide format (proteins x samples)
df_wide = pivot_wider(
df,
row_name="ProteinName",
col_name="SampleID",
values="Ibaq",
fillna=True
)
# Extract batch IDs from sample names
import pandas as pd
batch_ids = [name.split("-")[0] for name in df_wide.columns]
batch_ids = pd.factorize(batch_ids)[0]
# Apply ComBat batch correction
df_corrected = apply_batch_correction(df_wide, list(batch_ids), kwargs={})
# Reshape back to long format
df_long = pivot_longer(
df_corrected,
row_name="ProteinName",
col_name="SampleID",
values="IbaqCorrected"
)

from mokume.postprocessing.reshape import (
pivot_wider,
pivot_longer,
remove_samples_low_protein_number,
remove_missing_values,
describe_expression_metrics,
)
# Long to wide format
df_wide = pivot_wider(df, row_name="ProteinName", col_name="SampleID", values="Ibaq")
# Wide to long format
df_long = pivot_longer(df_wide, row_name="ProteinName", col_name="SampleID", values="Ibaq")
# Quality filtering
df_filtered = remove_samples_low_protein_number(df, min_protein_num=100)
df_filtered = remove_missing_values(df, missingness_percentage=20, expression_column="Ibaq")
# Get expression statistics
metrics = describe_expression_metrics(df)

from mokume.io.parquet import create_anndata
# Create AnnData from long-format DataFrame
adata = create_anndata(
df,
obs_col="SampleID", # Observation (sample) column
var_col="ProteinName", # Variable (protein) column
value_col="Ibaq", # Main data values
layer_cols=["IbaqNorm", "IbaqLog", "IbaqBec"], # Additional layers
obs_metadata_cols=["Condition"], # Sample metadata
var_metadata_cols=["GeneName"], # Protein metadata
)
# Save to h5ad
adata.write("proteins.h5ad")

from mokume.io.fasta import (
load_fasta,
digest_protein,
extract_fasta,
get_protein_molecular_weights,
)
# Load FASTA file
proteins = load_fasta("proteome.fasta")
# Digest a single protein sequence
peptides = digest_protein(
sequence="MKWVTFISLLFLFSSAYS...",
enzyme="Trypsin",
min_aa=7,
max_aa=30,
)
# Extract info for specific proteins
unique_peptide_counts, mw_dict, found_proteins = extract_fasta(
fasta="proteome.fasta",
enzyme="Trypsin",
proteins=["P12345", "P67890"],
min_aa=7,
max_aa=30,
tpa=True,
)
# Get molecular weights
mw_dict = get_protein_molecular_weights("proteome.fasta", ["P12345", "P67890"])

from mokume.model.organism import OrganismDescription
# Get available organisms
organisms = OrganismDescription.registered_organisms()
# ['human', 'mouse', 'yeast', 'drome', 'caeel', 'schpo']
# Get organism description
human = OrganismDescription.get("human")
print(human.genome_size) # Genome size in base pairs
print(human.histone_entries) # List of histone protein accessions

| Column | Description | Method |
|---|---|---|
| `Ibaq` | Total intensity / theoretical peptides | iBAQ |
| `IbaqNorm` | `ibaq / sum(ibaq)` per sample | iBAQ |
| `IbaqLog` | `10 + log10(IbaqNorm)` | iBAQ |
| `IbaqPpb` | `IbaqNorm * 100,000,000` | iBAQ |
| `IbaqBec` | Batch effect corrected | iBAQ + ComBat |
| `TPA` | `NormIntensity / MolecularWeight` | iBAQ |
| `CopyNumber` | Protein copies per cell | ProteomicRuler |
| `Concentration[nM]` | Protein concentration | ProteomicRuler |
| `Top3Intensity` | Average of top 3 peptides | Top3 |
| `Top{N}Intensity` | Average of top N peptides | TopN |
| `MaxLFQIntensity` | MaxLFQ algorithm result | MaxLFQ |
| `DirectLFQIntensity` | DirectLFQ intensity traces | DirectLFQ |
| `SumIntensity` | Sum of all peptides | Sum |
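The iBAQ-derived columns follow directly from the formulas above. As an illustration (assuming a long-format table with Ibaq and SampleID columns, as produced by the iBAQ command), they could be recomputed roughly like this:

```python
import numpy as np
import pandas as pd

proteins = pd.read_csv("proteins-ibaq.tsv", sep="\t")

# IbaqNorm: each protein's iBAQ as a fraction of the per-sample iBAQ total
proteins["IbaqNorm"] = proteins.groupby("SampleID")["Ibaq"].transform(
    lambda x: x / x.sum()
)
# IbaqLog: 10 + log10(IbaqNorm)
proteins["IbaqLog"] = 10 + np.log10(proteins["IbaqNorm"])
# IbaqPpb: IbaqNorm scaled by 100,000,000
proteins["IbaqPpb"] = proteins["IbaqNorm"] * 1e8
```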
| Method | Description |
|---|---|
| `median` | Normalize by median across MS runs |
| `mean` | Normalize by mean across MS runs |
| `iqr` | Normalize by interquartile range |
| `none` | Skip feature normalization |
| Method | Description |
|---|---|
| `globalMedian` | Adjust all samples to global median |
| `conditionMedian` | Adjust samples within each condition to median |
| `none` | Skip peptide normalization |
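To make these options concrete, here is a rough sketch of what median run normalization (nmethod) and globalMedian sample normalization (pnmethod) do, assuming log2 intensities and the Run, SampleID, and Condition columns used elsewhere in this README; mokume's internal implementation may differ in detail.

```python
import pandas as pd

df = pd.read_csv("peptides.csv")  # long format with log2 NormIntensity

# nmethod="median": shift each MS run so its median matches the global median
target = df["NormIntensity"].median()
run_median = df.groupby("Run")["NormIntensity"].transform("median")
df["NormIntensity"] = df["NormIntensity"] - run_median + target

# pnmethod="globalMedian": shift each sample to the global median
sample_median = df.groupby("SampleID")["NormIntensity"].transform("median")
df["NormIntensity"] = df["NormIntensity"] - sample_median + df["NormIntensity"].median()

# pnmethod="conditionMedian" works the same way, except the target median is
# computed within each Condition rather than over the whole experiment
```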
mokume provides a comprehensive filter system for quality control. Filters can be configured via YAML/JSON files or CLI options.
| Filter | Parameter | Default | Description |
|---|---|---|---|
| MinIntensityFilter | `min_intensity` | 0.0 | Remove features below threshold |
| CVThresholdFilter | `cv_threshold` | null | Max CV across replicates |
| ReplicateAgreementFilter | `min_replicate_agreement` | 1 | Min replicates with detection |
| QuantileFilter | `quantile_lower/upper` | 0.0/1.0 | Remove intensity outliers |
| Filter | Parameter | Default | Description |
|---|---|---|---|
| PeptideLengthFilter | `min/max_peptide_length` | 7/50 | Peptide length range |
| ChargeStateFilter | `allowed_charge_states` | null | Allowed charges (e.g., [2,3,4]) |
| ModificationFilter | `exclude_modifications` | [] | Remove specific modifications |
| MissedCleavageFilter | `max_missed_cleavages` | null | Max missed cleavages |
| SearchScoreFilter | `min_search_score` | null | Min search engine score |
| SequencePatternFilter | `exclude_sequence_patterns` | [] | Regex patterns to exclude |
| Filter | Parameter | Default | Description |
|---|---|---|---|
| ContaminantFilter | `remove_contaminants/decoys` | true | Remove contaminants/decoys |
| MinPeptideFilter | `min_unique_peptides` | 2 | Min unique peptides per protein |
| ProteinFDRFilter | `fdr_threshold` | 0.01 | Protein-level FDR |
| CoverageFilter | `min_coverage` | 0.0 | Min sequence coverage |
| RazorPeptideFilter | `razor_peptide_handling` | "keep" | Handle shared peptides |
| Filter | Parameter | Default | Description |
|---|---|---|---|
| RunIntensityFilter | `min_total_intensity` | 0.0 | Min total intensity per run |
| MinFeaturesFilter | `min_identified_features` | 0 | Min features per run |
| MissingRateFilter | `max_missing_rate` | 1.0 | Max missing value rate |
| SampleCorrelationFilter | `min_sample_correlation` | null | Min replicate correlation |
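The parameters in the tables above map to sections of the filter configuration file (intensity, peptide, protein, run_qc), mirroring the PreprocessingFilterConfig fields shown in the Python example earlier. A hypothetical combined configuration might look like the following; treat the exact keys as an illustration and generate a template with --generate-filter-config for the authoritative schema:

```yaml
# illustrative_filters.yaml - hypothetical combination of the parameters above
name: illustrative_filters
enabled: true
log_filtered_counts: true
intensity:
  min_intensity: 1000.0
  cv_threshold: 0.3
peptide:
  allowed_charge_states: [2, 3, 4]
  max_missed_cleavages: 2
protein:
  min_unique_peptides: 2
  fdr_threshold: 0.01
run_qc:
  max_missing_rate: 0.5
```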
We provide several pre-configured filter templates for common use cases in tests/example/filters/:
| Configuration | Use Case | Description |
|---|---|---|
| `basic_qc.yaml` | General QC | Minimal filtering for standard experiments |
| `stringent_filtering.yaml` | Publication | High-confidence results with strict thresholds |
| `tmt_labeling.yaml` | TMT/iTRAQ | Optimized for multiplexed labeling experiments |
| `dia_analysis.yaml` | DIA | Optimized for DIA-NN, Spectronaut analysis |
| `exploratory_analysis.yaml` | Exploration | Minimal filtering for data exploration |
Example: Basic QC Configuration
# basic_qc.yaml - Minimal filtering for standard experiments
name: basic_qc
enabled: true
intensity:
  remove_zero_intensity: true
peptide:
  min_peptide_length: 7
  max_peptide_length: 50
protein:
  min_unique_peptides: 2
  remove_contaminants: true
  remove_decoys: true
  contaminant_patterns:
    - CONTAMINANT
    - ENTRAP
    - DECOY

Use these configurations directly:
mokume features2peptides \
-p features.parquet \
--filter-config tests/example/filters/basic_qc.yaml \
-o peptides.csv

The features2peptides command performs the following steps:

- Parse protein identifiers and retain unique peptides
- Remove entries with empty intensity or condition
- Filter peptides by minimum amino acids
- Remove low-confidence proteins (< min_unique peptides)
- Optionally remove decoys, contaminants, and specified proteins
- Normalize at feature level between MS runs
- Merge peptidoforms across fractions and technical replicates
- Normalize at sample level
- Remove low-frequency peptides
- Assemble peptidoforms to peptides
- Optional log2 transformation
The iBAQ calculation in peptides2protein then proceeds as follows:

- Load peptide intensity data
- Extract protein info from FASTA (theoretical peptide counts, MW)
- Group peptide intensities by protein, sample, and condition
- Sum protein intensities within each group
- Normalize by detected peptide count
- Divide by theoretical peptide count
- Optional: Calculate TPA, copy number, concentration
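For orientation, the core of these steps can be sketched with pandas and the public extract_fasta helper. This is a simplified illustration of the "total intensity / theoretical peptides" definition from the output-columns table, not the library's internal peptides_to_protein code:

```python
import pandas as pd
from mokume.io.fasta import extract_fasta

peptides = pd.read_csv("peptides-norm.csv")

# theoretical peptide counts per protein from the in-silico digest
# (assumed here to be a mapping keyed by protein accession)
unique_peptide_counts, mw_dict, found = extract_fasta(
    fasta="proteome.fasta",
    enzyme="Trypsin",
    proteins=list(peptides["ProteinName"].unique()),
    min_aa=7,
    max_aa=30,
    tpa=True,
)

# sum peptide intensities per protein/sample/condition ...
ibaq = (
    peptides.groupby(["ProteinName", "SampleID", "Condition"])["NormIntensity"]
    .sum()
    .reset_index(name="TotalIntensity")
)
# ... and divide by the theoretical peptide count
ibaq["Ibaq"] = ibaq["TotalIntensity"] / ibaq["ProteinName"].map(unique_peptide_counts)
```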
Zheng P, Audain E, Webel H, Dai C, Klein J, Hitz MP, Sachsenberg T, Bai M, Perez-Riverol Y. Ibaqpy: A scalable Python package for baseline quantification in proteomics leveraging SDRF metadata. J Proteomics. 2025;317:105440. doi: 10.1016/j.jprot.2025.105440.
Wang H, Dai C, Pfeuffer J, Sachsenberg T, Sanchez A, Bai M, Perez-Riverol Y. Tissue-based absolute quantification using large-scale TMT and LFQ experiments. Proteomics. 2023;23(20):e2300188. doi: 10.1002/pmic.202300188.
MIT License - see LICENSE for details.