2 changes: 1 addition & 1 deletion .gitignore
@@ -52,7 +52,7 @@ rsconnect/
IonQuant-1.10.27.jar
MSFragger-4.1.jar
diaTracer-1.1.5.jar

fragPipe-23.1/
# data folder contents, tmp folder
data/*
tmp/
115 changes: 115 additions & 0 deletions pre-processing/README.md
@@ -0,0 +1,115 @@
# FragPipe Peptide Preprocessing Workflow

This CWL workflow replicates the functionality of the first 27 lines of `run_fragpipe.sh`, preprocessing peptide sequences before FragPipe analysis.

## Overview

The workflow uses Philosopher to process two input files:
1. **Custom peptide sequences** (custom.fasta)
2. **UniProt canonical human proteome** (UP000005640_9606.fasta.gz - gzipped)

Philosopher automatically:
- Adds the canonical sequences via stdin (`--add /dev/stdin`)
- Adds contaminants from its built-in database (`--contam`)
- Generates decoy sequences (by default, reversed copies of the target sequences)
- Outputs a combined FASTA file ready for FragPipe

## Directory Structure

```
pre-processing/
├── workflows/
│ └── peptide-preprocessing-workflow.cwl # Main workflow
├── tools/
│ └── philosopher-database.cwl # Philosopher database processing tool
├── params/
│ └── example-inputs.yml # Example input configuration
├── tests/
│ └── test_data/ # Test FASTA files
└── README.md # This file
```

## Input Files Required

1. **custom_fasta**: Your custom peptide sequences (FASTA format)
2. **uniprot_canonical_fasta**: UniProt canonical human proteome (gzipped FASTA)
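Both inputs are standard FASTA records. A minimal, purely illustrative sketch of what `custom.fasta` might look like (the header IDs and sequences below are hypothetical placeholders, not real peptides from this project):

```shell
# Create a toy custom FASTA; headers and sequences are made up for illustration.
cat > custom.fasta <<'EOF'
>pep_001 example custom peptide
MKTAYIAKQRQISFVK
>pep_002 example custom peptide
GVLKEYGVSVVVAPK
EOF

# Every FASTA record starts with '>', so counting header lines counts sequences.
grep -c '^>' custom.fasta   # prints 2
```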

## Philosopher Processing

The workflow executes these exact commands from the original script:

```bash
# Initialize workspace
philosopher workspace --init

# Process databases with decoys and contaminants
gunzip -c UP000005640_9606.fasta.gz | philosopher database --custom custom.fasta --add /dev/stdin --contam

# Output file is automatically created: *decoys-contam-custom.fasta.fas
# Renamed to: decoys-contam-custom-canonical.fasta

# Clean workspace
philosopher workspace --clean
```

## Usage

### Local Testing

```bash
# Install cwltool
pip install cwltool

# Run workflow locally (from pre-processing directory)
cd pre-processing
cwltool --outdir ../results workflows/peptide-preprocessing-workflow.cwl params/example-inputs.yml
```

## Input Configuration

Edit the YAML input file to specify your files:

```yaml
custom_fasta:
class: File
path: "custom.fasta"
format: "http://edamontology.org/format_1929"

uniprot_canonical_fasta:
class: File
path: "UP000005640_9606.fasta.gz"
format: "http://edamontology.org/format_3989"
```

## Workflow Steps

The workflow consists of a single step:

1. **philosopher_database**: Runs the complete Philosopher database processing
- Initializes workspace
- Processes custom sequences with canonical sequences via stdin
- Adds contaminants from Philosopher's built-in database
- Generates decoy sequences automatically
- Renames output file to `decoys-contam-custom-canonical.fasta`
- Cleans workspace

## Outputs

- **merged_fasta**: Final combined FASTA file (`decoys-contam-custom-canonical.fasta`)
- **philosopher_log**: Processing log with command details and sequence counts
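As a quick sanity check on `merged_fasta`, decoy and contaminant entries can be counted by their header prefixes. Philosopher prepends `rev_` to decoy headers and `contam_` to contaminant headers by default, but verify the prefixes your Philosopher version actually emits. A sketch against a synthetic stand-in file:

```shell
# Synthetic stand-in for decoys-contam-custom-canonical.fasta; these headers
# are illustrative examples, not real Philosopher output.
cat > decoys-contam-custom-canonical.fasta <<'EOF'
>sp|P12345|EXAMPLE_HUMAN target protein
MKTAYIAK
>contam_sp|P00761|TRYP_PIG contaminant entry
GVLKEYGV
>rev_sp|P12345|EXAMPLE_HUMAN decoy entry
KAIYATKM
EOF

total=$(grep -c '^>' decoys-contam-custom-canonical.fasta)
decoys=$(grep -c '^>rev_' decoys-contam-custom-canonical.fasta)
contams=$(grep -c '^>contam_' decoys-contam-custom-canonical.fasta)
echo "total=$total decoys=$decoys contaminants=$contams"
```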

## Docker Requirements

The workflow uses the FragPipe Docker image:
- Image: `pgc-images.sbgenomics.com/rokita-lab/fragpipe:latest`
- Philosopher v5.1.2 tool at `/fragpipe_bin/fragPipe-23.1/fragpipe-23.1/tools/Philosopher/philosopher-v5.1.2`
- All required dependencies included

## Notes

- This workflow exactly replicates the Philosopher commands from `run_fragpipe.sh`
- Philosopher handles all the complex processing: decompression, merging, decoy generation, and contaminant addition
- The `--contam` flag uses Philosopher's built-in contaminant database
- The `--add /dev/stdin` approach allows piping the gzipped canonical sequences directly, without writing a decompressed copy to disk
- Output file naming and renaming are handled within the single processing step
- The workflow has been simplified to use a single step instead of separate processing and renaming steps
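The `--add /dev/stdin` pattern is plain shell plumbing and can be demonstrated without Philosopher; a minimal sketch with a toy file:

```shell
# Make a tiny gzipped FASTA, then stream it through a pipe the same way the
# workflow feeds Philosopher: the consumer opens /dev/stdin as its input "file".
printf '>seq1 toy record\nMKT\n' > toy.fasta
gzip -f toy.fasta                                  # creates toy.fasta.gz

# grep is handed /dev/stdin as a filename while gunzip streams into the pipe.
gunzip -c toy.fasta.gz | grep -c '^>' /dev/stdin   # prints 1
```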
12 changes: 12 additions & 0 deletions pre-processing/params/example-inputs.yml
@@ -0,0 +1,12 @@
# Example CWL Job Inputs for FragPipe Peptide Preprocessing Workflow
# This example assumes files are in the input/ directory relative to the workflow

custom_fasta:
class: File
path: "../../input/custom.fasta"
doc: "Custom peptide sequences of interest - user-provided FASTA file"

uniprot_canonical_fasta:
class: File
path: "../../input/UP000005640_9606.fasta.gz"
doc: "Gzipped UniProt canonical human proteome sequences (Homo sapiens, reference proteome)"
108 changes: 108 additions & 0 deletions pre-processing/tools/philosopher-database.cwl
@@ -0,0 +1,108 @@
cwlVersion: v1.2
class: CommandLineTool

label: Philosopher Database Processing
doc: |
Replicates the exact Philosopher commands from run_fragpipe.sh:
1. philosopher workspace --init
2. gunzip -c UP000005640_9606.fasta.gz | philosopher database --custom custom.fasta --add /dev/stdin --contam
3. mv *decoys-contam-custom.fasta.fas to output
4. philosopher workspace --clean

requirements:
- class: ShellCommandRequirement
- class: InitialWorkDirRequirement
listing:
- $(inputs.custom_fasta)
- $(inputs.canonical_fasta_gz)
- class: DockerRequirement
dockerPull: pgc-images.sbgenomics.com/rokita-lab/fragpipe:latest

baseCommand: ["/bin/bash"]

arguments:
- valueFrom: |
set -e
set -o pipefail

# Get input file paths
CUSTOM_FASTA="$(inputs.custom_fasta.path)"
CANONICAL_FASTA_GZ="$(inputs.canonical_fasta_gz.path)"

# Use the philosopher binary from the FragPipe container, note this path is different from the copied tools
PHILOSOPHER_BIN="/fragpipe_bin/fragPipe-23.1/fragpipe-23.1/tools/Philosopher/philosopher-v5.1.2"

echo "Starting Philosopher database processing..."
echo "Custom FASTA: $CUSTOM_FASTA"
echo "Canonical FASTA (gzipped): $CANONICAL_FASTA_GZ"

# Initialize the Philosopher workspace
echo "Initializing Philosopher workspace..."
$PHILOSOPHER_BIN workspace --init

# Add canonical UniProt FASTA sequences and contaminants to the custom FASTA file
echo "Adding canonical UniProt FASTA sequences and contaminants..."
gunzip -c "$CANONICAL_FASTA_GZ" | $PHILOSOPHER_BIN database --custom "$CUSTOM_FASTA" --add /dev/stdin --contam

# Find the output file (should match pattern *decoys-contam-custom.fasta.fas).
# Note: backticks are used instead of $(...) here because CWL would otherwise
# try to interpolate $(...) as a parameter reference inside this valueFrom block.
OUTPUT_FILE=`ls *decoys-contam-custom.fasta.fas 2>/dev/null | head -1`
if [ -z "$OUTPUT_FILE" ]; then
echo "Error: Could not find expected output file *decoys-contam-custom.fasta.fas"
echo "Available files:"
ls -la
exit 1
fi

echo "Found output file: $OUTPUT_FILE"

# Copy to standard output name for CWL
cp "$OUTPUT_FILE" decoys-contam-custom-canonical.fasta

# Create log file with detailed information
echo "Philosopher database processing completed successfully" > philosopher.log
echo "Command executed: gunzip -c $CANONICAL_FASTA_GZ | philosopher database --custom $CUSTOM_FASTA --add /dev/stdin --contam" >> philosopher.log
echo "Input files:" >> philosopher.log
echo " - Custom FASTA: $CUSTOM_FASTA" >> philosopher.log
echo " - Canonical FASTA (gzipped): $CANONICAL_FASTA_GZ" >> philosopher.log
echo "Output file generated: $OUTPUT_FILE" >> philosopher.log
echo "Final output: decoys-contam-custom-canonical.fasta" >> philosopher.log

# Count sequences in files
echo "Sequence counts:" >> philosopher.log
echo " - Custom sequences: `grep -c '^>' "$CUSTOM_FASTA" || echo 0`" >> philosopher.log
echo " - Final combined (with decoys/contaminants): `grep -c '^>' decoys-contam-custom-canonical.fasta || echo 0`" >> philosopher.log

# Clean intermediate files (replicates: philosopher workspace --clean)
echo "Cleaning workspace..."
$PHILOSOPHER_BIN workspace --clean

echo "Philosopher database processing completed successfully"
position: 1
prefix: "-c"

inputs:
custom_fasta:
type: File
doc: Custom FASTA file with peptide sequences of interest

canonical_fasta_gz:
type: File
doc: Gzipped canonical UniProt FASTA file (UP000005640_9606.fasta.gz)

outputs:
output_fasta:
type: File
outputBinding:
glob: "decoys-contam-custom-canonical.fasta"
doc: FASTA file with custom sequences, canonical sequences, contaminants, and decoys

log_file:
type: File
outputBinding:
glob: "philosopher.log"
doc: Log file from Philosopher processing

hints:
- class: ResourceRequirement
coresMin: 1
ramMin: 2048
51 changes: 51 additions & 0 deletions pre-processing/workflows/peptide-preprocessing-workflow.cwl
@@ -0,0 +1,51 @@
cwlVersion: v1.2
class: Workflow

label: FragPipe Peptide Preprocessing Workflow
doc: |
This workflow replicates the exact functionality of the first 27 lines of run_fragpipe.sh.
Uses Philosopher to combine custom sequences with canonical UniProt sequences,
add contaminants from Philosopher's database, and generate decoy sequences.

requirements:
- class: ShellCommandRequirement
- class: StepInputExpressionRequirement

inputs:
custom_fasta:
type: File
label: Custom peptide sequences
doc: FASTA file containing custom peptide sequences of interest

uniprot_canonical_fasta:
type: File
label: UniProt canonical sequences (gzipped)
doc: Gzipped FASTA file containing UniProt canonical protein sequences (UP000005640_9606.fasta.gz)

outputs:
merged_fasta:
type: File
label: Final merged FASTA with decoys and contaminants
doc: Combined FASTA file with custom sequences, canonical sequences, contaminants, and decoys
outputSource: philosopher_database/output_fasta

philosopher_log:
type: File?
label: Philosopher processing log
doc: Log file from Philosopher database processing
outputSource: philosopher_database/log_file

steps:
philosopher_database:
run: ../tools/philosopher-database.cwl
in:
custom_fasta: custom_fasta
canonical_fasta_gz: uniprot_canonical_fasta
out: [output_fasta, log_file]

$namespaces:
sbg: https://sevenbridges.com

hints:
- class: sbg:maxNumberOfParallelInstances
value: 1