Commit e3525b0

Merge branch 'main' into dev
2 parents 48673fc + dfd4e19 commit e3525b0

10 files changed: +641 -78 lines changed

.markdownlint.json

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+{
+  "MD046": {
+    "style": "fenced"
+  },
+  "MD013": false,
+  "MD033": false,
+  "MD041": false
+}

docs/PSM_SCHEMA.md

Lines changed: 67 additions & 29 deletions
@@ -43,42 +43,54 @@ Verified the conversion from MaxQuant `msms.txt` to PSM parquet format:
 
 ## PSM Schema Fields (22 total)
 
+### Field Classification
+
+Fields are classified as:
+
+- **PK** (Primary Key): Must not be null, required for data integrity
+- **nullable**: Column always exists, but value can be null
+- **optional**: Column may not exist in file at all
+
 ### Core Identification Fields (14)
 
-| Field | Type | Description |
-| --------------------------- | ------------ | ---------------------------------------------- |
-| sequence | string | Unmodified peptide sequence |
-| peptidoform | string | Peptide sequence with modifications (ProForma) |
-| modifications | list[struct] | Modification details with positions and scores |
-| precursor_charge | int32 | Precursor ion charge |
-| posterior_error_probability | float64 | PEP value |
-| is_decoy | int32 | Decoy indicator (1=decoy, 0=target) |
-| calculated_mz | float32 | Theoretical m/z |
-| observed_mz | float32 | Experimental m/z |
-| additional_scores | list[struct] | Search engine scores with name, value, and higher_better direction indicator |
-| predicted_rt | float32 | Predicted retention time (seconds) |
-| reference_file_name | string | Reference file name |
-| cv_params | list[struct] | CV parameters |
-| scan | string | Scan number |
-| rt | float32 | MS2 retention time (seconds) |
+| Field | Type | Classification | Description |
+| --------------------------- | ------------ | -------------- | ---------------------------------------------- |
+| sequence | string | PK | Unmodified peptide sequence |
+| peptidoform | string | PK | Peptide sequence with modifications (ProForma) |
+| modifications | list[struct] | nullable | Modification details with positions and scores |
+| precursor_charge | int32 | PK | Precursor ion charge |
+| posterior_error_probability | float32 | nullable | PEP value |
+| is_decoy | int32 | required | Decoy indicator (1=decoy, 0=target) |
+| calculated_mz | float32 | required | Theoretical m/z |
+| observed_mz | float32 | required | Experimental m/z |
+| additional_scores | list[struct] | nullable | Search engine scores |
+| predicted_rt | float32 | nullable | Predicted retention time (seconds) |
+| reference_file_name | string | PK | Reference file name |
+| cv_params | list[struct] | nullable | CV parameters |
+| scan | string | PK | Scan number |
+| rt | float32 | nullable | MS2 retention time (seconds) |
 
 ### Protein Mapping Fields (1)
 
-| Field | Type | Description |
-| ------------------ | ------------ | ------------------ |
-| protein_accessions | list[string] | Protein accessions |
+| Field | Type | Classification | Description |
+| ------------------ | ------------ | -------------- | ------------------ |
+| protein_accessions | list[string] | nullable | Protein accessions |
+
+### Spectral Data Fields (7) - Optional
+
+**Note**: These fields are **optional** - they may not exist in the file at all. Use `--spectral-data` flag during conversion to include these columns.
 
-### Spectral Data Fields (7)
+| Field | Type | Classification | Description |
+| ------------------ | ------------- | -------------- | ------------------------------------ |
+| ion_mobility | float32 | optional | Ion mobility value |
+| number_peaks | int32 | optional | Number of peaks |
+| mz_array | list[float32] | optional | m/z values array |
+| intensity_array | list[float32] | optional | Intensity values array |
+| charge_array | list[int32] | optional | Fragment ion charge array |
+| ion_type_array | list[string] | optional | Ion type annotations (b, y, a, etc.) |
+| ion_mobility_array | list[float32] | optional | Fragment ion mobility array |
 
-| Field | Type | Description |
-| ------------------ | ------------- | ------------------------------------ |
-| ion_mobility | float32 | Ion mobility value |
-| number_peaks | int32 | Number of peaks |
-| mz_array | list[float32] | m/z values array |
-| intensity_array | list[float32] | Intensity values array |
-| charge_array | list[int32] | Fragment ion charge array |
-| ion_type_array | list[string] | Ion type annotations (b, y, a, etc.) |
-| ion_mobility_array | list[float32] | Fragment ion mobility array |
+**Nullable vs Optional**: These fields are *optional* (column may be absent), not just *nullable* (column exists with null values). See [Field Classification](format-specification.md#field-classification) for details.
 
 ## How to Generate Examples
 
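The PK / nullable / optional distinction added in this hunk could be checked along the following lines. This is a minimal sketch, not part of QPX: `FIELD_CLASSIFICATION`, `check_rows`, and the sample rows are hypothetical names, and the data is held in plain Python lists and dicts rather than read from a Parquet file. The tables also label three fields `required`, which the Field Classification list does not define; the sketch assumes `required` behaves like PK (column present, value non-null).

```python
# Classification per field, copied from the tables in this diff.
# ASSUMPTION: "required" is undefined in the Field Classification list;
# here it is treated like PK (column must exist, value must not be null).
FIELD_CLASSIFICATION = {
    "sequence": "PK",
    "peptidoform": "PK",
    "modifications": "nullable",
    "precursor_charge": "PK",
    "posterior_error_probability": "nullable",
    "is_decoy": "required",
    "calculated_mz": "required",
    "observed_mz": "required",
    "additional_scores": "nullable",
    "predicted_rt": "nullable",
    "reference_file_name": "PK",
    "cv_params": "nullable",
    "scan": "PK",
    "rt": "nullable",
    "protein_accessions": "nullable",
    "ion_mobility": "optional",
    "number_peaks": "optional",
    "mz_array": "optional",
    "intensity_array": "optional",
    "charge_array": "optional",
    "ion_type_array": "optional",
    "ion_mobility_array": "optional",
}

NON_NULL_KINDS = {"PK", "required"}


def check_rows(columns, rows):
    """Return a list of human-readable violations of the field classification.

    columns: names of the columns present in the file
    rows:    list of dicts mapping column name -> value (None means null)
    """
    problems = []
    present = set(columns)
    for field, kind in FIELD_CLASSIFICATION.items():
        if field not in present:
            # Only optional columns are allowed to be absent entirely.
            if kind != "optional":
                problems.append(f"missing column: {field} ({kind})")
            continue
        if kind in NON_NULL_KINDS:
            for i, row in enumerate(rows):
                if row.get(field) is None:
                    problems.append(f"null in {kind} field {field} at row {i}")
    return problems


# Hypothetical sample: a file converted without spectral columns.
columns = [f for f, kind in FIELD_CLASSIFICATION.items() if kind != "optional"]
good_row = {name: 0 for name in columns}
bad_row = dict(good_row, sequence=None)

print(check_rows(columns, [good_row]))
print(check_rows(columns, [bad_row]))
```

In a real pipeline the column list would come from the Parquet file's schema, and a null check over whole columns would be cheaper than iterating rows; the row loop is used here only to keep the sketch dependency-free.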

@@ -152,6 +164,32 @@ When `--spectral-data` is enabled, spectral arrays are populated:
 }
 ```
 
+### PSM with Fragmentation Information (via cv_params)
+
+Fragmentation method and collision energy should be stored as CV terms in the `cv_params` field:
+
+```json
+{
+  "sequence": "PEPTIDESEQ",
+  "cv_params": [
+    { "cv_name": "dissociation method", "cv_value": "HCD" },
+    { "cv_name": "normalized collision energy", "cv_value": "28" }
+  ]
+}
+```
+
+**Why human-readable names?** QPX uses readable term names (like `dissociation method`) instead of ontology accessions (like `MS:1000044`) to align with successful omics formats such as GTF and AnnData. This makes data self-documenting while the specification provides formal definitions. See [Design Philosophy](format-specification.md#psm-cv-params) for details.
+
+**Common fragmentation CV terms:**
+
+| CV Name | Example Values |
+|---------|----------------|
+| dissociation method | HCD, CID, ETD, ECD, UVPD |
+| collision energy | 28, 35 (in eV) |
+| normalized collision energy | 28, 30 (percentage) |
+
+Full reference with PSI-MS accessions: [Common CV Terms](format-specification.md#common-cv-terms)
+
 ### De Novo / No Protein Association
 
 For De Novo analysis where PSMs have no protein mapping, `protein_accessions` is nullable:
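
The `cv_params` example added earlier in this hunk can be read back with ordinary dictionary handling. The sketch below is an illustration only: `cv_lookup` and `fragmentation_info` are hypothetical helper names, not QPX functions. It assumes each entry carries the `cv_name` and `cv_value` keys shown in the example, that values are stored as strings, and that `cv_params` may be null, as its classification states.

```python
import json

# The record from the documentation example above.
record = json.loads("""
{
  "sequence": "PEPTIDESEQ",
  "cv_params": [
    { "cv_name": "dissociation method", "cv_value": "HCD" },
    { "cv_name": "normalized collision energy", "cv_value": "28" }
  ]
}
""")


def cv_lookup(psm):
    """Map cv_name -> cv_value. cv_params is nullable, so None becomes {}."""
    return {p["cv_name"]: p["cv_value"] for p in (psm.get("cv_params") or [])}


def fragmentation_info(psm):
    """Extract the fragmentation terms, converting the energy to a float."""
    terms = cv_lookup(psm)
    nce = terms.get("normalized collision energy")
    return {
        "method": terms.get("dissociation method"),
        "normalized_collision_energy": float(nce) if nce is not None else None,
    }


info = fragmentation_info(record)
print(info)
```

Looking terms up by readable name keeps consumer code short, which is the trade-off the hunk argues for; a consumer that needs ontology accessions would have to map names through the specification's CV term table.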

docs/cli-reference.md

Lines changed: 3 additions & 3 deletions
@@ -22,7 +22,7 @@ Convert various mass spectrometry data formats to the QPX standard format:
 
 Transform and process data within the QPX ecosystem:
 
-- **Absolute Expression (AE)**: Convert iBAQ absolute expression data to QPX format ([format specification](https://io.quantms.org/format-specification/#absolute))
+- **Absolute Expression (AE)**: Convert iBAQ absolute expression data to QPX format ([format specification](format-specification.md#absolute))
 - **Differential Expression (DE)**: Convert MSstats differential expression analysis results
 - **Gene Mapping**: Map gene information to protein data
 - **iBAQ Transformation**: Process iBAQ quantification files
@@ -128,7 +128,7 @@ qpxc stats analyze psm \
 ## Getting Help
 
 - Each command provides detailed help information using the `--help` parameter
-- See [Format Specification](format-specification.md) for output file formats
-- View the [online format specification](https://io.quantms.org/format-specification/) for detailed schema information
+- See [Format Specification](format-specification.md) for output file formats and detailed schema information
+- Check the [Troubleshooting Guide](troubleshooting.md) for common issues
 - Visit the [GitHub Repository](https://github.com/bigbio/qpx) to report issues
 

docs/cli-transform.md

Lines changed: 2 additions & 2 deletions
@@ -118,7 +118,7 @@ from qpx.commands.transform.ae import convert_ibaq_absolute_cmd
 print(generate_description(convert_ibaq_absolute_cmd))
 ```
 
-**Format Specification**: For details about the AE format structure and fields, see the [Absolute Expression Format Specification](https://io.quantms.org/format-specification/#absolute).
+**Format Specification**: For details about the AE format structure and fields, see the [Absolute Expression Format Specification](format-specification.md#absolute).
 
 ### Parameters {#ae-parameters}
 
@@ -174,7 +174,7 @@ Q67890 500000 480000 520000
 
 - **Output**: `{output-prefix}-{uuid}.absolute.parquet`
 - **Format**: Parquet file containing absolute expression quantification
-- **Schema**: Conforms to [QPX absolute expression specification](https://io.quantms.org/format-specification/#absolute)
+- **Schema**: Conforms to [QPX absolute expression specification](format-specification.md#absolute)
 
 ### Common Issues {#ae-issues}
 
