Changes to support qpx #6
📝 Walkthrough

The PR introduces SDRF metadata enrichment to the peptide normalization module. Changes include robust fallback logic for technical replicate extraction and a new SDRF enrichment flow, summarized in the sequence diagrams.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor User
    participant Feature as Feature Class
    participant DuckDB as DuckDB Engine
    participant SDRF as SDRF File
    participant View as View Layer
    User->>Feature: enrich_with_sdrf(sdrf_path)
    Feature->>SDRF: Load SDRF metadata
    SDRF-->>Feature: SDRF data
    Feature->>Feature: Derive condition/bio_rep/fraction mappings
    Feature->>DuckDB: Register SDRF mapping table
    DuckDB-->>Feature: Mapping registered
    Feature->>DuckDB: Create unnested intensities view
    DuckDB-->>Feature: Unnested view created
    Feature->>View: Recreate parquet_db view<br/>(join with SDRF)
    View-->>Feature: Enriched view ready
    Feature-->>User: SDRF enrichment complete
```
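To make the mapping-derivation and join steps in the diagram concrete, here is a small pure-Python sketch; the SDRF column names and the `run_map` shape are illustrative assumptions, not the module's actual schema:

```python
# Hypothetical SDRF rows: one row per raw file (run)
sdrf_rows = [
    {"source name": "sample-1", "comment[data file]": "S1_Brain_1.raw",
     "factor value[condition]": "control", "comment[fraction identifier]": "1"},
    {"source name": "sample-2", "comment[data file]": "S2_Brain_1.raw",
     "factor value[condition]": "treated", "comment[fraction identifier]": "1"},
]

# Derive a run -> (condition, fraction) mapping, i.e. the table the
# diagram says is registered with DuckDB before the join
run_map = {
    row["comment[data file]"].removesuffix(".raw"): {
        "condition": row["factor value[condition]"],
        "fraction": row["comment[fraction identifier]"],
    }
    for row in sdrf_rows
}

# Enrich a long-format intensity record by joining on run name
record = {"run": "S1_Brain_1", "intensity": 1234.5}
enriched = {**record, **run_map[record["run"]]}
```

In the actual implementation the join happens inside DuckDB when the `parquet_db` view is recreated; the dictionary merge above only mirrors the semantics.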
```mermaid
sequenceDiagram
    participant DataLoad as Data Load
    participant Feature as Feature.__init__
    participant DuckDB as DuckDB
    participant View as View Creation
    DataLoad->>Feature: Initialize with parquet
    Feature->>DuckDB: parquet_scan (create parquet_db_raw)
    DuckDB-->>View: Raw view available
    View->>DuckDB: UNNEST nested intensities
    DuckDB-->>Feature: parquet_db view (long-format)
    Note over Feature: Data reshaped:<br/>Nested → Unnested/Long-format
    Feature-->>DataLoad: Feature ready for processing
```
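The Nested → Long-format reshaping in the second diagram amounts to unnesting a per-feature list of sample intensities into one row per (feature, sample) pair. A pure-Python sketch of what the UNNEST view would produce (field names are illustrative, not the parquet schema):

```python
# Nested: each peptide feature carries a list of per-sample intensities
nested = [
    {"sequence": "PEPTIDEK",
     "intensities": [{"sample": "S1", "intensity": 10.0},
                     {"sample": "S2", "intensity": 12.5}]},
]

# Long format: one row per (feature, sample), as UNNEST would emit
long_rows = [
    {"sequence": feat["sequence"], **entry}
    for feat in nested
    for entry in feat["intensities"]
]
```

In the PR this reshaping is done once in SQL at view-creation time, so downstream normalization code can treat the data as an ordinary long table.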
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 inconclusive

✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@mokume/normalization/peptide.py`:
- Around line 515-519: The CREATE VIEW call directly injects database_path into
the SQL string via format, which can break on quotes or permit SQL injection;
update the call that uses self.parquet_db.execute so the parquet_scan path is
passed as a bound parameter instead of string-formatted in the SQL (i.e. replace
the formatted "SELECT * FROM parquet_scan('{}')" usage with a parameter
placeholder and pass database_path as the parameter to execute). Ensure you
modify the code that constructs the SQL (the parquet_scan invocation inside the
CREATE VIEW statement) and run the query using the DB API's parameter binding
mechanism, then test with paths containing quotes and special chars to confirm
correct behavior.
- Around line 325-337: The code tries to use pandas .str methods on data_df[RUN]
without ensuring it's a string dtype, causing AttributeError for numeric RUNs
and bypassing the except fallback; update the block referencing data_df, RUN and
TECHREPLICATE to cast the RUN column to string first (e.g., use
data_df[RUN].astype(str) before .str.contains, .str.split, .str.get) so numeric
run names are handled correctly, keep the existing int casts/except fallback
logic intact; make the same change in the analogous block around the code that
handles lines 743-755.
In `@tests/test_peptide_normalize.py`:
- Around line 198-214: The test fails if the output parent directory doesn't
exist: before calling peptide_normalization(**args) (where args["output"] is
used and WriteCSVTask writes the file), ensure the parent directory of
Path(args["output"]) is created (e.g., out.parent.mkdir(parents=True,
exist_ok=True)) so the file can be written; apply this same fix to the other
similar test blocks in tests/test_peptide_normalize.py that construct out and
call peptide_normalization.
- Around line 116-117: The linter warnings come from unused variables in the
test loops and an unused assignment; in the loop iterating over med_map replace
the unused loop variable `sample` with `_`, similarly replace the unused
`condition` loop variable with `_`, and remove the unused `samples` assignment
(or stop assigning to it) so only used variables remain; update occurrences near
the med_map loop and the two other loops that reference `condition` and
`samples` in tests/test_peptide_normalize.py (keep the loop structure and
assertions intact).
```python
# Try to extract technical replicate from run name
try:
    if data_df[RUN].str.contains("_").all():
        # Get the last part after underscore (e.g., "S1_Brain_2" -> "2")
        last_parts = data_df[RUN].str.split("_").str.get(-1)
        data_df[TECHREPLICATE] = last_parts.astype("int")
    else:
        data_df[TECHREPLICATE] = data_df[RUN].astype("int")
except (ValueError, TypeError):
    # Fall back to using run index
    unique_runs = data_df[RUN].unique().tolist()
    run_to_index = {run: i + 1 for i, run in enumerate(unique_runs)}
    data_df[TECHREPLICATE] = data_df[RUN].map(run_to_index)
```
🧩 Analysis chain
🏁 Script executed:

```shell
# First, let's find and examine the peptide.py file
find . -type f -name "peptide.py" | head -20
```

Repository: bigbio/mokume

🏁 Script executed:

```shell
# Read the file around lines 325-337
cat -n mokume/normalization/peptide.py | sed -n '315,345p'
```

🏁 Script executed:

```shell
# Also read the file around lines 743-755
cat -n mokume/normalization/peptide.py | sed -n '735,765p'
```

🏁 Script executed:

```shell
# Check what RUN and TECHREPLICATE are (likely constants)
rg "^RUN\s*=" mokume/normalization/peptide.py | head -5
rg "^TECHREPLICATE\s*=" mokume/normalization/peptide.py | head -5
```

🏁 Script executed:

```shell
# Search for constant definitions
rg "(RUN|TECHREPLICATE)" mokume/normalization/peptide.py | grep -E "(^[^:]*:\s*(RUN|TECHREPLICATE)\s*=)" | head -10
```

🏁 Script executed:

```shell
# Check imports and constants at the top of the file
head -100 mokume/normalization/peptide.py | cat -n
```
🏁 Script executed:

```shell
# Let's create a test to understand the behavior
python3 << 'EOF'
import pandas as pd

# Test 1: str.contains() on numeric dtype
df_numeric = pd.DataFrame({'col': [1, 2, 3]})
try:
    result = df_numeric['col'].str.contains("_")
    print(f"Test 1 result: {result}")
except AttributeError as e:
    print(f"Test 1 - AttributeError: {e}")
except Exception as e:
    print(f"Test 1 - Other exception ({type(e).__name__}): {e}")

# Test 2: str.contains() on string dtype
df_string = pd.DataFrame({'col': ['A_1', 'B_2', 'C_3']})
try:
    result = df_string['col'].str.contains("_")
    print(f"Test 2 result:\n{result.tolist()}")
except Exception as e:
    print(f"Test 2 - Exception: {e}")

# Test 3: Conversion and then str.contains()
df_numeric = pd.DataFrame({'col': [1, 2, 3]})
try:
    result = df_numeric['col'].astype(str).str.contains("_")
    print(f"Test 3 result:\n{result.tolist()}")
except Exception as e:
    print(f"Test 3 - Exception: {e}")

# Test 4: What happens with mixed data (strings with numbers)
df_mixed = pd.DataFrame({'col': ['S1_Brain_2', 'S2_Brain_3', 123]})
try:
    result = df_mixed['col'].str.contains("_")
    print(f"Test 4 result:\n{result.tolist()}")
except Exception as e:
    print(f"Test 4 - Exception ({type(e).__name__}): {e}")
EOF
```
🏁 Script executed:

```shell
# Simpler test with a timeout and smaller scope
timeout 30 python3 -c "
import pandas as pd
# Test what exception is raised on numeric dtype
df = pd.DataFrame({'col': [1, 2, 3]})
try:
    df['col'].str.contains('_')
except AttributeError:
    print('AttributeError')
except Exception as e:
    print(type(e).__name__)
"
```

🏁 Script executed:

```shell
# Check what happens when RUN column contains numeric values
# Let's look at the data loading code to understand the dtype
rg -A 10 -B 5 "read_csv|read_parquet" mokume/normalization/peptide.py | head -50
```

🏁 Script executed:

```shell
# Look at error handling around the RUN field
rg -B 5 -A 5 "RUN\]" mokume/normalization/peptide.py | head -80
```

🏁 Script executed:

```shell
# Check how the RUN column is loaded/determined
rg -B 10 -A 10 "data_df\s*=.*read" mokume/normalization/peptide.py | head -60
```

🏁 Script executed:

```shell
# Search for any type conversion or dtype specification for RUN
rg "RUN.*dtype|astype.*RUN" mokume/normalization/peptide.py
```

🏁 Script executed:

```shell
# Check if there are any tests or examples showing RUN with numeric values
find . -name "*.py" -type f -exec rg -l "RUN.*=.*[0-9]" {} \; | head -10
```
🏁 Script executed:

```shell
# Verify pandas behavior with simple inline Python
python3 << 'EOF'
import pandas as pd
import sys

# Test .str accessor on numeric dtype
try:
    df = pd.DataFrame({'col': [1, 2, 3]})
    result = df['col'].str.contains("_")
    print("No exception raised")
except AttributeError as e:
    print(f"AttributeError: {str(e)[:80]}")
except Exception as e:
    print(f"Other {type(e).__name__}: {str(e)[:80]}")

# Test conversion approach
try:
    df = pd.DataFrame({'col': [1, 2, 3]})
    result = df['col'].astype(str).str.contains("_")
    print(f"After astype(str): Success - {result.tolist()}")
except Exception as e:
    print(f"After astype(str): {type(e).__name__}")
EOF
```

🏁 Script executed:

```shell
# Let's check the data flow more carefully - look at where data_df is created
rg -B 30 "data_df\s*=" mokume/normalization/peptide.py | grep -A 30 "def peptide_normalization"
```

🏁 Script executed:

```shell
# Check function signature and early operations
sed -n '250,330p' mokume/normalization/peptide.py
```

🏁 Script executed:

```shell
# Look for any test files that might show what RUN values look like
find . -name "*test*.py" -type f | xargs rg "RUN" | head -20
```

🏁 Script executed:

```shell
# Check if there's any documentation about the expected RUN format
rg "RUN" mokume/core/constants.py -B 3 -A 3
```

🏁 Script executed:

```shell
# Let's look at the second location (743-755) more carefully
sed -n '730,760p' mokume/normalization/peptide.py
```

🏁 Script executed:

```shell
# Check if there are any integration tests or actual data examples
find . -name "*.parquet" -o -name "*.csv" | head -5
```

🏁 Script executed:

```shell
# Verify pandas behavior documentation by looking at existing error handling
rg "AttributeError" mokume/ | head -10
```

🏁 Script executed:

```shell
# Check what column type the parquet_db returns for "run"
rg "parquet_db.*sql" mokume/normalization/peptide.py -B 2 -A 5
```
Add string casting before .str accessor to handle numeric RUN dtypes.
If the RUN column is numeric, calling .str.contains("_") raises AttributeError instead of ValueError/TypeError, which bypasses the fallback logic. Cast to string first to ensure the code handles numeric runs as documented (case 2 in the docstring: "If run name is numeric, use it directly").
🛠️ Proposed fix

```diff
@@
-            if data_df[RUN].str.contains("_").all():
+            run_series = data_df[RUN].astype(str)
+            if run_series.str.contains("_").all():
                 # Get the last part after underscore (e.g., "S1_Brain_2" -> "2")
-                last_parts = data_df[RUN].str.split("_").str.get(-1)
-                data_df[TECHREPLICATE] = last_parts.astype("int")
+                last_parts = run_series.str.split("_").str.get(-1)
+                data_df[TECHREPLICATE] = pd.to_numeric(last_parts, errors="raise").astype("int")
             else:
-                data_df[TECHREPLICATE] = data_df[RUN].astype("int")
+                data_df[TECHREPLICATE] = pd.to_numeric(data_df[RUN], errors="raise").astype("int")
@@
-            if unique["run"].str.contains("_").all():
+            run_series = unique["run"].astype(str)
+            if run_series.str.contains("_").all():
                 # Get the last part after splitting by underscore
-                last_parts = unique["run"].str.split("_").str.get(-1)
-                unique["run"] = last_parts.astype("int")
+                last_parts = run_series.str.split("_").str.get(-1)
+                unique["run"] = pd.to_numeric(last_parts, errors="raise").astype("int")
             else:
                 # Try to convert directly to int
-                unique["run"] = unique["run"].astype("int")
+                unique["run"] = pd.to_numeric(unique["run"], errors="raise").astype("int")
```

Also applies to: 743-755
🤖 Prompt for AI Agents
In `@mokume/normalization/peptide.py` around lines 325 - 337, The code tries to
use pandas .str methods on data_df[RUN] without ensuring it's a string dtype,
causing AttributeError for numeric RUNs and bypassing the except fallback;
update the block referencing data_df, RUN and TECHREPLICATE to cast the RUN
column to string first (e.g., use data_df[RUN].astype(str) before .str.contains,
.str.split, .str.get) so numeric run names are handled correctly, keep the
existing int casts/except fallback logic intact; make the same change in the
analogous block around the code that handles lines 743-755.
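As a sanity check on the suggested behavior, here is a pure-Python sketch of the same three-case fallback logic (pandas specifics simplified away; `extract_techreplicates` is a hypothetical helper, not code from the PR):

```python
def extract_techreplicates(runs):
    """Mirror the review's suggested technical-replicate extraction.

    1. If every run name contains "_", use the part after the last underscore.
    2. Otherwise, try to interpret each run name as an integer directly.
    3. If either cast fails, fall back to a 1-based index per unique run.
    """
    runs = [str(r) for r in runs]  # cast to string first, as the review recommends
    try:
        if all("_" in r for r in runs):
            return [int(r.rsplit("_", 1)[-1]) for r in runs]
        return [int(r) for r in runs]
    except (ValueError, TypeError):
        run_to_index = {}
        for r in runs:
            run_to_index.setdefault(r, len(run_to_index) + 1)
        return [run_to_index[r] for r in runs]
```

With the up-front string cast, numeric run names hit case 2 instead of raising an `AttributeError` that the `except` clause never catches.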
```python
# Create raw view from parquet
self.parquet_db.execute(
    "CREATE VIEW parquet_db_raw AS SELECT * FROM parquet_scan('{}')".format(
        database_path
    )
)
```
🧩 Analysis chain
🏁 Script executed:

```shell
cat -n mokume/normalization/peptide.py | sed -n '510,525p'
```
🌐 Web query:
DuckDB parquet_scan parameterized queries bound parameters
💡 Result:

DuckDB supports bound parameters in prepared statements using ?, positional $1/$2/..., or named $param placeholders. You can use these placeholders in normal predicates and as arguments to table functions like parquet_scan(...) (e.g., binding the file path / glob at execution time). [1]

SQL (CLI / SQL API):

```sql
PREPARE q AS
SELECT *
FROM parquet_scan($1)
WHERE some_col >= $2;
EXECUTE q('data/**/*.parquet', 10);
```

DuckDB also supports named parameters in prepared statements. [1]

Python (DB-API): DuckDB's Python execute() accepts a parameters= argument and will run the statement as a prepared statement with bound values. [2][3]

Positional (? or $1):

```python
import duckdb

con = duckdb.connect()
path = "data/**/*.parquet"
min_val = 10
df = con.execute(
    "SELECT * FROM parquet_scan(?) WHERE some_col >= ?",
    [path, min_val],
).fetch_df()
```

Named ($param):

```python
df = con.execute(
    "SELECT * FROM parquet_scan($path) WHERE some_col >= $min_val",
    {"path": path, "min_val": min_val},
).fetch_df()
```

Important limitation: bind parameters substitute values, not identifiers, so you can't parameterize things like column names or ORDER BY <identifier> directly; for those you must construct SQL text (carefully) or use other approaches. [1]

Sources: DuckDB prepared statements syntax (SQL) [1]; DuckDB Python DB-API parameter binding [2]; DuckDB Python execute(query, parameters=None) reference [3].
Parameterize parquet_scan to avoid SQL injection and quoting issues.
Direct string formatting with a user-supplied path can break on quotes and enable SQL injection. Use bound parameters instead.
🔒 Proposed fix

```diff
-        self.parquet_db.execute(
-            "CREATE VIEW parquet_db_raw AS SELECT * FROM parquet_scan('{}')".format(
-                database_path
-            )
-        )
+        self.parquet_db.execute(
+            "CREATE VIEW parquet_db_raw AS SELECT * FROM parquet_scan(?)",
+            [database_path],
+        )
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
# Create raw view from parquet
self.parquet_db.execute(
    "CREATE VIEW parquet_db_raw AS SELECT * FROM parquet_scan(?)",
    [database_path],
)
```
🧰 Tools
🪛 Ruff (0.14.14)
517-519: Possible SQL injection vector through string-based query construction
(S608)
🤖 Prompt for AI Agents
In `@mokume/normalization/peptide.py` around lines 515 - 519, The CREATE VIEW call
directly injects database_path into the SQL string via format, which can break
on quotes or permit SQL injection; update the call that uses
self.parquet_db.execute so the parquet_scan path is passed as a bound parameter
instead of string-formatted in the SQL (i.e. replace the formatted "SELECT *
FROM parquet_scan('{}')" usage with a parameter placeholder and pass
database_path as the parameter to execute). Ensure you modify the code that
constructs the SQL (the parquet_scan invocation inside the CREATE VIEW
statement) and run the query using the DB API's parameter binding mechanism,
then test with paths containing quotes and special chars to confirm correct
behavior.
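Parameter binding is the right default, but support for bound parameters inside DDL such as CREATE VIEW can vary between DuckDB versions, so it is worth testing against the version in use. If binding is unavailable there, a fallback is to escape the path as a SQL string literal by hand. A minimal sketch under that assumption (`quote_sql_string_literal` is a hypothetical helper, not part of the codebase):

```python
def quote_sql_string_literal(value: str) -> str:
    """Render a Python string as a single-quoted SQL string literal.

    Standard SQL escapes an embedded single quote by doubling it,
    so "data/it's.parquet" becomes 'data/it''s.parquet'.
    """
    return "'" + value.replace("'", "''") + "'"

# Building the DDL with the escaped literal instead of raw .format()
path = "data/it's.parquet"
sql = "CREATE VIEW parquet_db_raw AS SELECT * FROM parquet_scan({})".format(
    quote_sql_string_literal(path)
)
```

This closes the quoting/injection hole for paths containing single quotes, though bound parameters remain preferable wherever the engine accepts them.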
```python
for sample, factor in med_map.items():
    assert factor > 0
```
🧩 Analysis chain
🏁 Script executed:

```shell
fd test_peptide_normalize.py
```

🏁 Script executed:

```shell
#!/bin/bash
# Check if the file exists and show lines around 116-117, 152-153, 161-162
if [ -f tests/test_peptide_normalize.py ]; then
  echo "=== Lines 110-125 (around 116-117) ==="
  sed -n '110,125p' tests/test_peptide_normalize.py | cat -n
  echo ""
  echo "=== Lines 145-160 (around 152-153) ==="
  sed -n '145,160p' tests/test_peptide_normalize.py | cat -n
  echo ""
  echo "=== Lines 155-170 (around 161-162) ==="
  sed -n '155,170p' tests/test_peptide_normalize.py | cat -n
else
  echo "File not found"
fi
```

🏁 Script executed:

```shell
# Search for usage of variables 'sample', 'condition', 'samples' in the context of the flagged lines
cat tests/test_peptide_normalize.py | head -165 | tail -60
```
Remove unused variables that trigger linter warnings.
Three locations have loop or assignment variables that are never referenced:
- Line 116: `sample` loop variable is unused in the `for` loop
- Line 152: `condition` loop variable is unused in the `for` loop
- Line 161: `samples` variable is assigned but never used afterward

Replace the unused loop variables with `_` and remove the unused assignment to keep the code clean.
Changes needed

```diff
@@
-    for sample, factor in med_map.items():
+    for _, factor in med_map.items():
         assert factor > 0
@@
-    for condition, samples in med_map.items():
+    for _, samples in med_map.items():
         assert isinstance(samples, dict)
@@
-    samples = feature.get_unique_samples()
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
for _, factor in med_map.items():
    assert factor > 0
```
🧰 Tools
🪛 Ruff (0.14.14)
116-116: Loop control variable `sample` not used within loop body.
Rename unused `sample` to `_sample`.
(B007)
🤖 Prompt for AI Agents
In `@tests/test_peptide_normalize.py` around lines 116 - 117, The linter warnings
come from unused variables in the test loops and an unused assignment; in the
loop iterating over med_map replace the unused loop variable `sample` with `_`,
similarly replace the unused `condition` loop variable with `_`, and remove the
unused `samples` assignment (or stop assigning to it) so only used variables
remain; update occurrences near the med_map loop and the two other loops that
reference `condition` and `samples` in tests/test_peptide_normalize.py (keep the
loop structure and assertions intact).
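A minimal illustration of the rename (unrelated to the repository's tests; `med_map` here is just a stand-in dict):

```python
med_map = {"sample_1": 1.2, "sample_2": 0.9}

# Unused loop variable renamed to `_` so linters (e.g. Ruff B007) stay quiet
for _, factor in med_map.items():
    assert factor > 0

# If only the values are needed, iterating .values() is clearer still
factors = list(med_map.values())
```

Either spelling keeps the loop structure and assertions intact while removing the warning.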
```python
        "output": str(TESTS_DIR / "example" / "out" / "PXD020192-no-sdrf.csv"),
        "skip_normalization": False,
        "nmethod": "median",
        "pnmethod": "globalMedian",
        "log2": True,
        "save_parquet": False,
    }

    out = Path(args["output"])
    if out.exists():
        out.unlink()

    # Should complete without errors
    peptide_normalization(**args)

    # Output should exist
    assert out.exists()
```
Create the output directory before writing to avoid FileNotFoundError.
If tests/example/out doesn’t exist in a clean checkout, these tests will fail when WriteCSVTask tries to open the file.
🛠️ Proposed fix

```diff
@@
     out = Path(args["output"])
+    out.parent.mkdir(parents=True, exist_ok=True)
     if out.exists():
         out.unlink()
@@
     out = Path(args["output"])
+    out.parent.mkdir(parents=True, exist_ok=True)
     if out.exists():
         out.unlink()
@@
     out = Path(args["output"])
+    out.parent.mkdir(parents=True, exist_ok=True)
     if out.exists():
         out.unlink()
```

Also applies to: 230-247, 273-290
🤖 Prompt for AI Agents
In `@tests/test_peptide_normalize.py` around lines 198 - 214, The test fails if
the output parent directory doesn't exist: before calling
peptide_normalization(**args) (where args["output"] is used and WriteCSVTask
writes the file), ensure the parent directory of Path(args["output"]) is created
(e.g., out.parent.mkdir(parents=True, exist_ok=True)) so the file can be
written; apply this same fix to the other similar test blocks in
tests/test_peptide_normalize.py that construct out and call
peptide_normalization.
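The fix itself is ordinary pathlib usage; a self-contained sketch using a temp directory rather than the repository's tests/example/out path:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "example" / "out" / "result.csv"

    # parents=True creates intermediate dirs; exist_ok=True makes it idempotent
    out.parent.mkdir(parents=True, exist_ok=True)
    out.parent.mkdir(parents=True, exist_ok=True)  # safe to call again

    out.write_text("peptide,intensity\n")
    created = out.exists()
```

Because `mkdir(parents=True, exist_ok=True)` is idempotent, it is safe to add unconditionally before every write in the affected test blocks.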
Summary by CodeRabbit
New Features
Improvements