Improve chromsizes File Validation to Catch Formatting Errors Early #458
base: master
Thank you for the contribution @ShigrafS! Would you mind adding a simple unit test that confirms the exception gets raised with bad input? You can use a broken version of
@nvictus Sure, I'll do that and let you know.
…o rea-chromsize in util.py
for more information, see https://pre-commit.ci
@nvictus
Co-authored-by: Nezar Abdennur <[email protected]>
@nvictus I've made all the required changes.
@nvictus The PR is ready to be merged.
@nvictus just flagging.
@nvictus This PR is ready to be merged.
Fixes: #209
Original Issues: #142 & #124
Related Issues:
Overview
This pull request improves the `read_chromsizes` function to catch formatting errors in chromsizes files early and provide clear, actionable error messages. Previously, issues like spaces instead of tabs, hidden characters, or malformed rows could slip through, causing confusing downstream errors (e.g., `ValueError: cannot convert float NaN to integer`). Now, the function validates the file format upfront, ensuring it is tab-delimited, has exactly two columns, and contains valid integer lengths, which makes it more robust and user-friendly.
What Was Happening Before?
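The failure is easy to reproduce. The sketch below is an illustration only, not the project's earlier code: it assumes a naive `pandas.read_csv` call with the column names `name` and `length`, and feeds it a space-separated file with made-up sizes.

```python
import io

import pandas as pd

# A chromsizes file that uses a space instead of a tab (illustrative data).
bad_file = io.StringIO("chr1 1000000\nchr2 2000000\n")

# Naive parse: assume tab-delimited input with two columns, no validation.
df = pd.read_csv(bad_file, sep="\t", names=["name", "length"], header=None)

# With no tab on the line, the whole row lands in "name"
# and "length" is padded with NaN.
print(df)

# Any later integer conversion then fails with the vague error.
try:
    int(df["length"].iloc[0])
except ValueError as exc:
    message = str(exc)
    print(message)  # cannot convert float NaN to integer
```

Nothing in that message points back to the chromsizes file, which is why the error was hard to diagnose.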
Malformed rows slipped through `pandas.read_csv`. This led to `NaN` values in the `length` column, which crashed later steps like binning with vague errors.
A row with extra text after the length was read with `1000000 extra_column` as a single value, resulting in `NaN` for `length`. Similarly, spaces instead of tabs (e.g., `chr1 1000000`) caused misparsing.
What’s Changed?
This update adds proactive checks to `read_chromsizes` to catch these issues right away. Here’s what’s new:
Strict Tab Enforcement: Files that use spaces instead of tabs fail early with `ValueError: Chromsizes file uses spaces instead of tabs...`.
Exact Two-Column Validation: The file is parsed by `pandas.read_csv` with `on_bad_lines="error"`, which rejects files with too few or too many columns (e.g., `chr1\t1000000\textra` or `chr1`). This prevents silent misparsing.
Numeric Length Check: The `length` column is converted to numbers with `pd.to_numeric(errors="coerce")`. If any values turn into `NaN` (e.g., due to text like `allele1` or hidden characters), we raise a detailed `ValueError` that lists the invalid rows.
How It Works Now
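As a rough illustration of the three checks, here is a minimal sketch. It is not the code in this PR: `read_chromsizes_strict` is a hypothetical name, the error messages are loosely modeled on the ones quoted in this description, and the column count is checked in plain Python rather than through `on_bad_lines="error"` so the behavior is easy to follow. It also does not reject fractional lengths.

```python
import io

import pandas as pd


def read_chromsizes_strict(handle):
    """Hypothetical stand-in for the checks described here; not cooler's code."""
    names, lengths = [], []
    for lineno, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        # 1. Tab enforcement: a line with spaces but no tab is rejected.
        if "\t" not in line and " " in line.strip():
            raise ValueError(
                f"Chromsizes file uses spaces instead of tabs "
                f"(line {lineno}): {line!r}"
            )
        # 2. Two-column check: exactly one name and one length per row.
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(
                f"Expected 2 tab-separated columns, found {len(fields)} "
                f"(line {lineno}): {line!r}"
            )
        names.append(fields[0])
        lengths.append(fields[1])

    # 3. Numeric check: non-numeric values are coerced to NaN, then reported.
    raw = pd.Series(lengths, index=names, dtype=object)
    numeric = pd.to_numeric(raw, errors="coerce")
    bad = numeric[numeric.isna()]
    if not bad.empty:
        raise ValueError(
            "Chromsizes file contains invalid length values. "
            f"Invalid rows: {list(bad.index)}"
        )
    return numeric.astype("int64").rename("length")


# Usage with a well-formed, tab-delimited file (illustrative sizes).
good = io.StringIO("chr1\t1000000\nchr2\t2000000\n")
sizes = read_chromsizes_strict(good)
print(sizes)
```

Each check raises as soon as it sees a problem, so the message can name the offending line or chromosome instead of surfacing later as a `NaN` conversion error.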
Good File:
→ Parses cleanly and returns a `pd.Series` with lengths indexed by chromosome names.
Bad File with Spaces:
→ Fails early: `ValueError: Chromsizes file uses spaces instead of tabs...`
Bad File with Invalid Lengths:
→ Fails with: `ValueError: Chromsizes file contains invalid length values... Invalid rows: chr2 NaN`
Bad File with Extra Columns:
→ Fails with a `pandas` parsing error about mismatched columns.
Benefits
Notes
The `verbose` option is not included, as per maintainer feedback; it’s not needed here.
Testing
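A test along the lines the reviewer asked for could look like the sketch below. It is an assumption-laden outline, not the test added in this PR: `read_chromsizes_stub` is a hypothetical pure-Python stand-in so the example runs on its own, and in the project's test suite it would be replaced by the real `read_chromsizes`. The sample rows are made up.

```python
import os
import tempfile
import unittest


def read_chromsizes_stub(path):
    """Hypothetical stand-in for the function under test; swap in the real one."""
    sizes = {}
    with open(path) as f:
        for line in f:
            fields = line.rstrip("\r\n").split("\t")
            # Reject rows that are not exactly "<name>\t<integer>".
            if len(fields) != 2 or not fields[1].isdigit():
                raise ValueError(f"Malformed chromsizes row: {line!r}")
            sizes[fields[0]] = int(fields[1])
    return sizes


class TestBrokenChromsizes(unittest.TestCase):
    def _write(self, text):
        # Write the sample to a temporary file and remove it after the test.
        fd, path = tempfile.mkstemp(suffix=".chrom.sizes")
        with os.fdopen(fd, "w") as f:
            f.write(text)
        self.addCleanup(os.remove, path)
        return path

    def test_good_file(self):
        path = self._write("chr1\t1000000\nchr2\t2000000\n")
        self.assertEqual(read_chromsizes_stub(path)["chr2"], 2000000)

    def test_bad_files_raise(self):
        broken = (
            "chr1 1000000\n",          # space instead of tab
            "chr1\t1000000\textra\n",  # extra column
            "chr1\tallele1\n",         # non-numeric length
        )
        for text in broken:
            with self.subTest(text=text):
                with self.assertRaises(ValueError):
                    read_chromsizes_stub(self._write(text))
```

The class can be run with `python -m unittest` or collected by pytest; each broken input is reported separately through `subTest`.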
This update makes `cooler` more reliable and easier to use by catching chromsizes issues upfront with clear guidance for users.