Conversation

@ShigrafS

@ShigrafS ShigrafS commented Feb 26, 2025

Fixes: #209

Original Issues: #142 & #124

Overview

This pull request improves the read_chromsizes function to catch formatting errors in chromsizes files early and provide clear, actionable error messages. Previously, issues like spaces instead of tabs, hidden characters, or malformed rows could slip through, causing confusing downstream errors (e.g., ValueError: cannot convert float NaN to integer). Now, the function validates the file format upfront, ensuring it’s tab-delimited, has exactly two columns, and contains valid integer lengths—making it more robust and user-friendly.

What Was Happening Before?

  • The Problem: Chromsizes files with formatting issues (e.g., spaces instead of tabs, extra columns, or non-numeric lengths) were parsed silently by pandas.read_csv. This led to NaN values in the length column, which crashed later steps like binning with vague errors.
  • Example Error:
    cooler cload pairix --nproc 9 --assembly gal5 gal5Allele.chrom.sizes:1000 MNP-DT40-1-3-3-R1-T1__gal5.nodups.pairs.gz MNP-DT40-1-3-3-R1-T1__gal5.1000.cool
    Traceback (most recent call last):
      ...
      ValueError: cannot convert float NaN to integer
    
  • Why It Happened:
    • The original code didn’t check the file’s format before processing. For instance, a file like this:
      chr1 1000000   extra_column
      chr2\t2000000
      
      would have its first line, which contains no tab, read as a single name field (chr1 1000000   extra_column), leaving length as NaN. Any row that uses spaces instead of tabs (e.g., chr1 1000000) is misparsed the same way.
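
The failure mode described above can be reproduced in a few lines. This is a minimal sketch, assuming the reader calls pandas.read_csv with sep="\t" and two column names; the variable names and the sample rows are illustrative and are not taken from cooler's source.

```python
import io

import pandas as pd

# The second row uses a space instead of a tab, so pandas sees one field.
text = "chr1\t1000000\nchr2 2000000\n"

df = pd.read_csv(
    io.StringIO(text),
    sep="\t",
    header=None,
    names=["name", "length"],
)

# The whole second line lands in "name"; its "length" is filled with NaN.
print(df[df["length"].isna()])

# Converting that NaN to int reproduces the downstream error message.
try:
    int(df["length"].iloc[1])
except ValueError as exc:
    print(exc)  # cannot convert float NaN to integer
```

Nothing fails at parse time here, which is why the error only surfaced later, far from the file that caused it.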

What’s Changed?

This update adds proactive checks to read_chromsizes to catch these issues right away. Here’s what’s new:

  1. Strict Tab Enforcement:

    • Before reading the file, we peek at the first line. If it contains spaces, we raise an error like:
      ValueError: Chromsizes file 'gal5Allele.chrom.sizes' uses spaces instead of tabs as delimiters. Please use tabs.
      
    • This fixes #124 (“add input validation for util.read_chromsizes”) by ensuring only tab-separated files are accepted.
  2. Exact Two-Column Validation:

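    • We use pandas.read_csv with on_bad_lines="error". pandas treats a line with too many fields as a bad line, so rows such as chr1\t1000000\textra are meant to fail at parse time instead of being misparsed silently. Rows with too few fields (e.g., chr1) are not bad lines in that sense: pandas fills the missing length with NaN, and the numeric length check catches them.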
  3. Numeric Length Check:

    • After loading the file, we convert the length column to numbers with pd.to_numeric(errors="coerce"). If any values turn into NaN (e.g., due to text like allele1 or hidden characters), we raise a detailed error:
      ValueError: Chromsizes file 'gal5Allele.chrom.sizes' contains missing or invalid length values. Please ensure that the file is properly formatted as tab-delimited with two columns: sequence name and integer length. Check for extraneous spaces or hidden characters. Invalid rows:
        name    length
        chrX    NaN
      
    • This fixes #142 (“cryptic error message when chromsizes is not formatted properly”) by replacing the cryptic error with a clear, actionable one.
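
Taken together, the three checks can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: read_chromsizes_strict is a hypothetical name, it only accepts a file path, and the error messages are abbreviated versions of the ones quoted above.

```python
import pandas as pd


def read_chromsizes_strict(path):
    """Hypothetical stand-in for the validated reader (not cooler's code)."""
    # 1. Peek at the first line only; any space in it is treated as a sign
    #    that the file is space-delimited.
    with open(path) as f:
        first_line = f.readline()
    if " " in first_line:
        raise ValueError(
            f"Chromsizes file '{path}' uses spaces instead of tabs "
            "as delimiters. Please use tabs."
        )

    # 2. Parse as two tab-separated columns. on_bad_lines="error" (also the
    #    pandas default) asks pandas to raise on lines with too many fields.
    df = pd.read_csv(
        path,
        sep="\t",
        header=None,
        names=["name", "length"],
        dtype={"name": str},
        on_bad_lines="error",
    )

    # 3. Coerce lengths to numbers; missing or unparseable values become NaN.
    lengths = pd.to_numeric(df["length"], errors="coerce")
    invalid = lengths.isna()
    if invalid.any():
        raise ValueError(
            f"Chromsizes file '{path}' contains missing or invalid length "
            f"values. Invalid rows:\n{df[invalid]}"
        )

    # All values are numeric at this point, so the integer cast cannot hit NaN.
    df["length"] = lengths.astype("int64")
    return df.set_index("name")["length"]
```

Calling read_chromsizes_strict on a well-formed file returns a Series of lengths indexed by sequence name. The sketch does not reject fractional lengths such as 1.5, which the final cast would truncate; a stricter version would check for those as well.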

How It Works Now

  • Good File:

    chr1\t1000000
    chr2\t2000000
    

    → Works perfectly, returns a pd.Series with lengths indexed by chromosome names.

  • Bad File with Spaces:

    chr1 1000000
    chr2 2000000
    

    → Fails early: ValueError: Chromsizes file uses spaces instead of tabs...

  • Bad File with Invalid Lengths:

    chr1\t1000000
    chr2\tnot_a_number
    

    → Fails with: ValueError: Chromsizes file contains invalid length values... Invalid rows: chr2 NaN

  • Bad File with Extra Columns:

    chr1\t1000000\textra
    

    → Fails with a pandas parsing error about mismatched columns.

Benefits

  • Early Detection: Catches errors before they cause downstream crashes.
  • Clear Feedback: Tells users exactly what’s wrong and how to fix it (e.g., “use tabs”, “check for invalid lengths”).
  • Robustness: Handles a wider range of formatting mistakes, like spaces, hidden characters, or extra columns.

Notes

  • Per maintainer feedback, this PR doesn’t add a verbose option; it isn’t needed here.
  • Possible follow-ups (e.g., checking that lengths are positive, or sampling more than the first line for spaces) are deferred to later work.

Testing

  • Tested with:
    • Valid tab-delimited files.
    • Files with spaces instead of tabs.
    • Files with non-numeric lengths or hidden characters.
    • Files with extra or missing columns.
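
The cases listed above map naturally onto a small unit test. The sketch below is one way to write it, under loud assumptions: check_chromsizes is a hypothetical, standard-library-only stand-in for the validated reader so that the example runs on its own, and the broken inputs are made up rather than derived from toy.chrom.sizes.

```python
import os
import tempfile
import unittest


def check_chromsizes(path):
    """Hypothetical stand-in: require two tab-separated fields per row."""
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 2 or not fields[1].isdigit():
                raise ValueError(f"{path}: malformed row {lineno}: {line!r}")


class BrokenChromsizesTest(unittest.TestCase):
    # One deliberately broken file body per failure mode.
    BROKEN = {
        "spaces": "chr1 1000000\nchr2 2000000\n",
        "non_numeric": "chr1\t1000000\nchr2\tnot_a_number\n",
        "extra_column": "chr1\t1000000\textra\n",
        "missing_column": "chr1\n",
    }

    def _write(self, tmpdir, name, text):
        path = os.path.join(tmpdir, name + ".chrom.sizes")
        with open(path, "w") as f:
            f.write(text)
        return path

    def test_valid_file_passes(self):
        with tempfile.TemporaryDirectory() as tmpdir:
            path = self._write(tmpdir, "good", "chr1\t1000000\nchr2\t2000000\n")
            check_chromsizes(path)  # must not raise

    def test_broken_files_raise(self):
        with tempfile.TemporaryDirectory() as tmpdir:
            for name, text in self.BROKEN.items():
                with self.subTest(case=name):
                    path = self._write(tmpdir, name, text)
                    with self.assertRaises(ValueError):
                        check_chromsizes(path)
```

Saved in a test module, this runs with python -m unittest. In the project's own test suite the stand-in would be replaced by the real read_chromsizes.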

This update makes cooler more reliable and easier to use by catching chromsizes issues upfront with clear guidance for users.


@ShigrafS changed the title from “Improve chromsizes file validation to catch formatting errors early (…” to “Improve chromsizes File Validation to Catch Formatting Errors Early” on Feb 26, 2025
@nvictus
Member

nvictus commented Feb 26, 2025

Thank you for the contribution @ShigrafS! Would you mind adding a simple unit test that confirms the exception gets raised with bad input? You can use a broken version of toy.chrom.sizes.

@ShigrafS
Author

@nvictus Sure, I'll do that and let you know.

@ShigrafS
Author

ShigrafS commented Mar 1, 2025

@nvictus
I have added the unit test and made some minor tweaks as well.
Kindly look into it.

@ShigrafS
Author

ShigrafS commented Mar 5, 2025

@nvictus I've made all the required changes.
Kindly look into it.

@ShigrafS
Author

@nvictus The PR is ready to be merged.

@vedatonuryilmaz

@nvictus just flagging.

@ShigrafS
Author

@nvictus This PR is ready to be merged.
Kindly review it.

@ShigrafS ShigrafS requested a review from nvictus April 12, 2025 13:26
Development

Successfully merging this pull request may close these issues.

Better input TSV validation

3 participants