
Convert CanadianInvertebrates-ML to Parquet and drop datasets<4 pin #29

@gwtaylor

Description


Problem

The HuggingFace dataset bioscan-ml/CanadianInvertebrates-ML uses a legacy custom loading script (CanadianInvertebrates-ML.py). The datasets library v4.0.0 removed support for loading scripts entirely, which means the dataset cannot be loaded at all with datasets>=4, and this repo must keep its datasets<4 pin and pass trust_remote_code=True.

PR #22 noted this as a longer-term follow-up.

Proposed solution

Convert the dataset on HuggingFace Hub from the legacy loading script format to Parquet:

  1. Load the dataset locally using datasets<4 (the current pinned version works)
  2. Save per-split Parquet files to a data/ directory (e.g., data/train.parquet, data/validation.parquet, data/test.parquet, data/test_unseen.parquet, data/pretrain.parquet)
  3. Update the dataset card (README.md) with YAML configs metadata mapping splits to Parquet files
  4. Remove the loading script and CSV from the HF repo. CanadianInvertebrates-ML.py is a custom datasets.GeneratorBasedBuilder subclass that lives only in the HF dataset repo (not in this GitHub repo); it tells the datasets library how to split and yield rows from CanInv_metadata.csv (676 MB; despite its name, it contains both DNA barcode sequences and taxonomic metadata). Once the data is in per-split Parquet files with YAML configs, both the script and the monolithic CSV are redundant.
  5. Verify the Dataset Viewer works and load_dataset("bioscan-ml/CanadianInvertebrates-ML") succeeds with datasets>=4 (no trust_remote_code needed)

Changes in this repo (BarcodeBERT) after the HF dataset is converted

  • Relax the datasets>=2.16,<4 pin in pyproject.toml (the <4 upper bound becomes unnecessary; check whether other constraints like torchtext still require it)
  • Update data/download_HF_CanInv.py to remove trust_remote_code=True and the explanatory comment
  • Update CONTRIBUTING.md dependency notes if applicable

Rollback plan

HF dataset repos are git repos. If the conversion causes problems:

git clone git@hf.co:datasets/bioscan-ml/CanadianInvertebrates-ML
cd CanadianInvertebrates-ML
git log  # find the pre-conversion commit
git revert <commit-sha>
git push

Data fidelity

The loading script does no transformation on the data: it reads CSV rows via csv.DictReader and yields them directly. DNA barcode sequences are plain ASCII strings (A, C, G, T, N). Parquet handles these losslessly.

Verification should compare old and new formats across all splits:

  • Row counts match per split
  • Column names and dtypes preserved
  • Full content hash (or sorted diff) of each split to confirm byte-level equivalence of string values
  • Spot-check a sample of DNA barcode sequences for correctness

These checks can be parallelised across splits using subagents.
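The first three checks above could be implemented with helpers along these lines (names invented here; rows are plain dicts, as produced by csv.DictReader or a Parquet reader):

```python
import hashlib


def split_fingerprint(rows) -> str:
    """Order-insensitive content hash of a split: render each row as its
    sorted (column, value) pairs, sort the rendered rows, and SHA-256 the
    result. Matching hashes imply byte-identical string values."""
    canonical = sorted(repr(sorted(row.items())) for row in rows)
    return hashlib.sha256("\n".join(canonical).encode()).hexdigest()


def splits_match(old_rows, new_rows) -> bool:
    """Compare row count, column names, and full content for one split."""
    if len(old_rows) != len(new_rows):
        return False

    def columns(rows):
        return sorted(rows[0].keys()) if rows else []

    if columns(old_rows) != columns(new_rows):
        return False
    return split_fingerprint(old_rows) == split_fingerprint(new_rows)
```

Sorting before hashing makes the comparison robust to row-order changes introduced by the conversion.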


Labels: enhancement (New feature or request)