Problem
The HuggingFace dataset `bioscan-ml/CanadianInvertebrates-ML` uses a legacy custom loading script (`CanadianInvertebrates-ML.py`). The `datasets` library v4.0.0 removed support for loading scripts entirely, which means:
- Users on `datasets>=4` get `RuntimeError: Dataset scripts are no longer supported` (#21, fixed in #22 by pinning `datasets<4`)
- The Hub's automatic Parquet conversion should populate `refs/convert/parquet`, but the branch does not actually exist (the HF refs API returns an empty `converts` array and the tree returns 404). The conversion likely failed silently because the HF server runs `datasets` v4+ and cannot execute the loading script.

PR #22 noted this as a longer-term follow-up.
Proposed solution
Convert the dataset on HuggingFace Hub from the legacy loading script format to Parquet:
- Load the dataset locally using `datasets<4` (the current pinned version works)
- Save per-split Parquet files to a `data/` directory (e.g., `data/train.parquet`, `data/validation.parquet`, `data/test.parquet`, `data/test_unseen.parquet`, `data/pretrain.parquet`)
- Update the dataset card (`README.md`) with YAML `configs` metadata mapping splits to Parquet files
- Remove the loading script and CSV from the HF repo: `CanadianInvertebrates-ML.py` is a custom `datasets.GeneratorBasedBuilder` subclass that lives only in the HF dataset repo (not in this GitHub repo); it tells the `datasets` library how to split and yield rows from `CanInv_metadata.csv` (676 MB; despite the name, it contains both DNA barcode sequences and taxonomic metadata). Once the data is in per-split Parquet files with a YAML config, both the script and the monolithic CSV are redundant.
- Verify the Dataset Viewer works and `load_dataset("bioscan-ml/CanadianInvertebrates-ML")` succeeds with `datasets>=4` (no `trust_remote_code` needed)
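The dataset card's YAML front matter would then point each split at its Parquet file. A sketch of the `configs` metadata, with file names assumed from the layout above:

```yaml
configs:
  - config_name: default
    data_files:
      - split: train
        path: data/train.parquet
      - split: validation
        path: data/validation.parquet
      - split: test
        path: data/test.parquet
      - split: test_unseen
        path: data/test_unseen.parquet
      - split: pretrain
        path: data/pretrain.parquet
```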
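The export step could be sketched as below. This is a minimal helper, not the project's actual tooling; it assumes only that each split object exposes `to_parquet(path)`, as `datasets.Dataset` does in recent `datasets<4` releases, and the function name is hypothetical:

```python
import os


def export_splits_to_parquet(ds, out_dir="data"):
    """Write one Parquet file per split of a datasets.DatasetDict.

    `ds` maps split names to objects exposing .to_parquet(path),
    as datasets.Dataset does. Returns the list of paths written.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for split_name, split in ds.items():
        path = os.path.join(out_dir, f"{split_name}.parquet")
        split.to_parquet(path)
        written.append(path)
    return written
```

Run under the current `datasets<4` pin, e.g. `export_splits_to_parquet(load_dataset("bioscan-ml/CanadianInvertebrates-ML", trust_remote_code=True))`, then upload the resulting `data/` directory to the Hub repo.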
Changes in this repo (BarcodeBERT) after the HF dataset is converted
- Relax the `datasets>=2.16,<4` pin in `pyproject.toml` (the `<4` upper bound becomes unnecessary; check whether other constraints like `torchtext` still require it)
- Update `data/download_HF_CanInv.py` to remove `trust_remote_code=True` and the explanatory comment
- Update `CONTRIBUTING.md` dependency notes if applicable
Rollback plan
HF dataset repos are git repos. If the conversion causes problems:
```shell
git clone git@hf.co:datasets/bioscan-ml/CanadianInvertebrates-ML
cd CanadianInvertebrates-ML
git log            # find the pre-conversion commit
git revert <commit-sha>
git push
```
Data fidelity
The loading script does no transformation on the data; it reads CSV rows via `csv.DictReader` and yields them directly. DNA barcode sequences are plain ASCII strings (A, C, G, T, N). Parquet handles these losslessly.
Verification should compare old and new formats across all splits:
- Row counts match per split
- Column names and dtypes preserved
- Full content hash (or sorted diff) of each split to confirm byte-level equivalence of string values
- Spot-check a sample of DNA barcode sequences for correctness
These checks can be parallelised across splits using subagents.
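These checks could be sketched as a single fingerprint per split, assuming each split is pulled into a plain `{column: values}` dict (e.g. via `Dataset.to_dict()`); the function name is hypothetical:

```python
import hashlib


def split_fingerprint(columns):
    """Fingerprint a split given {column_name: list_of_values}.

    Returns the row count, sorted column names, and an order-independent
    content hash, so the old and new copies of a split can be compared.
    """
    names = sorted(columns)
    n_rows = len(next(iter(columns.values()), []))
    # Serialise each row with unprintable separators, then sort rows so
    # the hash does not depend on row order.
    rows = sorted(
        "\x1f".join(str(columns[c][i]) for c in names) for i in range(n_rows)
    )
    digest = hashlib.sha256("\x1e".join(rows).encode("utf-8")).hexdigest()
    return {"num_rows": n_rows, "columns": names, "sha256": digest}
```

Computing `split_fingerprint` for the script-based and Parquet copies of each split and comparing the results covers the row-count, column, and content checks in one pass; because the hash is order-independent, incidental row reordering during conversion does not trigger a false mismatch.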
References
- #21 (`datasets` v4 incompatibility)
- #22: pin `datasets<4` (merged)
- `refs/convert/parquet` branch never created