
Convert CanadianInvertebrates-ML to Parquet and drop datasets<4 pin #29

@gwtaylor

Description


Problem

The HuggingFace dataset bioscan-ml/CanadianInvertebrates-ML uses a legacy custom loading script (CanadianInvertebrates-ML.py). The datasets library v4.0.0 removed support for loading scripts entirely, which means the dataset cannot be loaded at all with datasets>=4, and this repo must keep its datasets<4 pin and pass trust_remote_code=True.

PR #22 noted this as a longer-term follow-up.

Proposed solution

Convert the dataset on HuggingFace Hub from the legacy loading script format to Parquet:

  1. Load the dataset locally using datasets<4 (the current pinned version works)
  2. Save per-split Parquet files to a data/ directory (e.g., data/train.parquet, data/validation.parquet, data/test.parquet, data/test_unseen.parquet, data/pretrain.parquet)
  3. Update the dataset card (README.md) with YAML configs metadata mapping splits to Parquet files
  4. Remove the loading script and CSV from the HF repo. CanadianInvertebrates-ML.py is a custom datasets.GeneratorBasedBuilder subclass that lives only in the HF dataset repo (not in this GitHub repo); it tells the datasets library how to split and yield rows from CanInv_metadata.csv (676 MB; despite its name, it contains both DNA barcode sequences and taxonomic metadata). Once the data is in per-split Parquet files with YAML configs, both the script and the monolithic CSV are redundant.
  5. Verify the Dataset Viewer works and load_dataset("bioscan-ml/CanadianInvertebrates-ML") succeeds with datasets>=4 (no trust_remote_code needed)

Changes in this repo (BarcodeBERT) after the HF dataset is converted

  • Relax the datasets>=2.16,<4 pin in pyproject.toml (the <4 upper bound becomes unnecessary; check whether other constraints like torchtext still require it)
  • Update data/download_HF_CanInv.py to remove trust_remote_code=True and the explanatory comment
  • Update CONTRIBUTING.md dependency notes if applicable

Rollback plan

HF dataset repos are git repos. If the conversion causes problems:

git clone git@hf.co:datasets/bioscan-ml/CanadianInvertebrates-ML
cd CanadianInvertebrates-ML
git log  # find the pre-conversion commit
git revert <commit-sha>
git push

Data fidelity

The loading script does no transformation on the data: it reads CSV rows via csv.DictReader and yields them directly. DNA barcode sequences are plain ASCII strings (A, C, G, T, N). Parquet handles these losslessly.

Verification should compare old and new formats across all splits:

  • Row counts match per split
  • Column names and dtypes preserved
  • Full content hash (or sorted diff) of each split to confirm byte-level equivalence of string values
  • Spot-check a sample of DNA barcode sequences for correctness

These checks can be parallelised across splits using subagents.
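The first three checks above could be implemented with helpers along these lines (names invented here; rows are plain dicts, as produced by csv.DictReader or a Parquet reader):

```python
import hashlib


def split_fingerprint(rows) -> str:
    """Order-insensitive content hash of a split: render each row as its
    sorted (column, value) pairs, sort the rendered rows, and SHA-256 the
    result. Matching hashes imply byte-identical string values."""
    canonical = sorted(repr(sorted(row.items())) for row in rows)
    return hashlib.sha256("\n".join(canonical).encode()).hexdigest()


def splits_match(old_rows, new_rows) -> bool:
    """Compare row count, column names, and full content for one split."""
    if len(old_rows) != len(new_rows):
        return False

    def columns(rows):
        return sorted(rows[0].keys()) if rows else []

    if columns(old_rows) != columns(new_rows):
        return False
    return split_fingerprint(old_rows) == split_fingerprint(new_rows)
```

Sorting before hashing makes the comparison robust to row-order changes introduced by the conversion.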


Labels: enhancement (New feature or request)