Management command for taxonomy export/import (cross-environment sync) #1187

@mihow

Problem

When syncing taxonomy between environments (production → demo, or between instances), there's no built-in way to export and import the manually curated taxonomy data. The current process relies on ad-hoc Django shell commands and direct SQL.

The CSV bulk import (import_taxa) handles genera and species from Google Sheets, but doesn't cover:

  • Special classification labels created by ML models: Not Identifiable, Not Lepidoptera, Not Arthropoda
  • Common name search terms (search_names) added manually to orders and key species (e.g. Diptera: ['Flies and Mosquitoes', 'fly'])
  • Display names and other metadata adjustments

Proposal

Two management commands: export_taxa and import_taxa_json (the latter named to avoid colliding with the existing import_taxa).

export_taxa

Exports taxa with non-default data (search_names, display_name overrides, special ranks) to JSON.

# Export all taxa with non-default data
python manage.py export_taxa --output taxa_sync.json

# Export only specific ranks, or only taxa with search_names set
python manage.py export_taxa --ranks ORDER,PHYLUM,UNKNOWN --output orders.json
python manage.py export_taxa --has-search-names --output searchable.json
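
A minimal sketch of how these flags could be parsed. It uses plain argparse as a stand-in for the parser a Django management command receives in add_arguments. The flag names come from the examples above. Making --output required, upper-casing the ranks, and the function name build_export_parser are assumptions made for illustration.

```python
import argparse


def build_export_parser() -> argparse.ArgumentParser:
    """Argument parser mirroring the flags proposed for export_taxa."""
    parser = argparse.ArgumentParser(prog="export_taxa")
    # Assumption: --output is required; the proposal does not say.
    parser.add_argument("--output", required=True, help="Path of the JSON file to write")
    parser.add_argument(
        "--ranks",
        # Split "ORDER,PHYLUM" into ["ORDER", "PHYLUM"], upper-casing each entry.
        type=lambda value: [r.strip().upper() for r in value.split(",") if r.strip()],
        default=None,
        help="Comma-separated ranks to include (default: all ranks)",
    )
    parser.add_argument(
        "--has-search-names",
        action="store_true",
        help="Only export taxa with at least one search name",
    )
    return parser


args = build_export_parser().parse_args(
    ["--ranks", "ORDER,PHYLUM,UNKNOWN", "--output", "orders.json"]
)
print(args.ranks, args.output, args.has_search_names)
```

In a real command the same add_argument calls would go inside add_arguments(self, parser) on a BaseCommand subclass.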

Output format:

{
  "version": 1,
  "exported_at": "2026-03-24T12:00:00Z",
  "source": "antenna.insectai.org",
  "taxa": [
    {
      "name": "Lepidoptera",
      "rank": "ORDER",
      "display_name": "Lepidoptera",
      "search_names": ["Butterflies and Moths", "moth"],
      "common_name_en": null,
      "parent_name": "Insecta",
      "parent_rank": "CLASS",
      "active": true
    },
    {
      "name": "Not Identifiable",
      "rank": "UNKNOWN",
      "display_name": "Not Identifiable",
      "search_names": [],
      "common_name_en": null,
      "parent_name": null,
      "parent_rank": null,
      "active": true
    }
  ]
}

Key: identify taxa by (name, rank) pair, parent by (parent_name, parent_rank). Don't export PKs — they differ between environments.
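
As a rough illustration of that rule, the sketch below serializes one taxon into the record shape shown above. The Taxon dataclass is a hypothetical stand-in for the real Django model, and taxon_to_record is an invented helper name. Only the field names are taken from the output format.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Taxon:
    """Hypothetical stand-in for the Django Taxon model."""
    name: str
    rank: str
    display_name: Optional[str] = None
    search_names: Optional[list] = None  # may be None or [] in the database
    common_name_en: Optional[str] = None
    parent: Optional["Taxon"] = None
    active: bool = True


def taxon_to_record(taxon: Taxon) -> dict:
    """Serialize a taxon without primary keys; the parent is referenced by (name, rank)."""
    return {
        "name": taxon.name,
        "rank": taxon.rank,
        "display_name": taxon.display_name,
        "search_names": list(taxon.search_names or []),  # None and [] both become []
        "common_name_en": taxon.common_name_en,
        "parent_name": taxon.parent.name if taxon.parent else None,
        "parent_rank": taxon.parent.rank if taxon.parent else None,
        "active": taxon.active,
    }


insecta = Taxon(name="Insecta", rank="CLASS")
lepidoptera = Taxon(
    name="Lepidoptera",
    rank="ORDER",
    display_name="Lepidoptera",
    search_names=["Butterflies and Moths", "moth"],
    parent=insecta,
)
print(json.dumps(taxon_to_record(lepidoptera), indent=2))
```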

import_taxa_json

Upserts from the export JSON. Match by (name, rank), create if missing, update fields if changed.

# Preview what would change (dry run)
python manage.py import_taxa_json taxa_sync.json --dry-run

# Apply
python manage.py import_taxa_json taxa_sync.json

# Only update search_names, don't create new taxa
python manage.py import_taxa_json taxa_sync.json --update-only --fields search_names

Dry run output:

Would CREATE: Not Arthropoda (PHYLUM) - search_names: ['Not an invertebrates...']
Would UPDATE: Diptera (ORDER) - search_names: ['Flies and Mosquitoes'] → ['Flies and Mosquitoes', 'fly']
Would SKIP: Lepidoptera (ORDER) - no changes
3 taxa in file, 1 create, 1 update, 1 skip
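
One way to produce a plan like this is to compute the actions first and apply them only when not in dry-run mode. The sketch below works on plain dicts rather than model instances. plan_import, SYNC_FIELDS, and the sample data are illustrative assumptions, not existing code.

```python
SYNC_FIELDS = ("search_names", "display_name", "common_name_en", "active")


def plan_import(existing, incoming, fields=SYNC_FIELDS, update_only=False):
    """Return (action, key, changes) tuples without modifying anything.

    existing: dict mapping (name, rank) -> dict of current field values
    incoming: list of taxon records from the export file
    """
    plan = []
    for record in incoming:
        key = (record["name"], record["rank"])
        current = existing.get(key)
        if current is None:
            # With update_only, unknown taxa are skipped instead of created.
            plan.append(("SKIP" if update_only else "CREATE", key, {}))
            continue
        # Map each changed field to its (old, new) pair.
        changes = {
            f: (current.get(f), record[f])
            for f in fields
            if f in record and current.get(f) != record[f]
        }
        plan.append(("UPDATE" if changes else "SKIP", key, changes))
    return plan


existing = {
    ("Diptera", "ORDER"): {"search_names": ["Flies and Mosquitoes"]},
    ("Lepidoptera", "ORDER"): {"search_names": ["Butterflies and Moths", "moth"]},
}
incoming = [
    {"name": "Not Arthropoda", "rank": "PHYLUM", "search_names": []},
    {"name": "Diptera", "rank": "ORDER", "search_names": ["Flies and Mosquitoes", "fly"]},
    {"name": "Lepidoptera", "rank": "ORDER", "search_names": ["Butterflies and Moths", "moth"]},
]
for action, (name, rank), changes in plan_import(existing, incoming, fields=("search_names",)):
    print(f"Would {action}: {name} ({rank})", changes or "")
```

Keeping the planning step free of side effects means --dry-run and the real run can share the same code path, with only the final write step differing.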

Implementation notes

Matching by (name, rank) — the Plecoptera problem

There's a real case where the name alone is ambiguous: Plecoptera exists as both a moth GENUS and an insect ORDER. Production disambiguates by using Plecoptera (Order) for the order. The export/import should use (name, rank) as the composite key, which handles this naturally.

For duplicate names within the same rank (shouldn't happen but could with data bugs), the import should warn and skip rather than silently pick one.
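
A sketch of the warn-and-skip behaviour, assuming records are plain dicts with name and rank keys. index_by_key is a hypothetical helper, and the duplicated moth entries are made-up sample data.

```python
from collections import Counter


def index_by_key(records):
    """Index records by (name, rank); return (index, duplicate_keys).

    Keys that occur more than once are left out of the index so the caller
    can warn and skip them rather than silently picking one.
    """
    counts = Counter((r["name"], r["rank"]) for r in records)
    duplicates = sorted(key for key, n in counts.items() if n > 1)
    index = {
        (r["name"], r["rank"]): r
        for r in records
        if counts[(r["name"], r["rank"])] == 1
    }
    return index, duplicates


records = [
    {"name": "Plecoptera", "rank": "GENUS"},
    {"name": "Plecoptera", "rank": "ORDER"},  # same name, different rank: distinct key
    {"name": "moth", "rank": "UNKNOWN"},
    {"name": "moth", "rank": "UNKNOWN"},  # same name and rank: ambiguous
]
index, duplicates = index_by_key(records)
for name, rank in duplicates:
    print(f"WARNING: multiple taxa match {name} ({rank}); skipping")
```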

Fields to sync

Export these fields (skip if they match defaults/empty):

  • search_names (ArrayField) — the primary use case
  • display_name — usually matches name but can be overridden
  • common_name_en — English common name
  • active — soft-delete flag
  • parent — by name+rank reference

Don't export: pk, created_at, updated_at, parents_json (derived), ordering, sort_phylogeny, gbif_taxon_key, inat_taxon_id (external IDs are environment-specific).

Gotchas from manual sync experience

  1. search_names is a Postgres ArrayField: the empty value should be [], not None, but some taxa have None and some have []. Normalize to [] on export.

  2. ML model labels create duplicate taxa: models output labels like moth and nonmoth, which auto-create taxa if they don't match existing names. The proper fix is merge_taxa (PR branch feat/merge-taxa-command), but the export/import should handle the state where duplicates exist (export both, let an admin decide which to keep).

  3. add_genus_parents can misparent: if an order shares a name with a genus (Plecoptera), species get parented to the wrong taxon. The import shouldn't touch parent relationships for species; those come from add_genus_parents.

  4. DB connection drops on external Postgres: long-running operations against an external DB (like demo's setup) can time out. The import should commit in batches rather than in one big transaction.

  5. Not Identifiable rank is UNKNOWN: not a standard taxonomic rank, but used by the ML pipeline. The export/import should preserve non-standard ranks.
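
The batching suggested in item 4 could look roughly like the sketch below. The batch size of 200, the function names, and the apply_batch callable are all assumptions. In a Django command, apply_batch could wrap its writes in transaction.atomic() so that each batch commits on its own.

```python
from itertools import islice


def batched(records, size):
    """Yield successive lists of at most `size` records."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(records)
    while batch := list(islice(iterator, size)):
        yield batch


def import_in_batches(records, apply_batch, size=200):
    """Apply records batch by batch and return the number of records applied.

    apply_batch is a hypothetical callable that writes one batch. If each call
    commits its own transaction, a dropped connection loses at most one batch.
    """
    applied = 0
    for batch in batched(records, size):
        apply_batch(batch)
        applied += len(batch)
    return applied


written = []
total = import_in_batches(range(5), written.append, size=2)
print(total, written)
```

Because the import is an upsert keyed on (name, rank), rerunning it after a dropped connection should be safe: batches that already committed show up as no-change skips.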

Alternatives considered

  • pg_dump/pg_restore of just the taxa table — too blunt, includes PKs and foreign key issues
  • Django fixtures (dumpdata/loaddata) — includes PKs, doesn't handle upsert
  • Extending import_taxa CSV format — CSV doesn't handle array fields (search_names) cleanly
