Problem
When syncing taxonomy between environments (production → demo, or between instances), there's no built-in way to export and import the manually curated taxonomy data. The current process relies on ad-hoc Django shell commands and direct SQL.
The CSV bulk import (import_taxa) handles genera and species from Google Sheets, but doesn't cover:
- Special classification labels created by ML models: Not Identifiable, Not Lepidoptera, Not Arthropoda
- Common name search terms (search_names) added manually to orders and key species (e.g. Diptera → ['Flies and Mosquitoes', 'fly'])
- Display names and other metadata adjustments
Proposal
Two management commands: export_taxa and import_taxa_json (to avoid collision with existing import_taxa).
export_taxa
Exports taxa with non-default data (search_names, display_name overrides, special ranks) to JSON.
# Export all taxa with non-default data
python manage.py export_taxa --output taxa_sync.json
# Export specific taxa lists or ranks
python manage.py export_taxa --ranks ORDER,PHYLUM,UNKNOWN --output orders.json
python manage.py export_taxa --has-search-names --output searchable.json
Output format:
{
"version": 1,
"exported_at": "2026-03-24T12:00:00Z",
"source": "antenna.insectai.org",
"taxa": [
{
"name": "Lepidoptera",
"rank": "ORDER",
"display_name": "Lepidoptera",
"search_names": ["Butterflies and Moths", "moth"],
"common_name_en": null,
"parent_name": "Insecta",
"parent_rank": "CLASS",
"active": true
},
{
"name": "Not Identifiable",
"rank": "UNKNOWN",
"display_name": "Not Identifiable",
"search_names": [],
"parent_name": null,
"parent_rank": null,
"active": true
}
]
}
Key: identify taxa by (name, rank) pair, parent by (parent_name, parent_rank). Don't export PKs — they differ between environments.
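A minimal sketch of how a single export record might be built. The `Taxon` dataclass is a hypothetical stand-in for the real Django model and `to_export_record` is an illustrative helper name; only the record keys come from the output format above. The sketch references the parent by (name, rank) and never emits a PK.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Taxon:
    """Hypothetical stand-in for the Django Taxon model."""
    pk: int
    name: str
    rank: str
    display_name: Optional[str] = None
    search_names: Optional[list] = None
    common_name_en: Optional[str] = None
    parent: Optional["Taxon"] = None
    active: bool = True


def to_export_record(taxon: Taxon) -> dict:
    """Build a PK-free record; the parent is referenced by (name, rank)."""
    parent = taxon.parent
    return {
        "name": taxon.name,
        "rank": taxon.rank,
        "display_name": taxon.display_name,
        # Treat None and [] alike so both environments agree on "empty".
        "search_names": list(taxon.search_names or []),
        "common_name_en": taxon.common_name_en,
        "parent_name": parent.name if parent else None,
        "parent_rank": parent.rank if parent else None,
        "active": taxon.active,
    }
```

Because the record carries no PK, the same file can be applied to any environment where the (name, rank) pairs resolve.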
import_taxa_json
Upserts from the export JSON. Match by (name, rank), create if missing, update fields if changed.
# Preview what would change (dry run)
python manage.py import_taxa_json taxa_sync.json --dry-run
# Apply
python manage.py import_taxa_json taxa_sync.json
# Only update search_names, don't create new taxa
python manage.py import_taxa_json taxa_sync.json --update-only --fields search_names
Dry run output:
Would CREATE: Not Arthropoda (PHYLUM) - search_names: ['Not an invertebrate...']
Would UPDATE: Diptera (ORDER) - search_names: ['Flies and Mosquitoes'] → ['Flies and Mosquitoes', 'fly']
Would SKIP: Lepidoptera (ORDER) - no changes
3 taxa in file, 1 create, 1 update, 1 skip
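The create/update/skip decision behind the dry run could be sketched as a pure planning step that runs before anything is written. This is an illustration under assumptions: `plan_import`, its parameters, and the dict shapes are hypothetical, and the real command would read current values from the database instead of an in-memory dict.

```python
def plan_import(existing, incoming, fields=("search_names",), update_only=False):
    """Classify each incoming record as CREATE, UPDATE or SKIP.

    existing: dict mapping (name, rank) -> current field values (assumed shape).
    incoming: list of records in the export format.
    """
    plan = []
    for record in incoming:
        key = (record["name"], record["rank"])
        current = existing.get(key)
        if current is None:
            # Missing on the target: create, unless update-only was requested.
            plan.append(("SKIP" if update_only else "CREATE", key, {}))
            continue
        # Keep only the fields whose value actually differs: field -> (old, new).
        changes = {
            f: (current.get(f), record[f])
            for f in fields
            if f in record and current.get(f) != record[f]
        }
        plan.append(("UPDATE" if changes else "SKIP", key, changes))
    return plan


existing = {
    ("Diptera", "ORDER"): {"search_names": ["Flies and Mosquitoes"]},
    ("Lepidoptera", "ORDER"): {"search_names": ["Butterflies and Moths", "moth"]},
}
incoming = [
    {"name": "Not Arthropoda", "rank": "PHYLUM", "search_names": []},
    {"name": "Diptera", "rank": "ORDER", "search_names": ["Flies and Mosquitoes", "fly"]},
    {"name": "Lepidoptera", "rank": "ORDER", "search_names": ["Butterflies and Moths", "moth"]},
]
for action, (name, rank), changes in plan_import(existing, incoming):
    print(f"Would {action}: {name} ({rank})", changes)
```

Keeping the plan separate from the write step means a dry run and a real run share the same logic, so the preview cannot drift from what is applied.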
Implementation notes
Matching by (name, rank) — the Plecoptera problem
There's a real case where name is ambiguous: Plecoptera exists as both a moth GENUS and an insect ORDER. Production disambiguates with Plecoptera (Order) for the order. The export/import should use (name, rank) as the composite key, which handles this naturally.
For duplicate names within the same rank (shouldn't happen but could with data bugs), the import should warn and skip rather than silently pick one.
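One way the warn-and-skip rule could look, sketched on plain dicts instead of model instances. The helper name `index_by_key` and the row shape are assumptions made for illustration. Rows that share a (name, rank) key are logged and left out of the lookup index, so the import never picks one of them silently.

```python
import logging
from collections import defaultdict

logger = logging.getLogger(__name__)


def index_by_key(taxa):
    """Index rows by (name, rank); keys that occur more than once are excluded.

    taxa: iterable of dicts with at least "name" and "rank" (assumed shape).
    Returns (index, duplicates).
    """
    groups = defaultdict(list)
    for taxon in taxa:
        groups[(taxon["name"], taxon["rank"])].append(taxon)

    index, duplicates = {}, {}
    for key, rows in groups.items():
        if len(rows) > 1:
            # Ambiguous key: warn and leave it out so no row is chosen silently.
            logger.warning(
                "Skipping %s (%s): %d rows share this key", key[0], key[1], len(rows)
            )
            duplicates[key] = rows
        else:
            index[key] = rows[0]
    return index, duplicates
```

With this shape, Plecoptera as a GENUS and Plecoptera as an ORDER land under two distinct keys, while two rows with the same name and the same rank are reported back to the caller.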
Fields to sync
Export these fields (skip if they match defaults/empty):
- search_names (ArrayField): the primary use case
- display_name: usually matches name but can be overridden
- common_name_en: English common name
- active: soft-delete flag
- parent: by name+rank reference
Don't export: pk, created_at, updated_at, parents_json (derived), ordering, sort_phylogeny, gbif_taxon_key, inat_taxon_id (external IDs are environment-specific).
Gotchas from manual sync experience
- search_names is a Postgres ArrayField: empty is [] not None. Some taxa have None, some have []. Normalize on export.
- ML model labels create duplicate taxa: models output labels like moth, nonmoth, which auto-create taxa if they don't match existing names. The proper fix is merge_taxa (PR branch feat/merge-taxa-command), but the export/import should handle the state where duplicates exist (export both, let admin decide which to keep).
- add_genus_parents can misparent: if an order shares a name with a genus (Plecoptera), species get parented to the wrong taxon. The import shouldn't touch parent relationships for species; those come from add_genus_parents.
- DB connection drops on external Postgres: long-running operations against an external DB (like demo's setup) can time out. The import should commit in batches rather than in one big transaction.
- Not Identifiable rank is UNKNOWN: not a standard taxonomic rank, but used by the ML pipeline. The export/import should preserve non-standard ranks.
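The batched-commit idea from the list above could be sketched as follows. The helper names, the `apply_batch` callback, and the default batch size of 100 are assumptions for illustration; in the real command the callback would presumably wrap its writes in Django's `transaction.atomic()` so that each chunk commits on its own.

```python
from itertools import islice


def batched(records, size):
    """Yield successive lists of at most `size` records."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def import_in_batches(records, apply_batch, batch_size=100):
    """Call apply_batch once per chunk; return how many records were handed over.

    apply_batch is a hypothetical callback. In Django it would run its writes
    inside transaction.atomic() so a dropped connection loses one chunk at most.
    """
    total = 0
    for batch in batched(records, batch_size):
        apply_batch(batch)
        total += len(batch)
    return total
```

Since the import is an upsert keyed on (name, rank), rerunning the same file after a dropped connection should simply skip the chunks that already went through.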
Alternatives considered
- pg_dump/pg_restore of just the taxa table: too blunt, since it carries PKs over and runs into foreign key issues
- Django fixtures (dumpdata/loaddata) — includes PKs, doesn't handle upsert
- Extending the import_taxa CSV format: CSV doesn't handle array fields (search_names) cleanly