
feat: add Darwin Core Archive (DwC-A) export format#1131

Draft
mihow wants to merge 15 commits into main from feat/dwca-export

Conversation


@mihow mihow commented Feb 11, 2026

Summary

Add a new dwca export format that produces a GBIF-compatible Darwin Core Archive (ZIP containing event.txt, occurrence.txt, meta.xml, eml.xml). Uses Event Core architecture: events are the core entity, occurrences are an extension linked via eventID. Field mappings, metadata generators, and taxonomy helpers live in a new ami/exports/dwca.py module, registered via the existing export registry and available immediately through the exports API.

This is the foundation for publishing AMI biodiversity data to GBIF. The initial implementation produces static archive downloads. Future work will add richer metadata, multimedia extensions, and eventually automated publishing via IPT integration.

List of Changes

  • Added ami/exports/dwca.py — DwC field definitions, TSV writer, meta.xml/eml.xml generators, taxonomy helpers
  • Added DwCAExporter(BaseExporter) class in ami/exports/format_types.py
  • Registered "dwca" format in ami/exports/registry.py
  • Added 10 tests in ami/exports/tests.py covering ZIP structure, headers, row counts, meta.xml, referential integrity, taxonomy extraction, EML validity, and collection filtering
  • Added docs/claude/dwca-format-reference.md — DwC-A format reference
  • Extended docs/claude/export-framework.md with API reference, DwC-A specifics, and operations guide

Related Issues

Detailed Description

Archive structure

```text
project_export.zip
├── meta.xml          ← column → DwC term URI mappings
├── eml.xml           ← dataset metadata (EML 2.1.1)
├── event.txt         ← Tab-separated Event Core (one row per sampling event)
└── occurrence.txt    ← Tab-separated Occurrence Extension (one row per occurrence)
```
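The `create_dwca_zip` step named in this PR can be illustrated with a minimal sketch; the real helper lives in `ami/exports/dwca.py` and this signature is an assumption, not the actual implementation:

```python
import zipfile


def create_dwca_zip(zip_path, event_txt_path, occurrence_txt_path, meta_xml, eml_xml):
    """Bundle the four DwC-A components into a single archive.

    The TSV components are passed as file paths, the XML documents as
    strings. Member names inside the ZIP must match the <location>
    entries declared in meta.xml.
    """
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(event_txt_path, arcname="event.txt")
        zf.write(occurrence_txt_path, arcname="occurrence.txt")
        zf.writestr("meta.xml", meta_xml)
        zf.writestr("eml.xml", eml_xml)
    return zip_path
```

If the member names and the `meta.xml` `<location>` entries drift apart, GBIF's validator reports the core or extension file as missing.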

Field mappings: Event Core (event.txt)

Each row = one AMI Event (a single trap/camera night).

| Column | DwC Term | Antenna Source | Notes |
|---|---|---|---|
| eventID | dwc:eventID | `"urn:ami:event:{project_slug}:{event.id}"` | Globally unique URN |
| eventDate | dwc:eventDate | `event.start` / `event.end` | ISO 8601 date or interval (`2024-06-15/2024-06-16`) |
| eventTime | dwc:eventTime | `event.start.time()` | HH:MM:SS |
| year | dwc:year | `event.start.year` | |
| month | dwc:month | `event.start.month` | |
| day | dwc:day | `event.start.day` | |
| samplingProtocol | dwc:samplingProtocol | `"automated light trap with camera"` | Project-level constant |
| sampleSizeValue | dwc:sampleSizeValue | `event.captures_count` | Number of images captured |
| sampleSizeUnit | dwc:sampleSizeUnit | `"images"` | |
| samplingEffort | dwc:samplingEffort | `event.end - event.start` | Formatted as `"12h 30m"` |
| locationID | dwc:locationID | `event.deployment.name` | Trap/deployment name |
| decimalLatitude | dwc:decimalLatitude | `event.deployment.latitude` | 6 decimal places |
| decimalLongitude | dwc:decimalLongitude | `event.deployment.longitude` | 6 decimal places |
| geodeticDatum | dwc:geodeticDatum | `"WGS84"` | Constant |
| datasetName | dwc:datasetName | `event.project.name` | |
| modified | dc:modified | `event.updated_at` | ISO 8601 datetime |
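The `eventDate` interval and `samplingEffort` formatting described above can be sketched as follows; the helper names are illustrative, not the actual functions in `ami/exports/dwca.py`, and the negative-duration guard is an assumption about how inconsistent data should be handled:

```python
def format_event_date(start, end):
    """ISO 8601 date, or a date interval like "2024-06-15/2024-06-16"."""
    if not start:
        return ""
    if end and end.date() != start.date():
        return f"{start.date().isoformat()}/{end.date().isoformat()}"
    return start.date().isoformat()


def format_sampling_effort(start, end):
    """Duration as "12h 30m"; empty string for missing or negative spans."""
    if not start or not end:
        return ""
    total_seconds = int((end - start).total_seconds())
    if total_seconds <= 0:  # guard against end < start (data inconsistency)
        return ""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, _ = divmod(remainder, 60)
    return f"{hours}h {minutes}m" if hours else f"{minutes}m"
```

A typical trap night starting at 20:00 and ending at 08:30 the next morning yields `"2024-06-15/2024-06-16"` and `"12h 30m"`.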

Field mappings: Occurrence Extension (occurrence.txt)

Each row = one AMI Occurrence. Only occurrences with a valid event and a determination are included (orphaned or undetermined records are excluded).

| Column | DwC Term | Antenna Source | Notes |
|---|---|---|---|
| eventID | dwc:eventID | `"urn:ami:event:{slug}:{occurrence.event_id}"` | Links to Event Core |
| occurrenceID | dwc:occurrenceID | `"urn:ami:occurrence:{slug}:{occurrence.id}"` | Globally unique URN |
| basisOfRecord | dwc:basisOfRecord | `"MachineObservation"` | All records are ML-detected |
| occurrenceStatus | dwc:occurrenceStatus | `"present"` | AMI does not record absences |
| scientificName | dwc:scientificName | `occurrence.determination.name` | Taxon name from best determination |
| taxonRank | dwc:taxonRank | `occurrence.determination.rank.lower()` | e.g. `"species"`, `"genus"` |
| kingdom | dwc:kingdom | `determination.parents_json` | Walked from precomputed parent chain |
| phylum | dwc:phylum | `determination.parents_json` | |
| class | dwc:class | `determination.parents_json` | |
| order | dwc:order | `determination.parents_json` | |
| family | dwc:family | `determination.parents_json` | |
| genus | dwc:genus | `determination.parents_json` | |
| specificEpithet | dwc:specificEpithet | Second word of scientificName | e.g. `"plana"` from `"Adaina plana"` |
| vernacularName | dwc:vernacularName | `determination.common_name_en` | English common name |
| taxonID | dwc:taxonID | `determination.gbif_taxon_key` | GBIF backbone taxon key |
| individualCount | dwc:individualCount | `"1"` | Always 1; each occurrence = one individual |
| identificationVerificationStatus | dwc:identificationVerificationStatus | `"verified"` / `"unverified"` | Based on presence of human identifications |
| modified | dc:modified | `occurrence.updated_at` | ISO 8601 datetime |
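The taxonomy columns rely on two small transformations: a rank lookup over the precomputed parent chain and a binomial split for `specificEpithet`. A simplified sketch, assuming `parents_json` is a list of `{"rank": ..., "name": ...}` dicts (the real helpers in `ami/exports/dwca.py` may differ in name and shape):

```python
def get_rank_from_parents(parents_json, rank):
    """Look up an ancestor name by rank from the precomputed parent chain.

    Avoids walking the taxon tree with per-row DB queries (N+1); the
    chain was already serialized onto the determination taxon.
    """
    for parent in parents_json or []:
        if str(parent.get("rank", "")).upper() == rank.upper():
            return parent.get("name", "")
    return ""


def specific_epithet(scientific_name):
    """Second word of a binomial, e.g. "plana" from "Adaina plana"."""
    parts = (scientific_name or "").split()
    return parts[1] if len(parts) >= 2 else ""
```

A rank-only name like `"Adaina"` (genus-level determination) produces an empty `specificEpithet`, which is the expected DwC behavior for non-species records.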

Key design decisions

| Decision | Rationale |
|---|---|
| Event Core (not Occurrence Core) | AMI records sampling effort (trap nights); Event Core is correct for sampling-event datasets |
| `basisOfRecord = "MachineObservation"` | All identifications are ML pipeline outputs |
| Taxonomy from `parents_json` | Precomputed parent chain avoids N+1 queries; no additional DB joins needed |
| `individualCount = 1` | Each Occurrence represents one individual in AMI's model; `detections_count` is bounding boxes across frames, not individuals |
| URN-format IDs | `urn:ami:event:{slug}:{id}` is globally unique, stable, and human-readable |
| Events derived from filtered occurrences | `get_events_queryset()` builds the event list from `self.queryset.values_list("event_id")`, ensuring the event core and occurrence extension stay in sync when filters are applied |
| Direct TSV writing | DwC fields are flat extractions, not nested API representations; avoids DRF serializer overhead |
| `queryset.iterator(chunk_size=500)` | Memory-efficient streaming for large exports |
| EML 2.1.1 | Current GBIF-accepted version; upgrade to 2.2.0 is a follow-up item |
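The last two decisions combine into the TSV writer: field definitions are `(term_uri, column_name, extractor)` tuples, matching the mapping tuples visible in the review comments below, and rows are streamed in batches. This is a sketch under those assumptions, not the actual `write_tsv` implementation:

```python
import csv


def write_tsv(path, fields, queryset, slug):
    """Stream rows to a TSV file without loading the full queryset.

    `fields` is a list of (term_uri, column_name, extractor) tuples;
    extractors take (obj, slug) and return a string value for one cell.
    """
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
        # Header row uses the column names; term URIs go into meta.xml.
        writer.writerow([name for _, name, _ in fields])
        # iterator(chunk_size=500) streams results in batches so large
        # exports do not hold every model instance in memory at once.
        for obj in queryset.iterator(chunk_size=500):
            writer.writerow([extract(obj, slug) for _, _, extract in fields])
```

Because the same `fields` list drives both the TSV header and the `meta.xml` term mappings, column order and term declarations cannot drift apart.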

Near follow-up (before real GBIF submission)

  • Apply project default quality filters (min confidence threshold) — without this, low-confidence ML determinations get published
  • Add license field on events — GBIF requires a dataset license for reuse terms
  • Add identifiedBy / dateIdentified — provenance for who/what made the determination
  • Add associatedMedia — detection image URLs
  • Runtime validation before packaging — check for missing required fields, orphaned references
  • Multimedia extension file (multimedia.txt)
  • Upgrade EML to 2.2.0

Eventual follow-up

  • EML geographic/temporal coverage computed from actual data
  • country, stateProvince, locality on events (requires reverse geocoding)
  • coordinateUncertaintyInMeters, institutionCode, collectionCode
  • GBIF Data Validator automated integration test
  • IPT server integration

How to Test the Changes

Prerequisites

  • A running Antenna instance with at least one project containing events and ML-identified occurrences
  • API access (JWT or session auth) with Researcher or Admin role

1 · Create an export via the API

```sh
curl -X POST https://<host>/api/v2/exports/ \
  -H "Authorization: ******" \
  -H "Content-Type: application/json" \
  -d '{"project": <project_id>, "format": "dwca"}'
```

Note the id from the response, then poll until file_url is non-null:

```sh
curl https://<host>/api/v2/exports/<export_id>/ -H "Authorization: ******"
```

2 · Inspect the ZIP

```sh
curl -o export.zip "<file_url>"
unzip -l export.zip
# Expected: event.txt, occurrence.txt, meta.xml, eml.xml

unzip -p export.zip event.txt | head -3
unzip -p export.zip occurrence.txt | head -3
unzip -p export.zip meta.xml
unzip -p export.zip eml.xml
```

3 · Validate with GBIF Data Validator

Upload the ZIP to https://www.gbif.org/tools/data-validator. Expected: no critical errors.

4 · Spot-check referential integrity

```sh
unzip -p export.zip event.txt | cut -f1 | sort > /tmp/event_ids.txt
unzip -p export.zip occurrence.txt | tail -n +2 | cut -f1 | sort | uniq > /tmp/coreid_ids.txt
comm -23 /tmp/coreid_ids.txt /tmp/event_ids.txt
# → should produce no output (all coreIDs matched)
```

5 · Test with collection filter

```sh
curl -X POST https://<host>/api/v2/exports/ \
  -H "Authorization: ******" \
  -H "Content-Type: application/json" \
  -d '{"project": <project_id>, "format": "dwca", "filters": {"collection_id": <col_id>}}'
```

6 · Run automated tests

```sh
python manage.py test ami.exports.tests.DwCAExporterTests --verbosity=2
```

All 10 DwC-A tests should pass (ZIP structure, headers, row counts, meta.xml, referential integrity, taxonomy, EML validity, collection filtering).

Screenshots

N/A — backend export format, no UI changes.

Deployment Notes

No migrations or config changes required. The "dwca" format is registered automatically and appears in the API's valid format choices immediately on deployment.
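The automatic registration works through the registry's decorator pattern. A self-contained sketch of the idea — simplified and not the actual `ami/exports/registry.py` code:

```python
class ExportRegistry:
    """Minimal illustration of a format registry with decorator registration."""

    _exporters = {}

    @classmethod
    def register(cls, format_name):
        def decorator(exporter_cls):
            cls._exporters[format_name] = exporter_cls
            return exporter_cls
        return decorator

    @classmethod
    def get_exporter(cls, format_name):
        return cls._exporters[format_name]

    @classmethod
    def get_supported_formats(cls):
        return list(cls._exporters)


@ExportRegistry.register("dwca")
class DwCAExporter:
    """Stand-in for the real exporter in ami/exports/format_types.py."""
```

Because registration happens at import time, the format appears in the API's valid choices as soon as the module is loaded, with no migration or settings change.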

Checklist

  • I have tested these changes appropriately.
  • I have added and/or modified relevant tests.
  • I updated relevant documentation or comments.
  • I have verified that this PR follows the project's coding standards.
  • Any dependent changes have already been merged to main.

mihow and others added 4 commits February 11, 2026 11:45
Add planning document for Darwin Core Archive export format, export
framework technical reference, DwC-A format reference with field
mappings, and downloaded DwC terms quick reference from TDWG.

Co-Authored-By: Claude <noreply@anthropic.com>
Add Event Core + Occurrence Extension DwC-A exporter that produces a
GBIF-compatible ZIP containing event.txt, occurrence.txt, meta.xml,
and eml.xml. Events are the core entity with occurrences linked via
eventID foreign key.

Key design decisions:
- Direct TSV writing with iterator(chunk_size=500) instead of DRF serializers
- Taxonomy hierarchy extracted from parents_json to avoid N+1 queries
- meta.xml generated from the same field definitions used for TSV columns
- basisOfRecord = "MachineObservation" for all records
- URN format IDs: urn:ami:event:{slug}:{id}, urn:ami:occurrence:{slug}:{id}

Co-Authored-By: Claude <noreply@anthropic.com>
Test ZIP structure, event/occurrence headers and row counts, meta.xml
core/extension structure, referential integrity between events and
occurrences, taxonomy hierarchy extraction from parents_json, specific
epithet parsing, and EML metadata validity.

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Claude <noreply@anthropic.com>

netlify bot commented Feb 11, 2026

Deploy Preview for antenna-preview canceled.

| Name | Link |
|---|---|
| 🔨 Latest commit | c8aadb7 |
| 🔍 Latest deploy log | https://app.netlify.com/projects/antenna-preview/deploys/698d322b1918830008eff4fe |


netlify bot commented Feb 11, 2026

Deploy Preview for antenna-ssec canceled.

| Name | Link |
|---|---|
| 🔨 Latest commit | c8aadb7 |
| 🔍 Latest deploy log | https://app.netlify.com/projects/antenna-ssec/deploys/698d322b1bf45a0008d103f1 |


coderabbitai bot commented Feb 11, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Walkthrough

This pull request introduces a complete Darwin Core Archive (DwC-A) export format for AMI data. It adds a new exporter class, field mapping definitions, helper utilities for data transformation, metadata generation, registry integration, comprehensive test coverage, and technical documentation.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Planning & Framework Docs**<br>`.agents/planning/dwca-export-plan.md`, `docs/claude/export-framework.md` | High-level planning document for the DwC-A feature and technical reference for the Export Framework architecture, registry pattern, and runtime flow. |
| **DwC-A Core Implementation**<br>`ami/exports/dwca.py` | Implements DwC field mappings for Event and Occurrence, helper functions for date/coordinate/taxonomy formatting, TSV export logic, meta.xml and eml.xml generation, and ZIP archiving. |
| **Exporter Integration**<br>`ami/exports/format_types.py`, `ami/exports/registry.py` | Adds `DwCAExporter` class with `get_queryset`, `get_events_queryset`, and `export` methods; registers the "dwca" format in `ExportRegistry` (note: `DwCAExporter` appears duplicated in `format_types.py`). |
| **Test Suite**<br>`ami/exports/tests.py` | Comprehensive `DwCAExportTest` covering exporter registration, ZIP contents validation, header correctness, row counts, referential integrity, taxonomy extraction, and EML metadata. |
| **DwC-A Reference Documentation**<br>`docs/claude/dwca-format-reference.md` | Detailed DwC-A format specification including archive structure, meta.xml/EML schemas, DwC term mappings, and validation rules. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant API as DataExport API
    participant Exporter as DwCAExporter
    participant DB as Database
    participant Files as File System
    participant Storage as ZIP Archive

    Client->>API: POST export (format=dwca)
    API->>Exporter: initialize & export()
    Exporter->>DB: get_events_queryset()
    DB-->>Exporter: Event records
    Exporter->>Files: write_tsv(event.txt)
    Files-->>Exporter: event.txt path
    Exporter->>DB: get_queryset()
    DB-->>Exporter: Occurrence records
    Exporter->>Files: write_tsv(occurrence.txt)
    Files-->>Exporter: occurrence.txt path
    Exporter->>Exporter: generate_meta_xml()
    Exporter->>Exporter: generate_eml_xml()
    Exporter->>Storage: create_dwca_zip(event.txt, occurrence.txt, meta.xml, eml.xml)
    Storage-->>Exporter: dwca.zip path
    Exporter-->>API: return ZIP path
    API-->>Client: export complete with download URL
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 A Darwin's dream in archives bright,
Events and occurrences packed just right,
Metadata nestled in XML's fold,
DwC-A stories ready to be told! 📦

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation / Resolution |
|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 79.31%, below the required 80.00% threshold. Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The PR title 'feat: add Darwin Core Archive (DwC-A) export format' clearly and concisely summarizes the main change: introducing a new export format for DwC-A archives. |
| Description check | ✅ Passed | PR description follows the template structure with Summary, Design decisions, Changes table, Test plan, Known issues, and Near/Eventual follow-ups. All required sections are present and substantive. |


@mihow mihow marked this pull request as ready for review February 11, 2026 20:34
Copilot AI review requested due to automatic review settings February 11, 2026 20:34

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 12

🤖 Fix all issues with AI agents
In `@ami/exports/dwca.py`:
- Around line 132-142: The _format_duration function must guard against negative
durations: check event.start and event.end and compute delta = event.end -
event.start, then if delta.total_seconds() <= 0 return an empty string (or
optionally log a warning) instead of proceeding; update the logic around
total_seconds / divmod so negative values are not formatted (use the check on
delta.total_seconds() before computing hours/minutes and then format only for
positive durations).
- Around line 62-63: The taxonRank mapping lambda can raise AttributeError when
o.determination.rank is None; update the DWC + "taxonRank" tuple's lambda (the
function referencing o.determination.rank.lower()) to guard by checking both
o.determination and o.determination.rank before calling .lower(), returning an
empty string (or appropriate default) when rank is falsy.
- Line 87: The mapping that emits DWC + "individualCount" currently uses
getattr(o, "detections_count", 0) which is semantically wrong; change the tuple
entry for DWC + "individualCount" so it always emits the string "1" (i.e.,
replace the lambda that references detections_count with one that returns "1")
or remove the entire tuple if you prefer omission; update the tuple containing
DWC + "individualCount" and the lambda (the anonymous function referencing
getattr(o, "detections_count", 0)) accordingly.
- Around line 75-79: The ternary expression in the tuple entry with key
"vernacularName" is ambiguous; update the lambda (lambda o, slug:
o.determination.common_name_en or "" if o.determination else "") to add
parentheses around the ternary's true branch so it reads like
(o.determination.common_name_en or "") if o.determination else "" to make the
intent explicit and easier to read.
- Around line 309-312: The EML export is inserting the owner's email into the
<surName> element (project.owner.email) which leaks PII and is semantically
wrong; update the block that creates creator -> individualName -> surName to
avoid using project.owner.email: prefer using a non-PII display name (e.g.,
project.owner.get_display_name() or project.owner.name) if available, and if no
safe display name exists omit creating the individualName/surName elements
entirely (i.e., only create creator when a valid non-email name/contact is
present). Ensure references to project.owner, creator, individualName, and
surName are updated accordingly.
- Around line 288-290: Change the naive timestamp to a timezone-aware one:
replace the use of datetime.datetime.now() when assigning now (used alongside
project_slug/slugify) with Django's timezone-aware function (timezone.now()),
and add the appropriate import from django.utils import timezone at the top of
the module so now is timezone-aware when used in exports/dwca.py.

In `@ami/exports/format_types.py`:
- Around line 196-208: The Occurrence queryset in get_queryset currently
includes occurrences lacking related event or determination, producing empty
coreid/scientificName; update get_queryset (the Occurrence.objects.valid() chain
in the get_queryset method) to filter out occurrences with null event and null
determination (e.g., .filter(event__isnull=False, determination__isnull=False))
before the select_related/.with_identifications() calls so only occurrences with
both event and determination are exported.
- Around line 237-264: The temp files created for the DwC-A export
(event_file.name, occ_file.name and the produced zip_path) are never cleaned up;
wrap the export sequence in a try/finally around the code that calls
get_events_queryset, write_tsv, generate_meta_xml, generate_eml_xml and
create_dwca_zip so that in the finally block you remove event_file.name,
occ_file.name and (on error or success) delete any intermediate temp zip if
appropriate; ensure update_export_stats and the return of zip_path still occur
on success but that file removal happens only for the intermediate files (or
conditionally for the zip if you decide not to persist it).

In `@ami/exports/tests.py`:
- Around line 477-502: The test test_taxonomy_hierarchy_extraction can pass
vacuously when taxon.parents_json is empty; update the test to guarantee
parents_json is populated before calling _get_rank_from_parents: after obtaining
taxon (from occurrence.determination) and calling
taxon.save(update_calculated_fields=True) and taxon.refresh_from_db(), add an
assertion that taxon.parents_json is truthy (or else set up explicit test data/a
fixture so parents_json contains a parent chain) so the loop over ranks actually
executes and _get_rank_from_parents is exercised.

In `@docs/claude/dwca-format-reference.md`:
- Around line 94-113: The documentation shows EML 2.2.0 namespace
("https://eml.ecoinformatics.org/eml-2.2.0") but the implementation currently
emits EML 2.1.1 (literal "eml://ecoinformatics.org/eml-2.1.1"); either make the
docs reflect 2.1.1 or update the implementation to 2.2.0 so they match. To fix:
search the codebase for the literal "eml://ecoinformatics.org/eml-2.1.1" (and
any code that constructs the eml:eml root or schemaLocation) and change the
namespace and schemaLocation to the 2.2.0 values used in the doc, or
alternatively update the doc snippet to the 2.1.1 namespace and schemaLocation;
ensure the change is applied to the EML root construction logic (the
function/method that builds the eml:eml element) and the documentation fragment
so both use the same namespace string.

In `@docs/claude/export-framework.md`:
- Line 13: Update the documentation table entry that currently lists
JSONExporter and CSVExporter to also include the new DwCAExporter: edit the row
showing `ami/exports/format_types.py` so it mentions Concrete exporters
`JSONExporter`, `CSVExporter`, and `DwCAExporter` (use the exact class name
DwCAExporter to match the implementation).
- Around line 68-71: Update the example output of
ExportRegistry.get_supported_formats() to include the newly registered "dwca"
format so the snippet accurately reflects current registrations; adjust the
example list returned by ExportRegistry.get_supported_formats() (used alongside
ExportRegistry.register and ExportRegistry.get_exporter in the snippet) to
include "dwca" among the formats (e.g., ["occurrences_api_json",
"occurrences_simple_csv", "dwca"]).
🧹 Nitpick comments (3)
docs/claude/dwca-format-reference.md (1)

9-16: Add a language identifier to the fenced code block.

The linter flags this code block for missing a language specifier (MD040). Since it's a file tree diagram, use ```text.

ami/exports/tests.py (1)

329-341: Each test re-runs the full export pipeline independently.

Every test method calls _run_export(), which creates a DataExport, runs the entire export pipeline, and writes to storage. For 7 tests that call this helper, that's significant overhead. Consider using setUpClass or caching the export result in setUp to run the export once and share the ZIP across structural validation tests.

This is fine for correctness — just a test performance observation.

ami/exports/dwca.py (1)

286-286: Unused events_queryset parameter.

events_queryset is accepted but never used in generate_eml_xml. Either remove it or use it to populate temporal/geographic coverage metadata (the latter is listed as a follow-up). For now, removing the parameter or marking it for future use avoids confusion.

Proposed fix — remove unused parameter

```diff
-def generate_eml_xml(project, events_queryset=None) -> str:
+def generate_eml_xml(project) -> str:
```

This also requires updating the call site in ami/exports/format_types.py (line 258):

```diff
-        eml_xml = generate_eml_xml(self.project, events_qs)
+        eml_xml = generate_eml_xml(self.project)
```

Comment on lines +75 to +79

```python
    (
        DWC + "vernacularName",
        "vernacularName",
        lambda o, slug: o.determination.common_name_en or "" if o.determination else "",
    ),
```

🛠️ Refactor suggestion | 🟠 Major

vernacularName lambda: add parentheses for clarity (known issue #4).

While Python's operator precedence makes this technically correct (or binds tighter than the ternary), the expression is confusing to read and a common source of bugs. Adding explicit parentheses removes ambiguity.

Proposed fix
```diff
     (
         DWC + "vernacularName",
         "vernacularName",
-        lambda o, slug: o.determination.common_name_en or "" if o.determination else "",
+        lambda o, slug: (o.determination.common_name_en or "") if o.determination else "",
     ),
```
🧰 Tools
🪛 Ruff (0.15.0)

[warning] 78-78: Unused lambda argument: slug

(ARG005)

🤖 Prompt for AI Agents
In `@ami/exports/dwca.py` around lines 75 - 79, The ternary expression in the
tuple entry with key "vernacularName" is ambiguous; update the lambda (lambda o,
slug: o.determination.common_name_en or "" if o.determination else "") to add
parentheses around the ternary's true branch so it reads like
(o.determination.common_name_en or "") if o.determination else "" to make the
intent explicit and easier to read.

Comment on lines +132 to +142
```python
def _format_duration(event) -> str:
    """Format event duration as human-readable string."""
    if not event.start or not event.end:
        return ""
    delta = event.end - event.start
    total_seconds = int(delta.total_seconds())
    hours, remainder = divmod(total_seconds, 3600)
    minutes, _ = divmod(remainder, 60)
    if hours > 0:
        return f"{hours}h {minutes}m"
    return f"{minutes}m"
```

⚠️ Potential issue | 🟡 Minor

Negative durations are not guarded against.

If event.end < event.start (data inconsistency), total_seconds will be negative, and divmod with a negative dividend produces negative results, yielding strings like "-1h -30m". Consider returning an empty string or logging a warning for negative durations.

Proposed fix
```diff
 def _format_duration(event) -> str:
     """Format event duration as human-readable string."""
     if not event.start or not event.end:
         return ""
     delta = event.end - event.start
     total_seconds = int(delta.total_seconds())
+    if total_seconds < 0:
+        return ""
     hours, remainder = divmod(total_seconds, 3600)
     minutes, _ = divmod(remainder, 60)
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```diff
 def _format_duration(event) -> str:
     """Format event duration as human-readable string."""
     if not event.start or not event.end:
         return ""
     delta = event.end - event.start
     total_seconds = int(delta.total_seconds())
+    if total_seconds < 0:
+        return ""
     hours, remainder = divmod(total_seconds, 3600)
     minutes, _ = divmod(remainder, 60)
     if hours > 0:
         return f"{hours}h {minutes}m"
     return f"{minutes}m"
```
🤖 Prompt for AI Agents
In `@ami/exports/dwca.py` around lines 132 - 142, The _format_duration function
must guard against negative durations: check event.start and event.end and
compute delta = event.end - event.start, then if delta.total_seconds() <= 0
return an empty string (or optionally log a warning) instead of proceeding;
update the logic around total_seconds / divmod so negative values are not
formatted (use the check on delta.total_seconds() before computing hours/minutes
and then format only for positive durations).

Comment on lines +288 to +290

```python
    project_slug = slugify(project.name)
    now = datetime.datetime.now().strftime("%Y-%m-%dT%H:%M:%S")
```

⚠️ Potential issue | 🟡 Minor

Use timezone-aware datetime.

datetime.datetime.now() produces a naive datetime. In a Django project with USE_TZ=True, this is inconsistent and could produce incorrect timestamps. Use Django's timezone.now().

Proposed fix
```diff
+    from django.utils import timezone
+
     project_slug = slugify(project.name)
-    now = datetime.datetime.now().strftime("%Y-%m-%dT%H:%M:%S")
+    now = timezone.now().strftime("%Y-%m-%dT%H:%M:%S")
```
📝 Committable suggestion

Suggested change

```diff
+    from django.utils import timezone
+
     project_slug = slugify(project.name)
-    now = datetime.datetime.now().strftime("%Y-%m-%dT%H:%M:%S")
+    now = timezone.now().strftime("%Y-%m-%dT%H:%M:%S")
```
🤖 Prompt for AI Agents
In `@ami/exports/dwca.py` around lines 288 - 290, Change the naive timestamp to a
timezone-aware one: replace the use of datetime.datetime.now() when assigning
now (used alongside project_slug/slugify) with Django's timezone-aware function
(timezone.now()), and add the appropriate import from django.utils import
timezone at the top of the module so now is timezone-aware when used in
exports/dwca.py.

Comment on lines +94 to +113
```xml
<?xml version="1.0" encoding="UTF-8"?>
<eml:eml xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="https://eml.ecoinformatics.org/eml-2.2.0 https://eml.ecoinformatics.org/eml-2.2.0/eml.xsd"
packageId="urn:ami:dataset:{project_id}" system="AMI">
<dataset>
<title>{project.name}</title>
<creator>
<organizationName>{project.owner or institution}</organizationName>
</creator>
<abstract>
<para>{project.description}</para>
</abstract>
<intellectualRights>
<para>License information here</para>
</intellectualRights>
</dataset>
</eml:eml>
```

⚠️ Potential issue | 🟡 Minor

EML version mismatch between documentation and implementation.

This reference doc shows EML 2.2.0 namespace (https://eml.ecoinformatics.org/eml-2.2.0), but the actual implementation in ami/exports/dwca.py (line 293) uses EML 2.1.1 (eml://ecoinformatics.org/eml-2.1.1). The planning doc acknowledges upgrading to 2.2.0 as a follow-up, but this reference should either match the current implementation or clearly note the discrepancy.

🤖 Prompt for AI Agents
In `@docs/claude/dwca-format-reference.md` around lines 94 - 113, The
documentation shows EML 2.2.0 namespace
("https://eml.ecoinformatics.org/eml-2.2.0") but the implementation currently
emits EML 2.1.1 (literal "eml://ecoinformatics.org/eml-2.1.1"); either make the
docs reflect 2.1.1 or update the implementation to 2.2.0 so they match. To fix:
search the codebase for the literal "eml://ecoinformatics.org/eml-2.1.1" (and
any code that constructs the eml:eml root or schemaLocation) and change the
namespace and schemaLocation to the 2.2.0 values used in the doc, or
alternatively update the doc snippet to the 2.1.1 namespace and schemaLocation;
ensure the change is applied to the EML root construction logic (the
function/method that builds the eml:eml element) and the documentation fragment
so both use the same namespace string.

Comment on lines +68 to +71
```python
ExportRegistry.register("format_name")(ExporterClass)
ExportRegistry.get_exporter("format_name")   # → ExporterClass
ExportRegistry.get_supported_formats()       # → ["occurrences_api_json", "occurrences_simple_csv"]
```

⚠️ Potential issue | 🟡 Minor

Example output of get_supported_formats() is incomplete.

The example on line 70 should include "dwca" to reflect the new registration.

Proposed fix

```diff
-ExportRegistry.get_supported_formats()       # → ["occurrences_api_json", "occurrences_simple_csv"]
+ExportRegistry.get_supported_formats()       # → ["occurrences_api_json", "occurrences_simple_csv", "dwca"]
```

Copilot AI left a comment


Pull request overview

Adds a new export format (dwca) to generate a GBIF-compatible Darwin Core Archive (DwC-A) ZIP using an Event core (event.txt) with an Occurrence extension (occurrence.txt), along with generated meta.xml and eml.xml.

Changes:

  • Introduces ami/exports/dwca.py with DwC field mappings, TSV writing, and meta/eml XML generation.
  • Adds DwCAExporter to the existing export framework and registers it under "dwca".
  • Adds a new DwC-A-focused test suite validating ZIP contents and basic structure.

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 14 comments.

Show a summary per file

| File | Description |
|------|-------------|
| `ami/exports/dwca.py` | New DwC-A field definitions and archive/metadata generators. |
| `ami/exports/format_types.py` | Adds `DwCAExporter` implementation to orchestrate TSV generation + ZIP packaging. |
| `ami/exports/registry.py` | Registers the new `"dwca"` export format. |
| `ami/exports/tests.py` | Adds end-to-end tests covering ZIP structure, TSV headers/rows, and XML parsing. |
| `docs/claude/export-framework.md` | New internal technical reference for the export framework. |
| `docs/claude/dwca-format-reference.md` | New DwC-A format reference documentation. |
| `.agents/planning/dwca-export-plan.md` | Planning/design notes and known issues list for DwC-A export. |


| File | Description |
|------|---------|
| `ami/exports/base.py` | `BaseExporter` ABC - all exporters inherit from this |
| `ami/exports/registry.py` | `ExportRegistry` - maps format strings to exporter classes |
| `ami/exports/format_types.py` | Concrete exporters: `JSONExporter`, `CSVExporter` |

Copilot AI Feb 11, 2026


The “Concrete exporters” bullet lists only JSONExporter and CSVExporter, but this PR adds DwCAExporter in ami/exports/format_types.py. Update this reference so the technical doc matches the supported exporters.

Suggested change

```diff
-| `ami/exports/format_types.py` | Concrete exporters: `JSONExporter`, `CSVExporter` |
+| `ami/exports/format_types.py` | Concrete exporters: `JSONExporter`, `CSVExporter`, `DwCAExporter` |
```

Copilot uses AI. Check for mistakes.
Comment on lines +196 to +208

```python
def get_queryset(self):
    """Return the occurrence queryset (used by BaseExporter for record count)."""
    return (
        Occurrence.objects.valid()  # type: ignore[union-attr]
        .filter(project=self.project)
        .select_related(
            "determination",
            "event",
            "deployment",
        )
        .with_detections_count()
        .with_identifications()
    )
```

Copilot AI Feb 11, 2026


DwC-A occurrence exports can currently include rows with event_id or determination missing, which produces empty coreid/eventID and scientificName values in occurrence.txt (GBIF validator rejects these). Filter the queryset to exclude occurrences with null event and null determination (and adjust the expected counts in tests accordingly).

```python
    progress_callback=lambda n: self.update_job_progress(n),
)
logger.info(f"DwC-A: wrote {occ_count} occurrences")
```


Copilot AI Feb 11, 2026


Job progress for DwC-A exports is only updated every 500 occurrences; for smaller exports progress may remain at 0% even though the job completes. Consider calling update_job_progress(occ_count) once at the end (guarding against total_records == 0 to avoid division by zero in BaseExporter.update_job_progress()).

Suggested change

```python
# Ensure final progress update, even if the callback was never invoked.
# Guard against total_records == 0 to avoid division by zero in
# BaseExporter.update_job_progress().
if getattr(self, "total_records", 0):
    self.update_job_progress(occ_count)
```

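The division-by-zero concern raised above can be illustrated with a standalone sketch. `BaseExporterSketch` is a hypothetical stand-in (the real `BaseExporter` is in `ami/exports/base.py`); the point is the early return when `total_records` is zero:

```python
class BaseExporterSketch:
    """Illustrative stand-in for BaseExporter's progress tracking."""

    def __init__(self, total_records: int):
        self.total_records = total_records
        self.progress = 0.0

    def update_job_progress(self, records_done: int) -> None:
        # Empty querysets would otherwise divide by zero here.
        if not self.total_records:
            return
        self.progress = min(records_done / self.total_records, 1.0)


exporter = BaseExporterSketch(total_records=0)
exporter.update_job_progress(0)  # safe: no ZeroDivisionError
```

With this guard in place, the final `update_job_progress(occ_count)` call suggested above is safe even for exports that matched no records.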
Comment on lines +240 to +250

```python
# Column 0 is the id (eventID)
id_elem = ET.SubElement(core, "id")
id_elem.set("index", "0")

for i, (term_uri, header, _) in enumerate(event_fields):
    if i == 0:
        continue  # Already declared as <id>
    field = ET.SubElement(core, "field")
    field.set("index", str(i))
    field.set("term", term_uri)
```

Copilot AI Feb 11, 2026


In meta.xml generation, column 0 is declared as <id index="0"/> but the corresponding <field index="0" term="...eventID"/> mapping is skipped. Many DwC-A consumers/validators expect the ID column to also be mapped to the Darwin Core term (as in the DwC Text Guide examples). Include a <field> mapping for index 0 instead of skipping it.

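The fix the reviewer asks for — mapping every column, including index 0, to a DwC term — can be sketched by simply dropping the `continue`. The two-column `event_fields` list here is an illustrative subset, not the full event schema:

```python
import xml.etree.ElementTree as ET

DWC = "http://rs.tdwg.org/dwc/terms/"
# Illustrative subset of event columns: (term URI, TSV header)
event_fields = [
    (DWC + "eventID", "eventID"),
    (DWC + "eventDate", "eventDate"),
]

core = ET.Element("core")
# Declare column 0 as the row identifier...
ET.SubElement(core, "id").set("index", "0")
# ...and still emit a <field> mapping for every column, index 0 included,
# so validators can resolve both the <id> and the eventID term.
for i, (term_uri, _header) in enumerate(event_fields):
    field = ET.SubElement(core, "field")
    field.set("index", str(i))
    field.set("term", term_uri)

print(ET.tostring(core, encoding="unicode"))
```

The resulting `<core>` element carries both `<id index="0"/>` and `<field index="0" term=".../eventID"/>`, matching the DwC Text Guide examples the comment cites.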
mihow and others added 2 commits February 11, 2026 13:33
- Guard taxonRank against None rank (AttributeError on .lower())
- Guard _get_rank_from_parents against None det_rank
- Fix vernacularName ternary precedence with explicit parentheses
- Change individualCount to emit "1" (each occurrence = 1 individual,
  not detections_count which counts bounding boxes)
- Guard _format_duration against negative durations

Co-Authored-By: Claude <noreply@anthropic.com>
- Filter out occurrences with null event or determination from DwC-A
  export queryset (GBIF rejects empty coreid/scientificName)
- Replace project.owner.email with project.owner.name in EML creator
  element to avoid leaking PII in downloadable archives
- Only emit individualName when owner has a name set

Co-Authored-By: Claude <noreply@anthropic.com>
@mihow mihow marked this pull request as draft February 11, 2026 21:51
mihow and others added 8 commits February 11, 2026 15:26
- Wrap DwC-A export in try/finally to clean up intermediate temp files
- Use timezone.now() instead of naive datetime.datetime.now()
- Use full EML schemaLocation URL for GBIF validation compatibility
- Remove unused events_queryset parameter from generate_eml_xml
- Simplify progress_callback lambda to direct method reference

Co-Authored-By: Claude <noreply@anthropic.com>
- Update occurrence row count test to match null-filtered queryset
- Assert parents_json is populated in taxonomy hierarchy test
- Use settings.MEDIA_URL instead of hardcoded "/media/" in test helper
- Add DwCAExporter to export-framework.md file table and registry example
- Fix EML version in dwca-format-reference.md to match implementation (2.1.1)

Co-Authored-By: Claude <noreply@anthropic.com>
…pdate

- Map all columns (including index 0) to DwC term URIs in meta.xml
  so GBIF validators can resolve both <id>/<coreid> and <field> entries
- Change fieldsEnclosedBy from empty string to double-quote character
  for stricter parser compatibility
- Add final progress update after TSV writing so small exports (<500
  records) report completion instead of staying at 0%

Co-Authored-By: Claude <noreply@anthropic.com>
Run the export pipeline once in setUpClass and share the ZIP across all
structural validation tests instead of re-running per test method. This
reduces test time from ~7 export runs to 1.

Co-Authored-By: Claude <noreply@anthropic.com>
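The share-one-export pattern described in this commit looks roughly like the following. This sketch uses plain `unittest` and a placeholder file; the real suite uses Django's `TestCase` and runs the actual export pipeline in `setUpClass`:

```python
import pathlib
import tempfile
import unittest


class DwCAArchiveTests(unittest.TestCase):
    """Sketch: build the expensive artifact once, share it across all tests."""

    @classmethod
    def setUpClass(cls):
        super().setUpClass()
        # Stand-in for the real export pipeline run (done once, not per test).
        cls.zip_path = pathlib.Path(tempfile.mkdtemp()) / "project_export.zip"
        cls.zip_path.write_bytes(b"PK\x03\x04")  # placeholder bytes, not a real archive

    def test_archive_exists(self):
        self.assertTrue(self.zip_path.exists())

    def test_archive_nonempty(self):
        self.assertGreater(self.zip_path.stat().st_size, 0)
```

Because `setUpClass` runs once per class, each structural assertion reads the same archive instead of re-running the export, which is what cuts the suite from ~7 export runs to 1.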
…occurrences in DwC-A

Remove get_filter_backends() override that returned [], allowing DwCAExporter
to inherit BaseExporter's OccurrenceCollectionFilter. Update get_events_queryset()
to derive events from self.queryset instead of fetching all project events,
preventing orphaned events when collection_id filtering is active.

Co-Authored-By: Claude <noreply@anthropic.com>
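The "derive events from the filtered occurrences" rule in this commit reduces to a set operation, sketched here over plain dicts (the real code works on Django querysets; the row shapes are illustrative):

```python
# Occurrences that survived filtering (e.g. by collection_id).
occurrences = [
    {"occurrenceID": "occ-1", "eventID": "ev-1"},
    {"occurrenceID": "occ-2", "eventID": "ev-1"},
    {"occurrenceID": "occ-3", "eventID": "ev-2"},
]

# Only events referenced by an exported occurrence make it into event.txt,
# so a filtered archive never contains orphaned event rows.
event_ids = sorted({o["eventID"] for o in occurrences})
print(event_ids)  # → ['ev-1', 'ev-2']
```

Deriving events this way keeps `event.txt` and `occurrence.txt` referentially consistent no matter which filter backends were active.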
…ssertion

Add test_dwca_export_with_collection_filter that verifies filtered exports
produce correct occurrence/event counts and referential integrity. Update
test_event_headers_and_row_count to expect events derived from occurrences
rather than all project events.

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Add API methods, request/response format, filter system, DwC-A specifics,
job integration, and file lifecycle details. Remove separate export-system.md.

Co-Authored-By: Claude <noreply@anthropic.com>
@mihow mihow added the next up! label Mar 26, 2026
@mihow
Collaborator Author

mihow commented Apr 1, 2026

@copilot add a more accessible overview of all the field mappings and decisions from Antenna -> Darwin Core & GBIF in the PR description so this can be more easily reviewed & tested. Add how to test instructions.


Copilot AI commented Apr 1, 2026


Updated the PR description with:

  • Field mapping tables for both event.txt (16 columns) and occurrence.txt (18 columns) showing the Antenna model source, DwC term, and any relevant notes for each field
  • Design decisions table — explains why Event Core was chosen, why individualCount=1, how taxonomy is extracted, why events are derived from the filtered occurrence queryset, etc.
  • Step-by-step testing guide covering: API export creation, polling for completion, ZIP inspection, GBIF Data Validator upload, shell-based referential integrity check, collection filter test, and the automated test commands with expected test names

Copilot finished work on behalf of mihow April 1, 2026 06:10