Roadmap: CMDR ↔ OMOP interoperability (depends on #14) #15

Description

@mellybelly

Companion to #14. This roadmap presumes the layered-module refactor proposed there lands first — several sections below reference classes/modules (cmdr-clinical.yaml, Measurement, Provider, OmopCoded mixin) that don't exist in cmdr today.

Framing: what "interoperability" actually means here

Three layers, often conflated:

  1. Schema-level compatibility — cmdr's class shapes, slot names, and enum codings don't fight OMOP's.
  2. Vocabulary-level compatibility — coded fields bind to OMOP concept IDs using a shared source-vs-standard pattern.
  3. Data-level transformation — a deterministic, testable pipeline that converts cmdr instance data to OMOP CDM rows (and, more modestly, the reverse).

All three matter, but they're sequential: data transforms are brittle unless schema + vocabulary conventions hold first.

Directional asymmetry

  • cmdr → OMOP (common path): export research cohort data for analytics in an OMOP-shaped warehouse. High-fidelity for clinical events (Condition, DrugExposure, Measurement, Visit, Procedure, Death). Lossy for cmdr's research-context additions (Study, Consent, Specimen activity stack, Questionnaire, Family). Acceptable loss, as long as we're explicit about it.
  • OMOP → cmdr (niche but useful): wrap an existing OMOP dataset as a cmdr cohort — e.g., when a research project draws from an EHR warehouse. Clinical tables map in trivially; cmdr's Study, Consent, Participant metadata has to be synthesized from project context.

Treat the cmdr → OMOP direction as load-bearing and ship it first; leave OMOP → cmdr for a later phase.

Phased roadmap

Phase 0 — Baseline (pre-req)

Phase 1 — Schema alignment with OMOP conventions

  • Document an explicit principle in cmdr: "when a concept is clinical and an OMOP concept exists, bind to it by default."
  • Move the existing bdchm-style "meaning: OMOP:xxxxx" bindings from bdc.yaml/bdchm.yaml into the cmdr enums, where they should have been all along.
  • Define cmdr temporal conventions that round-trip to OMOP's *_date / *_datetime pairs (cmdr's TimePoint already supports both).
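
A minimal sketch of the date/datetime convention, assuming a cmdr TimePoint serializes as either a date-only or a full ISO 8601 string. The function name split_timepoint and its prefix argument are hypothetical and not part of cmdr or any OMOP tooling.

```python
from datetime import date, datetime
from typing import Optional


def split_timepoint(value: str, prefix: str) -> dict[str, Optional[str]]:
    """Split an ISO 8601 string into an OMOP *_date / *_datetime pair.

    Assumption: `value` is either 'YYYY-MM-DD' or a full ISO datetime.
    A date-only input leaves the *_datetime column empty rather than
    inventing a midnight timestamp.
    """
    if "T" in value or " " in value:
        dt = datetime.fromisoformat(value)
        return {
            f"{prefix}_date": dt.date().isoformat(),
            f"{prefix}_datetime": dt.isoformat(sep=" "),
        }
    d = date.fromisoformat(value)
    return {f"{prefix}_date": d.isoformat(), f"{prefix}_datetime": None}
```

Leaving *_datetime empty for date-only input keeps the reverse direction honest: a consumer can tell a recorded midnight from an unknown time of day.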

Phase 2 — The concept-triple mixin

Add a LinkML mixin OmopCoded with three slots:

  • concept_id — OMOP standard concept (integer)
  • source_value — the raw recorded string
  • source_concept_id — OMOP concept for the source vocabulary's own term

Apply it to Condition.code, DrugExposure.drug, Measurement.observation_type, Procedure.procedure_type, Visit.category, Observation.type, etc., using slot-level mixin application so downstream schemas can opt out. Rationale: OMOP's most durable contribution is the three-part provenance of a coded value. cmdr should preserve that provenance even when the OMOP concept_ids aren't yet known, so concept_id and source_concept_id are nullable.
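
To make the three slots concrete, here is a Python stand-in for the mixin plus a helper that flattens it into an OMOP column family. This is a sketch only: the real mixin would be declared in LinkML, and the class and function names below are illustrative. The fallback to 0 follows the OMOP unmapped-concept convention noted under "Hard cases".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OmopCoded:
    """Python stand-in for the proposed mixin; all names are illustrative."""
    source_value: str                         # raw recorded string
    concept_id: Optional[int] = None          # OMOP standard concept, if known
    source_concept_id: Optional[int] = None   # OMOP concept for the source term


def unpack(coded: OmopCoded, prefix: str) -> dict[str, object]:
    """Flatten the triple into OMOP's <prefix>_* column family.

    A missing concept becomes 0 so the source value is never dropped.
    """
    return {
        f"{prefix}_concept_id": coded.concept_id or 0,
        f"{prefix}_source_value": coded.source_value,
        f"{prefix}_source_concept_id": coded.source_concept_id or 0,
    }
```

For example, unpack(OmopCoded("E11.9"), "condition") yields a row fragment with both concept columns set to 0 and condition_source_value set to "E11.9".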

Phase 3 — The cmdr-to-omop transform (see "Transform approach" below)

  • Author project/mappings/cmdr-to-omop.transform.yaml using linkml-transformer.
  • Ship a fixture in examples/ that exercises the transform end to end: synthetic Participant + Condition + DrugExposure + Measurement + Specimen + Questionnaire → OMOP CSV rows that validate against the OMOP CDM DDL.
  • Run this as a CI conformance test on every cmdr release.

Phase 4 — Reverse direction (OMOP → cmdr), lightweight

  • A small adapter that wraps a PERSON + event set as a cmdr Participant with a placeholder Study and empty Consent, and translates clinical events back via the inverse rules.
  • Scope this narrowly — don't try to reconstruct research context that isn't there.

Phase 5 — Tooling, docs, ecosystem

  • Python helper that chains: validate cmdr JSON → transform → write OMOP CSV / Parquet / SQL bulk-load.
  • Publish a "cmdr for OMOP users" and "OMOP for cmdr users" page on the cmdr site.
  • Engage with the OHDSI community for feedback.
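
The helper chain in the first bullet could look roughly like this for a single table. It is a sketch under stated assumptions: validate and transform are placeholder callables standing in for the real cmdr validator and the mapping engine, and export_table is a hypothetical name.

```python
import csv
from pathlib import Path
from typing import Callable, Iterable


def export_table(
    records: Iterable[dict],
    validate: Callable[[dict], None],
    transform: Callable[[dict], dict],
    columns: list[str],
    out_path: Path,
) -> int:
    """Validate each record, transform it, and write one OMOP CSV file.

    `validate` should raise on a bad record; `transform` returns one row.
    Keys not listed in `columns` are ignored; missing keys are left blank.
    """
    written = 0
    with out_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns, extrasaction="ignore")
        writer.writeheader()
        for record in records:
            validate(record)
            writer.writerow(transform(record))
            written += 1
    return written
```

Passing the two steps in as callables keeps the I/O code independent of whichever validator and transform engine the project settles on.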

Transform approach: cmdr-to-omop.transform.yaml

Use linkml-transformer's derivation rules. Per target table, define: source class(es), ID strategy, concept-triple unpacking, date decomposition, type-concept mapping, and foreign-key resolution.

Identity strategy (universal rule):

  • cmdr uses string IDs; OMOP uses integer surrogate keys. Introduce a stable deterministic mapping (e.g., xxhash64(cmdr_id) mod 2^31), computed once per transform run, checked for collisions, and persisted alongside the output for reproducibility.
  • Preserve the original cmdr ID in each table's *_source_value column wherever OMOP allows.
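
A sketch of the identity rule, with one substitution called out plainly: BLAKE2b from the Python standard library stands in for xxhash64 so the example has no third-party dependency. The function names are hypothetical.

```python
import hashlib


def surrogate_id(cmdr_id: str) -> int:
    """Map a cmdr string ID to a non-negative 31-bit integer.

    Assumption: BLAKE2b (8-byte digest) replaces xxhash64 here; any
    stable 64-bit hash would serve the same purpose.
    """
    digest = hashlib.blake2b(cmdr_id.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % (2 ** 31)


def build_id_map(cmdr_ids: list[str]) -> dict[str, int]:
    """Assign surrogate keys and fail loudly on a hash collision."""
    seen: dict[int, str] = {}
    id_map: dict[str, int] = {}
    for cmdr_id in cmdr_ids:
        key = surrogate_id(cmdr_id)
        owner = seen.setdefault(key, cmdr_id)
        if owner != cmdr_id:
            raise ValueError(f"collision: {owner!r} and {cmdr_id!r} -> {key}")
        id_map[cmdr_id] = key
    return id_map
```

The collision check matters because a 31-bit space is small: by the birthday bound, collisions become likely once a table holds tens of thousands of IDs. Persisting the resulting map is what makes a run reproducible even if the hashing scheme later changes.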

Class-by-class (abridged):

| cmdr class | OMOP target | ID | Key field mappings |
|---|---|---|---|
| Participant | PERSON | person_id from hash(Participant.id); original → person_source_value | demography.sex → gender_concept_id (OMOP:8507/8532); race → race_concept_id; ethnicity → ethnicity_concept_id; yearOfBirth → year_of_birth |
| ObservationPeriod | OBSERVATION_PERIOD | sequence | period.start/end → observation_period_start/end_date |
| Visit | VISIT_OCCURRENCE | sequence | category (OmopCoded) → visit_concept_id+source; participant → person_id; period → visit_start/end_datetime |
| Condition | CONDITION_OCCURRENCE | sequence | code (OmopCoded) → condition_concept_id/source; provenance → condition_type_concept_id; period → condition_start/end_date; visit → visit_occurrence_id |
| DrugExposure | DRUG_EXPOSURE | sequence | drug (OmopCoded) → drug_concept_id/source; provenance → drug_type_concept_id; dose/route/quantity → quantity/dose_unit_concept_id |
| DeviceExposure | DEVICE_EXPOSURE | sequence | analogous |
| Procedure | PROCEDURE_OCCURRENCE | sequence | analogous |
| Measurement | MEASUREMENT | sequence | type (OmopCoded) → measurement_concept_id; value → value_as_number or value_as_concept_id; unit (OmopCoded) → unit_concept_id |
| Observation / SdohObservation | OBSERVATION | sequence | analogous to Measurement but qualitative/SDOH |
| CauseOfDeath | DEATH | | period.start → death_date; cause (OmopCoded) → cause_concept_id |
| Organization | CARE_SITE (+ LOCATION sidecar) | sequence | name → care_site_name; address slots → LOCATION fields |
| Provider (new abstract) | PROVIDER | sequence | name → provider_name; specialty (OmopCoded) → specialty_concept_id |
| Specimen (core) | SPECIMEN | sequence | type (OmopCoded) → specimen_concept_id; collection date → specimen_date; quantity → quantity+unit_concept_id |
| Relationship / Family | FACT_RELATIONSHIP | | relationshipType (OmopCoded) → relationship_concept_id |
| QuestionnaireResponseItem | OBSERVATION | sequence | item (OmopCoded, e.g. PROMIS/LOINC) → observation_concept_id; response value → value_as_* |
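
As an illustration of the Participant row in the table above, the mapping could be written out by hand as follows. This is not the linkml-transformer rule itself, just a plain-Python sketch. The input dict shape (id, demography.sex, demography.yearOfBirth) is an assumption based on the slot names in the table; race and ethnicity are omitted because the table names no concept IDs for them.

```python
# 8507 (male) and 8532 (female) are the gender concept IDs cited in the table.
GENDER_CONCEPTS = {"male": 8507, "female": 8532}


def participant_to_person(participant: dict, person_id: int) -> dict:
    """Build one PERSON row from a cmdr Participant-like dict.

    Assumption: the input keys mirror the slot names in the table;
    the real cmdr shape may differ.
    """
    demo = participant.get("demography", {})
    sex = demo.get("sex")
    return {
        "person_id": person_id,
        "person_source_value": participant["id"],
        # Unrecognized or missing values fall back to concept 0 (unmapped).
        "gender_concept_id": GENDER_CONCEPTS.get(str(sex).lower(), 0),
        "gender_source_value": sex,
        "year_of_birth": demo.get("yearOfBirth"),
    }
```

The person_id argument would come from the identity strategy, so the row builder itself stays free of hashing concerns.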

Hard cases / accepted losses:

  • Study / ResearchStudy — no OMOP target. Write to OMOP's METADATA table as free-form provenance rows, and emit a sidecar cmdr_study.json alongside the OMOP output. Document that OMOP consumers can ignore it.
  • Consent — no OMOP target. Same sidecar pattern. (OHDSI's nascent work on data-use compatibility could inform future alignment.)
  • SpecimenContainer / SpecimenActivity stack (creation/processing/storage/transport) — OMOP's SPECIMEN is a single flat row. Accept the loss; emit a cmdr_specimen_lineage.json sidecar for recoverability.
  • Questionnaire (instrument structure, skip logic, item grouping) — only the responses survive in OMOP as OBSERVATION rows; the instrument definition goes to sidecar.
  • Group / Characteristic / cohort criteria — could go to OMOP COHORT_DEFINITION, but only if the cohort was derivable from data; otherwise sidecar.
  • Unmapped concepts — set *_concept_id = 0 (OMOP convention), keep original in *_source_value. Don't silently drop.
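
The sidecar pattern in the bullets above can be sketched as a routing step that runs before the table transforms. Assumptions: each record carries its class name under a "type" key, and only the two sidecar file names mentioned in the text are used. Consent and Questionnaire would follow the same pattern once their file names are decided.

```python
import json
from pathlib import Path

# Assumption: records carry their cmdr class name under "type".
# Only the two file names named in the text are used here.
SIDECARS = {
    "Study": "cmdr_study.json",
    "SpecimenActivity": "cmdr_specimen_lineage.json",
    "SpecimenContainer": "cmdr_specimen_lineage.json",
}


def split_for_export(records: list[dict]) -> tuple[list[dict], dict[str, list[dict]]]:
    """Separate records bound for OMOP tables from sidecar-only records."""
    omop_bound: list[dict] = []
    sidecars: dict[str, list[dict]] = {}
    for record in records:
        target = SIDECARS.get(record.get("type"))
        if target is None:
            omop_bound.append(record)
        else:
            sidecars.setdefault(target, []).append(record)
    return omop_bound, sidecars


def write_sidecars(sidecars: dict[str, list[dict]], out_dir: Path) -> None:
    """Write each sidecar as a JSON array next to the OMOP output."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, rows in sidecars.items():
        (out_dir / name).write_text(json.dumps(rows, indent=2), encoding="utf-8")
```

Routing explicitly, rather than dropping unknown classes inside each table rule, makes the list of accepted losses something the code enforces and a test can check.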

Governance & versioning

  • Target OMOP version: pin to v5.4 initially. Add a 6.0 track when 6.0 uptake passes a threshold (currently low).
  • Mapping ownership: lives in the linkml/cmdr repo so it evolves with cmdr. Tag releases matching cmdr's semver.
  • Conformance: CI loads transform output into a Postgres container with OMOP v5.4 DDL; failures block release.
  • Vocabulary currency: OMOP concept_ids change over time. Pin to an ATHENA vocabulary snapshot and record the snapshot ID in transform provenance.
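
One possible shape for the transform provenance mentioned in the last bullet, written alongside each run's output. Every field name here is illustrative; the proposal does not define a provenance format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TransformProvenance:
    """Run-level provenance; every field name here is illustrative."""
    cmdr_version: str          # semver tag of the cmdr release
    omop_cdm_version: str      # pinned CDM version, e.g. "5.4"
    vocabulary_snapshot: str   # identifier of the pinned ATHENA download


def provenance_json(p: TransformProvenance) -> str:
    """Serialize with sorted keys so repeated runs produce identical bytes."""
    return json.dumps(asdict(p), sort_keys=True)
```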

Open design questions

  1. Where should the OmopCoded mixin live — in core cmdr.yaml, or in cmdr-clinical.yaml? (Argument for core: observations and non-clinical things also benefit. Argument for clinical: keeps core domain-neutral.)
  2. Do we want the transform to be losslessly round-trippable (cmdr → OMOP+sidecars → cmdr), or do we accept it as a one-way export? Round-trippability is a strong invariant but costs complexity.
  3. What's the minimal cmdr profile we promise maps cleanly? Declare a "cmdr-omop-conformant" subset; schemas adding beyond it carry their own mapping burden.

Success criteria (what "done" looks like for v1)

  • A published cmdr-to-omop.transform.yaml that covers the table above.
  • CI that validates transform output against OMOP v5.4 DDL on a fixture dataset.
  • Documented list of accepted losses and sidecar conventions.
  • One real-world consumer (likely bdchm) using the mapping in anger.
