Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## develop #20 +/- ##
===========================================
+ Coverage 76.87% 85.39% +8.51%
===========================================
Files 133 167 +34
Lines 12126 12379 +253
===========================================
+ Hits 9322 10571 +1249
+ Misses 2804 1808 -996 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Ready for review @azimov Sorry for amount of code. |
|
@egillax The review will probably take me some time but I'm very optimistic about this after these changes. The updated benchmarks show that this is at least a 2x performance increase on real data in databricks/healthverity. We can likely do more too (e.g. if cohorts have shared components we can probably build planners that use them). The fact that we can also use this for a unified feature extraction approach also has a good appeal to me. The lack of support for custom eras could be a blocker but there are potential workaround - we could use sqlglot in the cases where the windowing logic is difficult/not supported by ibis implementations. |
|
I think I can add the custom eras. Just didn't see immediately how to do it and rather than having it block this I'll do it separately once I've fogured it out |
There was a problem hiding this comment.
Overall this is pretty strong, I think the code could be tidier and I noted some more patterns that could improve it for extendability.
The planning approach naturally extends itself to feature extraction too, I would expect that we can build some pretty cool stuff with this.
Happy for you to merge when ready
circe/execution/README.md
Outdated
There was a problem hiding this comment.
This and the TESTING.md file should probably live in the docs.
In general I think we should configure the LLMs to make as few .md files as possible as it gets pretty annoying and they're frequently outdated (so will just confuse the next agent).
circe/execution/README.md
Outdated
|
|
||
| ## Canonical Event Schema | ||
|
|
||
| All compiled domain event tables are standardized before cohort orchestration. |
There was a problem hiding this comment.
This seems like it will be very useful going forwards, we can likely build upon this to design ways to understand common patterns an pathways in cohorts
| from ..typing import Table | ||
|
|
||
|
|
||
| class CachedConceptSetResolver: |
There was a problem hiding this comment.
This class could likely be adapted to add a persistent caching layer making the resolution instant in many cases. Probably worth doing in a separate PR though.
| setattr(backend_cls, _PATCH_FLAG, True) | ||
| return True | ||
|
|
||
|
|
There was a problem hiding this comment.
strange flow (and funny naming), would this not be better structured when loading the backend to being with?
| ) -> EventPlan: ... | ||
|
|
||
|
|
||
| LOWERERS: dict[type[Criteria], LowerFn] = { |
There was a problem hiding this comment.
This could be assigned dynamically at runtime on top of criteria classes - I'm thinking with a decorator pattern.
@register_lowering("ProcedureOccurrence")
def lower_procedure_occurrence():
....This would naturally be extendable for extension tables, and removes this hard coding linking. This could go further with filters but that might be too much decorator spam.
| criterion_index: int, | ||
| ) -> EventPlan: | ||
| raw = criterion.raw_criteria | ||
| if not isinstance(raw, Death): |
There was a problem hiding this comment.
A decorator could also remove this boilerplate that is in every function here
| exclude=bool(raw.death_type_exclude), | ||
| ) | ||
|
|
||
| return build_standard_domain_plan( |
There was a problem hiding this comment.
Again - this feels like boilerplate that could be removed.
| from .common import lower_standard_domain_plan | ||
|
|
||
|
|
||
| def lower_dose_era( |
There was a problem hiding this comment.
This is just default behaviour so I don't think it needs a function and a file for every single implementation
| if not isinstance(raw, DeviceExposure): | ||
| raise TypeError("lower_device_exposure requires DeviceExposure criteria") | ||
|
|
||
| steps = lower_common_steps(criterion) |
There was a problem hiding this comment.
I'm wondering if this too could be structured differently, is this over design here?
return (
DomainPlanBuilder(criterion, criterion_index)
.with_concept_filter(
"device_type_concept_id",
concepts_attr="device_type",
codeset_attr="device_type_cs",
exclude_attr="device_type_exclude",
)
.with_text_filter("unique_device_id", value_attr="unique_device_id")
.with_numeric_filter("quantity", value_attr="quantity")
.with_provider_specialty_filter()
.with_visit_filter()
.build()
)My thought is that this type of pattern could be more readable and extendable. We could also add conditions like `min_cdm_ver='5.5' (though this is probably better handled in the model classes).
| ) | ||
|
|
||
|
|
||
| def lower_location_region( |
There was a problem hiding this comment.
this function is totally different to the others, maybe the AI got bored?
|
thanks @azimov , I'm moving the docs and merging. Then I'll create issues for the other suggestions you made for follow up PRs. |
Summary
This PR replaces the legacy builder/context-based Ibis execution prototype with a new layered execution
engine built around normalized models, lowering, Ibis compilation, and explicit cohort orchestration.
The new public execution entrypoints are:
build_cohort(...)write_cohort(...)build_cohort(...)returns a lazy Ibis relation in the canonical execution shape.write_cohort(...)materializes OHDSI cohort-table rows with cohort-scoped semantics:if_exists="fail"errors only if rows already exist for thatcohort_idif_exists="replace"replaces only that cohort’s rows and preserves rows for other cohorts in thesame target table
What Changed
normalize/lower/ibis/engine/build_cohort(...)write_cohort(...)write_cohort(...)Execution Flow
flowchart LR A[Public cohort models] --> B[normalize] B --> C[lower] C --> D[ibis compile] D --> E[engine semantics<br/>primary events -> groups -> inclusion -> end strategy -> censoring -> collapse] E --> F[final Ibis relation] F --> G[build_cohort] F --> H[write_cohort]Why
The old execution prototype had too much mutable, builder-specific state, too much coupling between
cohort semantics and backend-specific implementation details, and too much duplicated execution logic.
This redesign aims to make the execution path:
Future Opportunities Enabled by This Design
This layered execution design should make several follow-up capabilities easier to add without re-
entangling cohort semantics with executor state, including:
Migration Notes
This PR intentionally changes the execution API shape.
If you used the old execution prototype:
build_cohort(...)to get the lazy Ibis relationwrite_cohort(...)for cohort-table writesExamples:
relation.to_pandas()orrelation.to_polars()capture_sql()are not carried forward as executor-owned APIswrite_cohort(...)In practice:
write_cohort(...), not a mutable executor objectReviewer Guide
Recommended reading order:
circe/__init__.pycirce/api.pycirce/execution/README.mdcirce/execution/ARCHITECTURE.mdcirce/execution/TESTING.mdcirce/execution/api.pycirce/execution/normalize/circe/execution/lower/circe/execution/ibis/circe/execution/engine/circe/__init__.pyandcirce/api.pyare worth reading first because they show the external package/API surface exposed by the new execution engine.
Current Limitation
custom_eraend strategy is still unsupported in this execution path and raises an explicitexecution-layer error
Testing
Executed locally:
uv run ruff check .uv run ruff format --check .uv run pytest -qResult:
1075 passed17 skipped1 xfailedNote:
fetch_arrow_table()calls inIbis’s DuckDB backend