
Features/ibis new design#20

Merged
egillax merged 18 commits into develop from features/ibis-new-design-develop
Mar 20, 2026
Conversation

@egillax egillax commented Mar 18, 2026

Summary

This PR replaces the legacy builder/context-based Ibis execution prototype with a new layered execution
engine built around normalized models, lowering, Ibis compilation, and explicit cohort orchestration.

The new public execution entrypoints are:

  • build_cohort(...)
  • write_cohort(...)

build_cohort(...) returns a lazy Ibis relation in the canonical execution shape.

write_cohort(...) materializes OHDSI cohort-table rows with cohort-scoped semantics:

  • if_exists="fail" errors only if rows already exist for that cohort_id
  • if_exists="replace" replaces only that cohort’s rows and preserves rows for other cohorts in the
    same target table
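The cohort-scoped semantics above can be illustrated with a minimal sketch. This is not the `write_cohort(...)` implementation from this PR: it uses the standard-library `sqlite3` module instead of Ibis, and the helper name `write_rows_scoped` and the table name `cohort` are hypothetical. The column names follow the standard OHDSI cohort table.

```python
import sqlite3

# Standard OHDSI cohort-table columns; the table name is an assumption.
COHORT_DDL = """
CREATE TABLE IF NOT EXISTS cohort (
    cohort_definition_id INTEGER,
    subject_id INTEGER,
    cohort_start_date TEXT,
    cohort_end_date TEXT
)
"""


def write_rows_scoped(conn, cohort_id, rows, if_exists="fail"):
    """Hypothetical helper: write rows for one cohort_id only.

    rows is an iterable of (subject_id, start_date, end_date) tuples.
    """
    if if_exists not in ("fail", "replace"):
        raise ValueError(f"unsupported if_exists value: {if_exists!r}")
    conn.execute(COHORT_DDL)
    (existing,) = conn.execute(
        "SELECT COUNT(*) FROM cohort WHERE cohort_definition_id = ?",
        (cohort_id,),
    ).fetchone()
    # "fail" only errors when this cohort already has rows.
    if existing and if_exists == "fail":
        raise ValueError(f"rows already exist for cohort_id={cohort_id}")
    with conn:  # delete + insert commit together or not at all
        # "replace" removes only this cohort's rows; other cohorts are untouched.
        conn.execute(
            "DELETE FROM cohort WHERE cohort_definition_id = ?", (cohort_id,)
        )
        conn.executemany(
            "INSERT INTO cohort VALUES (?, ?, ?, ?)",
            [(cohort_id, subject, start, end) for subject, start, end in rows],
        )
```

The point of the sketch is that both the existence check and the delete are filtered by cohort id, so a shared target table can hold many cohorts safely.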

What Changed

  • replaced the legacy builder/context-based Ibis execution path with a layered execution subsystem
  • introduced explicit execution layers:
    • normalize/
    • lower/
    • ibis/
    • engine/
  • changed the public execution API to function-first entrypoints:
    • build_cohort(...)
    • write_cohort(...)
  • standardized compiled domain events into a canonical event schema before cohort orchestration
  • implemented cohort-scoped write semantics in write_cohort(...)
  • added focused execution tests and documented the intended testing strategy
  • added architecture documentation for reviewers and future maintainers

Execution Flow

flowchart LR
    A[Public cohort models] --> B[normalize]
    B --> C[lower]
    C --> D[ibis compile]
    D --> E[engine semantics<br/>primary events -> groups -> inclusion -> end strategy -> censoring -> collapse]
    E --> F[final Ibis relation]
    F --> G[build_cohort]
    F --> H[write_cohort]

Why

The old execution prototype had too much mutable, builder-specific state, too much coupling between
cohort semantics and backend-specific implementation details, and too much duplicated execution logic.

This redesign aims to make the execution path:

  • easier to reason about
  • easier to test by layer
  • easier to maintain
  • easier to extend to new semantics and backends
  • less dependent on mutable executor state
  • less duplicated across execution concerns

Future Opportunities Enabled by This Design

This layered execution design should make several follow-up capabilities easier to add without
re-entangling cohort semantics with executor state, including:

  • stage-labeled execution tracing
  • final and intermediate SQL inspection
  • optional backend-level executed-statement logging
  • semantic “explain” views for cohort execution
  • plan-level diffs for regression review
  • smarter caching or partial rematerialization
  • richer provenance and debugging tooling
  • alternative wrapper shells built on the same execution core

Migration Notes

This PR intentionally changes the execution API shape.

If you used the old execution prototype:

  • use build_cohort(...) to get the lazy Ibis relation
  • use backend/relation methods directly for inspection and collection
  • use write_cohort(...) for cohort-table writes

Examples:

  • old dataframe collection helpers map to relation methods such as relation.to_pandas() or
    relation.to_polars()
  • old SQL inspection helpers such as capture_sql() are not carried forward as executor-owned APIs
  • SQL inspection now happens through the returned Ibis relation and backend tooling
  • old executor-owned write behavior maps to write_cohort(...)

In practice:

  • inspection and collection now happen through the returned relation and backend
  • write semantics now live in write_cohort(...), not a mutable executor object

Reviewer Guide

Recommended reading order:

  1. circe/__init__.py
  2. circe/api.py
  3. circe/execution/README.md
  4. circe/execution/ARCHITECTURE.md
  5. circe/execution/TESTING.md
  6. circe/execution/api.py
  7. circe/execution/normalize/
  8. circe/execution/lower/
  9. circe/execution/ibis/
  10. circe/execution/engine/

circe/__init__.py and circe/api.py are worth reading first because they show the external
package/API surface exposed by the new execution engine.

Current Limitation

Testing

Executed locally:

  • uv run ruff check .
  • uv run ruff format --check .
  • uv run pytest -q

Result:

  • 1075 passed
  • 17 skipped
  • 1 xfailed

Note:

  • pytest still emits DuckDB/Ibis deprecation warnings from upstream fetch_arrow_table() calls in
    Ibis’s DuckDB backend

codecov bot commented Mar 18, 2026

Codecov Report

❌ Patch coverage is 95.47206% with 94 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.39%. Comparing base (fa4e9c5) to head (0db6e37).
⚠️ Report is 19 commits behind head on develop.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| circe/execution/engine/group_demographics.py | 82.97% | 16 Missing ⚠️ |
| circe/execution/engine/primary.py | 78.57% | 9 Missing ⚠️ |
| circe/execution/ibis/person_filters.py | 89.39% | 7 Missing ⚠️ |
| circe/execution/engine/group_operators.py | 92.06% | 5 Missing ⚠️ |
| circe/execution/engine/groups.py | 88.37% | 5 Missing ⚠️ |
| circe/execution/normalize/cohort.py | 93.24% | 5 Missing ⚠️ |
| circe/execution/api.py | 91.66% | 4 Missing ⚠️ |
| circe/execution/ibis_compat.py | 88.57% | 4 Missing ⚠️ |
| circe/execution/lower/common.py | 95.83% | 4 Missing ⚠️ |
| circe/execution/normalize/groups.py | 94.44% | 4 Missing ⚠️ |

... and 23 more
Additional details and impacted files
@@             Coverage Diff             @@
##           develop      #20      +/-   ##
===========================================
+ Coverage    76.87%   85.39%   +8.51%     
===========================================
  Files          133      167      +34     
  Lines        12126    12379     +253     
===========================================
+ Hits          9322    10571    +1249     
+ Misses        2804     1808     -996     


@egillax egillax requested a review from azimov March 19, 2026 09:13
egillax commented Mar 19, 2026

Ready for review @azimov

Sorry for the amount of code.

azimov commented Mar 19, 2026

@egillax The review will probably take me some time, but I'm very optimistic about this after these changes. The updated benchmarks show at least a 2x performance increase on real data in Databricks/HealthVerity. We can likely do more too (e.g. if cohorts share components, we can probably build planners that reuse them). That we can also use this for a unified feature extraction approach appeals to me as well.

The lack of support for custom eras could be a blocker, but there are potential workarounds: we could use sqlglot in cases where the windowing logic is difficult or not supported by Ibis implementations.

egillax commented Mar 19, 2026

I think I can add the custom eras. I just didn't immediately see how to do it, and rather than having it block this I'll do it separately once I've figured it out.


@azimov azimov left a comment

Overall this is pretty strong. I think the code could be tidier, and I noted some more patterns that could improve its extensibility.

The planning approach naturally extends itself to feature extraction too, I would expect that we can build some pretty cool stuff with this.

Happy for you to merge when ready

This and the TESTING.md file should probably live in the docs.

In general I think we should configure the LLMs to make as few .md files as possible as it gets pretty annoying and they're frequently outdated (so will just confuse the next agent).


## Canonical Event Schema

All compiled domain event tables are standardized before cohort orchestration.

This seems like it will be very useful going forward; we can likely build upon this to design ways to understand common patterns and pathways in cohorts.

from ..typing import Table


class CachedConceptSetResolver:

This class could likely be adapted to add a persistent caching layer making the resolution instant in many cases. Probably worth doing in a separate PR though.
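As a rough illustration of what such a persistent layer might look like, here is a sketch that stores resolved concept IDs on disk, keyed by a hash of the concept set expression. It does not reflect the interface of `CachedConceptSetResolver`; the class name `PersistentConceptSetCache`, the `resolve` callable, and the dictionary-shaped expression are all assumptions made for the example.

```python
import hashlib
import json
import sqlite3


class PersistentConceptSetCache:
    """Hypothetical disk-backed cache in front of a slow resolver."""

    def __init__(self, path, resolve):
        # path: SQLite file path (":memory:" works for experiments).
        # resolve: callable taking an expression dict, returning concept IDs.
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS resolved "
            "(key TEXT PRIMARY KEY, concept_ids TEXT)"
        )
        self._resolve = resolve

    @staticmethod
    def _key(expression):
        # Canonical JSON so that key order does not change the hash.
        payload = json.dumps(expression, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def get(self, expression):
        key = self._key(expression)
        row = self._conn.execute(
            "SELECT concept_ids FROM resolved WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return json.loads(row[0])  # cache hit: no resolver call
        concept_ids = sorted(self._resolve(expression))
        with self._conn:
            self._conn.execute(
                "INSERT INTO resolved VALUES (?, ?)",
                (key, json.dumps(concept_ids)),
            )
        return concept_ids
```

A real version would also need an invalidation rule, for example including the vocabulary version in the key, since resolved concept IDs depend on the vocabulary tables they were resolved against.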

setattr(backend_cls, _PATCH_FLAG, True)
return True



Strange flow (and funny naming). Would this not be better structured when loading the backend to begin with?

) -> EventPlan: ...


LOWERERS: dict[type[Criteria], LowerFn] = {

This could be assigned dynamically at runtime on top of criteria classes - I'm thinking with a decorator pattern.

@register_lowering("ProcedureOccurrence")
def lower_procedure_occurrence():
    ...

This would naturally be extendable for extension tables, and removes this hard coding linking. This could go further with filters but that might be too much decorator spam.
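A minimal sketch of the registration idea, under stated assumptions: the stand-in `Criteria` classes, the `lower(...)` dispatcher, and the dictionary used in place of `EventPlan` are hypothetical, and the registry is keyed by class (matching the `dict[type[Criteria], LowerFn]` annotation in the diff) rather than by the string name used in the comment above.

```python
from typing import Callable


# Stand-ins for the real criteria model classes (assumptions for this sketch).
class Criteria: ...
class ProcedureOccurrence(Criteria): ...
class Death(Criteria): ...


LowerFn = Callable[[Criteria, int], dict]

# Filled in by the decorator at import time instead of by a hand-written dict.
LOWERERS: dict[type, LowerFn] = {}


def register_lowering(criteria_cls):
    """Register the decorated function as the lowerer for criteria_cls."""
    def decorator(fn):
        if criteria_cls in LOWERERS:
            raise ValueError(f"duplicate lowerer for {criteria_cls.__name__}")
        LOWERERS[criteria_cls] = fn
        return fn
    return decorator


@register_lowering(ProcedureOccurrence)
def lower_procedure_occurrence(criteria, criterion_index):
    # A plain dict stands in for the real EventPlan type.
    return {"domain": "procedure_occurrence", "index": criterion_index}


def lower(criteria, criterion_index):
    """Dispatch on the exact criteria class."""
    try:
        fn = LOWERERS[type(criteria)]
    except KeyError:
        name = type(criteria).__name__
        raise NotImplementedError(f"no lowerer registered for {name}") from None
    return fn(criteria, criterion_index)
```

With this shape, an extension table would only need its own decorated function in its own module; nothing in the core mapping has to change.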

criterion_index: int,
) -> EventPlan:
raw = criterion.raw_criteria
if not isinstance(raw, Death):

A decorator could also remove this boilerplate that is in every function here
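One way this could look, as a sketch only: a decorator that performs the `isinstance` check on `criterion.raw_criteria` and passes the already-checked raw object to the wrapped function. The decorator name `requires_raw`, the stand-in classes, and the tuple return value are hypothetical; only the `raw_criteria` attribute and the error-message pattern come from the diff context.

```python
import functools
from dataclasses import dataclass


# Stand-ins for the real model classes (assumptions for this sketch).
class Death: ...
class DeviceExposure: ...


@dataclass
class NormalizedCriterion:
    raw_criteria: object


def requires_raw(expected_type):
    """Check criterion.raw_criteria once, then hand it to the lowerer."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(criterion, criterion_index):
            raw = criterion.raw_criteria
            if not isinstance(raw, expected_type):
                raise TypeError(
                    f"{fn.__name__} requires {expected_type.__name__} criteria"
                )
            return fn(criterion, raw, criterion_index)
        return wrapper
    return decorator


@requires_raw(Death)
def lower_death(criterion, raw, criterion_index):
    # The body can use `raw` directly; the type check has already happened.
    return ("death", criterion_index)
```

The same decorator could be folded into a registration decorator so that each lowering function declares its criteria type exactly once.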

exclude=bool(raw.death_type_exclude),
)

return build_standard_domain_plan(

Again - this feels like boilerplate that could be removed.

from .common import lower_standard_domain_plan


def lower_dose_era(

This is just default behaviour so I don't think it needs a function and a file for every single implementation

if not isinstance(raw, DeviceExposure):
raise TypeError("lower_device_exposure requires DeviceExposure criteria")

steps = lower_common_steps(criterion)

I'm wondering if this too could be structured differently. Is this over-designed?

    return (
        DomainPlanBuilder(criterion, criterion_index)
        .with_concept_filter(
            "device_type_concept_id",
            concepts_attr="device_type",
            codeset_attr="device_type_cs",
            exclude_attr="device_type_exclude",
        )
        .with_text_filter("unique_device_id", value_attr="unique_device_id")
        .with_numeric_filter("quantity", value_attr="quantity")
        .with_provider_specialty_filter()
        .with_visit_filter()
        .build()
    )

My thought is that this type of pattern could be more readable and extendable. We could also add conditions like `min_cdm_ver='5.5'` (though this is probably better handled in the model classes).
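To make the fluent pattern concrete, here is a minimal sketch of how such a builder could work internally. `DomainPlanBuilder` does not exist in this PR; the two filter methods shown, the tuple-based step representation, and reading attributes straight off the criterion object are assumptions chosen to keep the example small.

```python
from types import SimpleNamespace


class DomainPlanBuilder:
    """Hypothetical fluent builder that collects filter steps."""

    def __init__(self, criterion, criterion_index):
        self._criterion = criterion
        self._criterion_index = criterion_index
        self._steps = []

    def _add_if_set(self, kind, column, value_attr):
        # Only emit a step when the criterion actually sets the attribute.
        value = getattr(self._criterion, value_attr, None)
        if value is not None:
            self._steps.append((kind, column, value))
        return self  # returning self is what allows method chaining

    def with_text_filter(self, column, *, value_attr):
        return self._add_if_set("text", column, value_attr)

    def with_numeric_filter(self, column, *, value_attr):
        return self._add_if_set("numeric", column, value_attr)

    def build(self):
        # A plain dict stands in for the real EventPlan type.
        return {
            "criterion_index": self._criterion_index,
            "steps": tuple(self._steps),
        }


# Usage with a stand-in criterion object:
criterion = SimpleNamespace(unique_device_id="abc-123", quantity=None)
plan = (
    DomainPlanBuilder(criterion, 0)
    .with_text_filter("unique_device_id", value_attr="unique_device_id")
    .with_numeric_filter("quantity", value_attr="quantity")
    .build()
)
```

Because unset attributes produce no step, each lowering function reduces to a declarative list of the filters its domain supports.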

)


def lower_location_region(

this function is totally different to the others, maybe the AI got bored?

egillax commented Mar 20, 2026

Thanks @azimov, I'm moving the docs and merging. Then I'll create issues for the other suggestions you made, for follow-up PRs.

@egillax egillax merged commit 39bed84 into develop Mar 20, 2026
9 checks passed
@egillax egillax deleted the features/ibis-new-design-develop branch March 20, 2026 12:43
2 participants