Example cohorts and benchmarks for ibis #22

Open
azimov wants to merge 9 commits into develop from features/ibis-benchmark

azimov commented Mar 18, 2026

@egillax here is a basic benchmark. I've done this in R as that's where people currently interface with it. This is also pointing at your branch as I didn't want to step on any toes.

Core takeaway: when timing the process, running the cohorts is generally faster on ibis in duckdb, but the overall process is slower because of the overhead of generating the SQL. I'm not sure how much we care about that, though. There are also probably significant savings we can make from re-using concept sets (which is in your branch) and other planning optimizations when generating multiple cohorts simultaneously.
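
The split between SQL generation and execution described above can be sketched as follows. This is a minimal illustration, not the benchmark script from this PR: it uses Python's standard-library `sqlite3` as a stand-in for the real backend, and `render_sql`, `timed`, `benchmark`, and the `person` table layout are all hypothetical names chosen for the example.

```python
import sqlite3
import time


def render_sql(min_age: int) -> str:
    """Hypothetical stand-in for the SQL-generation step (not an ibis call)."""
    return (
        "SELECT person_id FROM person "
        f"WHERE age >= {int(min_age)} ORDER BY person_id"
    )


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def benchmark(conn: sqlite3.Connection, min_age: int) -> dict:
    # Phase 1: generate the SQL (client-side overhead only).
    sql, generate_s = timed(render_sql, min_age)
    # Phase 2: execute on the backend and fetch all rows.
    rows, execute_s = timed(lambda: conn.execute(sql).fetchall())
    return {
        "rows": rows,
        "generate_s": generate_s,
        "execute_s": execute_s,
        "total_s": generate_s + execute_s,
    }


def demo() -> dict:
    # Tiny in-memory table, assumed purely for illustration.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE person (person_id INTEGER, age INTEGER)")
    conn.executemany(
        "INSERT INTO person VALUES (?, ?)", [(1, 34), (2, 61), (3, 70)]
    )
    return benchmark(conn, 60)
```

Reporting `generate_s` and `execute_s` side by side, rather than only `total_s`, makes it visible when a faster query is being offset by slower SQL generation.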

I will add a report for databricks shortly

@egillax this shows a ~55% improvement, but it should be noted that this is measured only once the lazy evaluation has been completed. A better approach might be to render the SQL and just compare that. From a user perspective, ibis adds processing overhead that may make cohort generation slower.
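
To illustrate why an execution-only figure can differ from what a user experiences, here is a small sketch of the arithmetic. The function names are hypothetical, and the timings are made-up values chosen only to show the calculation; they are not results from this benchmark.

```python
def percent_improvement(baseline_s: float, candidate_s: float) -> float:
    """Percent reduction in time relative to baseline (negative means slower)."""
    if baseline_s <= 0:
        raise ValueError("baseline_s must be positive")
    return 100.0 * (baseline_s - candidate_s) / baseline_s


def end_to_end_improvement(
    baseline_exec_s: float, candidate_exec_s: float, candidate_overhead_s: float
) -> float:
    """Improvement once the candidate's client-side overhead is added back in."""
    return percent_improvement(
        baseline_exec_s, candidate_exec_s + candidate_overhead_s
    )


# Hypothetical timings in seconds, for illustration only.
execution_only = percent_improvement(10.0, 4.5)
with_overhead = end_to_end_improvement(10.0, 4.5, 7.0)
```

With these illustrative inputs, `execution_only` is 55.0 while `with_overhead` is -15.0, so the same run looks faster or slower depending on whether the generation overhead is counted.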

egillax commented Mar 19, 2026

@azimov did you run both duckdb and databricks on eunomia?

From the numbers at least it looks like small data. Process overhead is probably constant as you scale. But we should definitely take note of it and explore further at some point. I can also test this on postgres and duckdb on real data locally, after this week though, since there's some maintenance going on in our server room.

We should also note cohorts where there are big differences as places to investigate further. I think with the new engine we can have a custom materialization/caching strategy tuned to each backend; some backends may be really good at figuring out the best plan themselves while others are not (looking at you, postgres).
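
One way a per-backend materialization strategy could look is sketched below. Everything here is an assumption made for illustration: the `MATERIALIZE_CONCEPT_SETS` mapping, the `plan_statements` helper, the `{concept_set}` placeholder, and the choice of which backend materializes are hypothetical and not measured or taken from the branch.

```python
# Hypothetical per-backend choice: materialize shared concept sets into a
# temp table, or inline them and leave planning to the backend's optimizer.
MATERIALIZE_CONCEPT_SETS = {
    "postgres": True,
    "duckdb": False,
}


def plan_statements(backend: str, concept_sql: str, cohort_templates: list) -> list:
    """Return the SQL statements to run, in order, for one batch of cohorts.

    Each template must contain a `{concept_set}` placeholder where the
    shared concept set is referenced.
    """
    if MATERIALIZE_CONCEPT_SETS.get(backend, False):
        # Compute the concept set once and reuse it across all cohorts.
        statements = [f"CREATE TEMPORARY TABLE concept_set_tmp AS {concept_sql}"]
        source = "concept_set_tmp"
    else:
        # Inline the concept set as a subquery in every cohort query.
        statements = []
        source = f"({concept_sql})"
    statements += [t.format(concept_set=source) for t in cohort_templates]
    return statements


concept_sql = "SELECT concept_id FROM concept WHERE concept_id IN (1, 2)"
templates = [
    "SELECT COUNT(*) FROM condition_occurrence co "
    "JOIN {concept_set} cs ON co.concept_id = cs.concept_id",
]
postgres_plan = plan_statements("postgres", concept_sql, templates)
duckdb_plan = plan_statements("duckdb", concept_sql, templates)
```

Keeping the decision in a lookup table means the choice for each backend can be revised from benchmark results without touching the query-building code.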

azimov commented Mar 19, 2026

@egillax - the databricks testing is actually on healthverity so this is a meaningful performance increase. I will make the script available separately from the duckdb one - however I had to tweak the ibis code to build the relation and execute it separately, which is kind of messy.

I think performance tuning would be good, but this is something we may not have the ability to set for the user. For example, adaptive query execution and partition coalescing are configurable on spark but most users probably won't be able to adjust these settings.
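
For reference, the two Spark settings mentioned above correspond to the configuration keys `spark.sql.adaptive.enabled` and `spark.sql.adaptive.coalescePartitions.enabled`. The sketch below shows one way to attempt them per session and record which ones are refused. The helper names and the `run_sql` callable are hypothetical (any function that submits one SQL statement to the session), and whether a given environment accepts these `SET` statements is an assumption that would need checking.

```python
SPARK_TUNING = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
}


def set_statements(settings: dict) -> list:
    """Render settings as Spark SQL SET statements, in a stable order."""
    return [f"SET {key}={value}" for key, value in sorted(settings.items())]


def apply_settings(run_sql, settings: dict) -> dict:
    """Try each SET; collect the ones the session rejects instead of failing.

    `run_sql` is a hypothetical callable that executes a single statement.
    """
    rejected = {}
    for statement in set_statements(settings):
        try:
            run_sql(statement)
        except Exception as exc:  # the session may not allow this setting
            rejected[statement] = str(exc)
    return rejected
```

Collecting rejections rather than raising lets cohort generation proceed with the backend defaults when the user lacks permission to change a setting.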

Base automatically changed from features/ibis-new-design-develop to develop March 20, 2026 12:43

2 participants