Conversation
@egillax this shows a ~55% improvement, but it should be noted that this is measured after the lazy evaluation has completed. A better approach might be to render the SQL and just compare that. From a user perspective, ibis adds processing overhead that may make cohort generation slower.
@azimov did you run both duckdb and databricks on eunomia? From the numbers at least it looks like small data. Process overhead is probably constant as you scale, but we should definitely take note of it and explore further at some point. I can also test this on postgres and duckdb on real data locally, after this week though, since there's some maintenance going on in our server room. We should also note cohorts where there are big differences as places to investigate further. I think with the new engine we can have a custom materialization/caching strategy tuned to each backend; some backends may be really good at figuring out the best plan themselves while others are not (looking at you, postgres).
@egillax - the databricks testing is actually on healthverity, so this is a meaningful performance increase. I will make the script available separately from the duckdb one; however, I had to tweak the ibis code to build and execute the relation separately, which is kind of messy. I think performance tuning would be good, but this is something we may not have the ability to set for the user. For example, adaptive query execution and partition coalescing are configurable on spark, but most users probably won't be able to adjust these settings.
@egillax here is a basic benchmark. I've done this in R as that's where people currently interface with it. This is also pointing at your branch as I didn't want to step on any toes.
Core takeaway: when timing the process, running the cohorts is generally faster on ibis in duckdb, but the overall process is slower because of the overhead of generating the SQL. I'm not sure how much we care about that though. There are also probably significant savings we can make from re-using concept sets (which is in your branch) and other planning optimizations when generating multiple cohorts simultaneously.
I will add a report for databricks shortly