75 changes: 75 additions & 0 deletions benchmark_report.md
@@ -0,0 +1,75 @@
# Cohort Generation Execution Benchmark

*Date generated:* 2026-03-18 14:24:28.792419

This report compares the execution time of cohort generation between the traditional **CirceR (Java + T-SQL)** and the new **circe_py (Python + Ibis)** implementation.

### Aggregate Performance
- **CirceR/Java Average Median Time:** 19.30 ms
- **CircePy/Ibis Average Median Time:** 18.85 ms

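> **Conclusion:** Averaged over per-cohort medians, `circe_py` (Ibis) is **faster** than CirceR (Java) by **~2.3%** when evaluating identically generated cohorts. The average hides wide per-cohort variation: `circe_py` has the lower median in 21 of the 30 cohorts, but is roughly 2x to 4x slower on cohorts 1009, 1010, and 1029.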

### Raw Results
| Cohort | Approach | Min (ms) | lq (ms) | Mean (ms) | Median (ms) | uq (ms) | Max (ms) | Neval |
|---|---|---|---|---|---|---|---|---|
| 10.json | CirceR_Java_DBI | 13.90 | 14.10 | 15.17 | 14.40 | 15.90 | 18.44 | 10 |
| | CircePy_Ibis_DuckDB | 4.48 | 4.58 | 5.47 | 5.21 | 6.28 | 7.43 | 10 |
| 100.json | CirceR_Java_DBI | 23.87 | 24.86 | 27.73 | 27.13 | 29.68 | 33.68 | 10 |
| | CircePy_Ibis_DuckDB | 10.41 | 11.21 | 12.46 | 12.40 | 13.83 | 14.40 | 10 |
| 1000.json | CirceR_Java_DBI | 12.55 | 12.67 | 13.46 | 13.01 | 13.90 | 15.70 | 10 |
| | CircePy_Ibis_DuckDB | 2.96 | 3.25 | 3.51 | 3.52 | 3.60 | 4.18 | 10 |
| 1001.json | CirceR_Java_DBI | 13.38 | 13.56 | 14.38 | 14.24 | 14.47 | 17.32 | 10 |
| | CircePy_Ibis_DuckDB | 3.53 | 3.82 | 4.17 | 4.04 | 4.28 | 5.66 | 10 |
| 1002.json | CirceR_Java_DBI | 13.08 | 13.56 | 14.08 | 14.09 | 14.54 | 15.20 | 10 |
| | CircePy_Ibis_DuckDB | 2.93 | 3.01 | 3.35 | 3.39 | 3.65 | 3.78 | 10 |
| 1003.json | CirceR_Java_DBI | 10.94 | 11.31 | 12.26 | 11.55 | 12.06 | 15.87 | 10 |
| | CircePy_Ibis_DuckDB | 4.93 | 5.53 | 6.57 | 5.68 | 6.88 | 12.75 | 10 |
| 1004.json | CirceR_Java_DBI | 11.22 | 11.73 | 13.00 | 12.40 | 13.43 | 16.80 | 10 |
| | CircePy_Ibis_DuckDB | 4.64 | 5.08 | 5.50 | 5.44 | 5.65 | 7.15 | 10 |
| 1005.json | CirceR_Java_DBI | 14.40 | 14.71 | 16.63 | 16.11 | 18.71 | 19.41 | 10 |
| | CircePy_Ibis_DuckDB | 3.87 | 4.15 | 4.65 | 4.29 | 4.45 | 8.27 | 10 |
| 1006.json | CirceR_Java_DBI | 17.88 | 19.09 | 23.73 | 22.73 | 27.82 | 36.86 | 10 |
| | CircePy_Ibis_DuckDB | 4.05 | 4.36 | 5.27 | 5.19 | 5.70 | 7.24 | 10 |
| 1007.json | CirceR_Java_DBI | 27.90 | 29.12 | 31.12 | 29.98 | 31.89 | 37.63 | 10 |
| | CircePy_Ibis_DuckDB | 19.45 | 19.87 | 24.08 | 23.85 | 25.21 | 34.15 | 10 |
| 1009.json | CirceR_Java_DBI | 51.96 | 52.18 | 54.75 | 53.13 | 57.70 | 60.11 | 10 |
| | CircePy_Ibis_DuckDB | 109.65 | 118.49 | 126.79 | 123.25 | 135.94 | 146.27 | 10 |
| 1010.json | CirceR_Java_DBI | 15.56 | 15.90 | 16.91 | 16.52 | 17.45 | 20.41 | 10 |
| | CircePy_Ibis_DuckDB | 32.26 | 34.48 | 36.62 | 35.08 | 37.44 | 47.88 | 10 |
| 1011.json | CirceR_Java_DBI | 11.31 | 11.76 | 12.08 | 12.06 | 12.42 | 13.06 | 10 |
| | CircePy_Ibis_DuckDB | 4.98 | 5.18 | 5.56 | 5.59 | 6.04 | 6.15 | 10 |
| 1012.json | CirceR_Java_DBI | 11.65 | 11.82 | 12.64 | 12.12 | 12.89 | 16.19 | 10 |
| | CircePy_Ibis_DuckDB | 5.59 | 5.81 | 6.13 | 6.11 | 6.27 | 6.86 | 10 |
| 1013.json | CirceR_Java_DBI | 11.42 | 11.77 | 12.29 | 12.32 | 12.92 | 13.18 | 10 |
| | CircePy_Ibis_DuckDB | 4.86 | 5.37 | 5.48 | 5.41 | 5.74 | 6.14 | 10 |
| 1016.json | CirceR_Java_DBI | 14.41 | 14.75 | 18.62 | 16.87 | 20.66 | 30.34 | 10 |
| | CircePy_Ibis_DuckDB | 7.85 | 9.19 | 10.87 | 9.60 | 13.18 | 18.71 | 10 |
| 1017.json | CirceR_Java_DBI | 11.96 | 12.23 | 12.91 | 12.67 | 13.50 | 15.13 | 10 |
| | CircePy_Ibis_DuckDB | 6.49 | 6.75 | 7.40 | 7.11 | 7.35 | 10.22 | 10 |
| 1018.json | CirceR_Java_DBI | 11.65 | 12.45 | 14.46 | 13.53 | 15.68 | 21.79 | 10 |
| | CircePy_Ibis_DuckDB | 4.56 | 5.07 | 7.10 | 5.34 | 8.30 | 12.47 | 10 |
| 1019.json | CirceR_Java_DBI | 16.19 | 17.09 | 18.92 | 17.54 | 19.98 | 25.55 | 10 |
| | CircePy_Ibis_DuckDB | 16.94 | 17.45 | 19.03 | 18.22 | 21.05 | 22.03 | 10 |
| 1020.json | CirceR_Java_DBI | 26.52 | 28.96 | 37.04 | 34.96 | 44.87 | 51.18 | 10 |
| | CircePy_Ibis_DuckDB | 18.99 | 19.20 | 22.83 | 19.41 | 24.89 | 35.34 | 10 |
| 1021.json | CirceR_Java_DBI | 21.82 | 24.82 | 33.44 | 29.32 | 42.61 | 51.70 | 10 |
| | CircePy_Ibis_DuckDB | 40.08 | 46.80 | 55.84 | 53.25 | 69.19 | 72.25 | 10 |
| 1022.json | CirceR_Java_DBI | 19.17 | 21.31 | 27.71 | 22.62 | 35.44 | 49.43 | 10 |
| | CircePy_Ibis_DuckDB | 24.18 | 24.78 | 27.38 | 26.16 | 28.88 | 36.16 | 10 |
| 1023.json | CirceR_Java_DBI | 25.40 | 27.33 | 28.62 | 28.36 | 29.49 | 32.65 | 10 |
| | CircePy_Ibis_DuckDB | 38.29 | 38.63 | 40.16 | 39.66 | 40.84 | 45.80 | 10 |
| 1024.json | CirceR_Java_DBI | 17.94 | 19.07 | 22.18 | 20.38 | 24.48 | 32.79 | 10 |
| | CircePy_Ibis_DuckDB | 21.80 | 22.29 | 24.13 | 22.64 | 23.39 | 36.53 | 10 |
| 1025.json | CirceR_Java_DBI | 29.39 | 30.29 | 31.92 | 30.97 | 32.82 | 37.89 | 10 |
| | CircePy_Ibis_DuckDB | 34.85 | 36.47 | 39.93 | 38.42 | 43.38 | 52.41 | 10 |
| 1026.json | CirceR_Java_DBI | 13.14 | 14.88 | 20.48 | 16.16 | 25.98 | 43.50 | 10 |
| | CircePy_Ibis_DuckDB | 5.78 | 6.12 | 6.71 | 6.39 | 7.23 | 8.68 | 10 |
| 1027.json | CirceR_Java_DBI | 11.86 | 12.30 | 12.68 | 12.70 | 13.13 | 13.49 | 10 |
| | CircePy_Ibis_DuckDB | 5.24 | 5.54 | 5.69 | 5.68 | 5.88 | 6.15 | 10 |
| 1028.json | CirceR_Java_DBI | 12.69 | 13.13 | 13.69 | 13.37 | 13.62 | 16.58 | 10 |
| | CircePy_Ibis_DuckDB | 4.22 | 4.68 | 4.99 | 4.95 | 5.48 | 5.89 | 10 |
| 1029.json | CirceR_Java_DBI | 13.26 | 13.40 | 15.09 | 14.42 | 14.89 | 22.23 | 10 |
| | CircePy_Ibis_DuckDB | 51.15 | 51.99 | 56.75 | 53.93 | 55.72 | 78.29 | 10 |
| 1030.json | CirceR_Java_DBI | 12.27 | 12.96 | 13.62 | 13.43 | 14.31 | 15.24 | 10 |
| | CircePy_Ibis_DuckDB | 5.82 | 6.05 | 6.33 | 6.38 | 6.54 | 6.76 | 10 |
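
The aggregate figures in this report can be reproduced from the Median column. The sketch below is a minimal check, not part of the benchmark code: it copies the 30 median pairs from the Raw Results table, recomputes the two averages and their relative difference, and counts the cohorts where `circe_py` has the lower median. The names `MEDIANS_MS` and `summarize` are illustrative.

```python
from statistics import mean

# Median times in ms, copied from the Raw Results table:
# cohort -> (CirceR_Java_DBI median, CircePy_Ibis_DuckDB median)
MEDIANS_MS = {
    "10": (14.40, 5.21), "100": (27.13, 12.40), "1000": (13.01, 3.52),
    "1001": (14.24, 4.04), "1002": (14.09, 3.39), "1003": (11.55, 5.68),
    "1004": (12.40, 5.44), "1005": (16.11, 4.29), "1006": (22.73, 5.19),
    "1007": (29.98, 23.85), "1009": (53.13, 123.25), "1010": (16.52, 35.08),
    "1011": (12.06, 5.59), "1012": (12.12, 6.11), "1013": (12.32, 5.41),
    "1016": (16.87, 9.60), "1017": (12.67, 7.11), "1018": (13.53, 5.34),
    "1019": (17.54, 18.22), "1020": (34.96, 19.41), "1021": (29.32, 53.25),
    "1022": (22.62, 26.16), "1023": (28.36, 39.66), "1024": (20.38, 22.64),
    "1025": (30.97, 38.42), "1026": (16.16, 6.39), "1027": (12.70, 5.68),
    "1028": (13.37, 4.95), "1029": (14.42, 53.93), "1030": (13.43, 6.38),
}


def summarize(medians):
    """Return (java average, py average, percent faster, cohorts won by py)."""
    java_avg = mean(java for java, _ in medians.values())
    py_avg = mean(py for _, py in medians.values())
    pct_faster = (java_avg - py_avg) / java_avg * 100
    py_wins = sum(1 for java, py in medians.values() if py < java)
    return java_avg, py_avg, pct_faster, py_wins


java_avg, py_avg, pct_faster, py_wins = summarize(MEDIANS_MS)
print(f"CirceR average median:  {java_avg:.2f} ms")
print(f"CircePy average median: {py_avg:.2f} ms")
print(f"CircePy faster by:      {pct_faster:.1f}%")
print(f"CircePy lower median in {py_wins} of {len(MEDIANS_MS)} cohorts")
```

Because an average of medians weights every cohort equally, a few slow cohorts such as 1009 can offset many small wins, which is why the per-cohort count is reported alongside it.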
42 changes: 42 additions & 0 deletions benchmark_report_databricks.md
@egillax this shows a ~55% improvement, but note that it measures only execution after the lazy Ibis relation has already been built. A better approach might be to render the SQL and compare just that. From a user's perspective, Ibis adds processing overhead that may make cohort generation slower end to end.
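
One way to make that overhead visible is to time the build step and the execute step separately and report both alongside the end-to-end figure. The following is a minimal sketch using only the standard library; `time_phases`, `toy_build`, and `toy_execute` are hypothetical stand-ins, not functions from `circe_py` or the benchmark script.

```python
import time
from statistics import median


def time_phases(build, execute, repeats=10):
    """Time a build (compile) step and an execute step separately.

    `build()` returns a prepared query or relation and `execute(prepared)`
    runs it. Both are caller-supplied, hypothetical callables.
    """
    build_ms, exec_ms = [], []
    for _ in range(repeats):
        t0 = time.perf_counter()
        prepared = build()
        t1 = time.perf_counter()
        execute(prepared)
        t2 = time.perf_counter()
        build_ms.append((t1 - t0) * 1000)
        exec_ms.append((t2 - t1) * 1000)
    return {
        "build_ms": median(build_ms),
        "execute_ms": median(exec_ms),
        # What a user actually waits for: build plus execute per iteration.
        "end_to_end_ms": median(b + e for b, e in zip(build_ms, exec_ms)),
    }


# Toy stand-ins so the sketch runs without a database connection.
def toy_build():
    return "SELECT " + ", ".join(str(i) for i in range(1000))


def toy_execute(sql):
    return len(sql)


result = time_phases(toy_build, toy_execute, repeats=5)
print(result)
```

In a real run, `build` would wrap the relation or SQL generation for each implementation and `execute` would wrap the write to the cohort table, so the two columns can be compared independently.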

@@ -0,0 +1,42 @@
# Cohort Generation Execution Benchmark (Databricks)

*Date generated:* 2026-03-18 19:13:18.096735

This report compares the execution time of cohort generation on **Databricks** between the traditional **CirceR (Java + SqlRender)** and the new **circe_py (Python + Ibis)** implementation.

## Benchmark Configuration

- **Pre-compilation**: Both approaches pre-compile SQL/relations before execution
- **Validation**: Python uses `skip_validation=TRUE` to bypass table/row checks (matches Java behavior)
- **Spark Optimizations**: Adaptive query execution, partition coalescing, and broadcast joins enabled
- **Iterations**: Each cohort benchmarked once per approach (Neval = 1) with identical parameters, so Min, Median, and Max coincide in the results table
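
The report does not list the exact Spark settings behind the "Spark Optimizations" bullet. As a hedged illustration only, the sketch below shows the standard Spark SQL configuration keys that usually correspond to adaptive query execution, partition coalescing, and broadcast joins; the values, `ASSUMED_SPARK_CONF`, and `apply_conf` are assumptions rather than the benchmark's actual configuration.

```python
# Assumed settings: the report names the features but not the keys or values.
ASSUMED_SPARK_CONF = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    # Size threshold in bytes below which a table may be broadcast; -1 disables.
    "spark.sql.autoBroadcastJoinThreshold": str(10 * 1024 * 1024),
}


def apply_conf(set_conf, conf=ASSUMED_SPARK_CONF):
    """Apply each setting through a setter such as PySpark's `spark.conf.set`."""
    for key, value in conf.items():
        set_conf(key, value)


# Stand-in setter so the sketch runs without a Spark session.
applied = {}
apply_conf(applied.__setitem__)
print(applied)
```

With a live session, passing `spark.conf.set` as the setter would apply the same keys to the session.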

### Aggregate Performance
- **CirceR/Java Average Median Time:** 48.78 ms
- **CircePy/Ibis Average Median Time:** 21.62 ms

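> **Conclusion:** Averaged over per-cohort medians, `circe_py` (Ibis) is **faster** than CirceR (Java + `SqlRender`) by **~55.7%** when evaluating identically generated cohorts, with the lower time in all 10 cohorts. Each figure comes from a single run.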

### Raw Results
| Cohort | Approach | Min (ms) | lq (ms) | Mean (ms) | Median (ms) | uq (ms) | Max (ms) | Neval |
|---|---|---|---|---|---|---|---|---|
| 10.json | CirceR_Java_Databricks | 53.26 | 53.26 | 53.26 | 53.26 | 53.26 | 53.26 | 1 |
| | CircePy_Ibis_Databricks | 41.86 | 41.86 | 41.86 | 41.86 | 41.86 | 41.86 | 1 |
| 100.json | CirceR_Java_Databricks | 50.24 | 50.24 | 50.24 | 50.24 | 50.24 | 50.24 | 1 |
| | CircePy_Ibis_Databricks | 26.66 | 26.66 | 26.66 | 26.66 | 26.66 | 26.66 | 1 |
| 1000.json | CirceR_Java_Databricks | 50.52 | 50.52 | 50.52 | 50.52 | 50.52 | 50.52 | 1 |
| | CircePy_Ibis_Databricks | 15.47 | 15.47 | 15.47 | 15.47 | 15.47 | 15.47 | 1 |
| 1001.json | CirceR_Java_Databricks | 59.03 | 59.03 | 59.03 | 59.03 | 59.03 | 59.03 | 1 |
| | CircePy_Ibis_Databricks | 37.61 | 37.61 | 37.61 | 37.61 | 37.61 | 37.61 | 1 |
| 1002.json | CirceR_Java_Databricks | 33.83 | 33.83 | 33.83 | 33.83 | 33.83 | 33.83 | 1 |
| | CircePy_Ibis_Databricks | 9.94 | 9.94 | 9.94 | 9.94 | 9.94 | 9.94 | 1 |
| 1003.json | CirceR_Java_Databricks | 33.42 | 33.42 | 33.42 | 33.42 | 33.42 | 33.42 | 1 |
| | CircePy_Ibis_Databricks | 11.35 | 11.35 | 11.35 | 11.35 | 11.35 | 11.35 | 1 |
| 1004.json | CirceR_Java_Databricks | 38.99 | 38.99 | 38.99 | 38.99 | 38.99 | 38.99 | 1 |
| | CircePy_Ibis_Databricks | 11.68 | 11.68 | 11.68 | 11.68 | 11.68 | 11.68 | 1 |
| 1005.json | CirceR_Java_Databricks | 55.79 | 55.79 | 55.79 | 55.79 | 55.79 | 55.79 | 1 |
| | CircePy_Ibis_Databricks | 20.65 | 20.65 | 20.65 | 20.65 | 20.65 | 20.65 | 1 |
| 1006.json | CirceR_Java_Databricks | 69.09 | 69.09 | 69.09 | 69.09 | 69.09 | 69.09 | 1 |
| | CircePy_Ibis_Databricks | 19.52 | 19.52 | 19.52 | 19.52 | 19.52 | 19.52 | 1 |
| 1007.json | CirceR_Java_Databricks | 43.67 | 43.67 | 43.67 | 43.67 | 43.67 | 43.67 | 1 |
| | CircePy_Ibis_Databricks | 21.49 | 21.49 | 21.49 | 21.49 | 21.49 | 21.49 | 1 |
132 changes: 130 additions & 2 deletions circe/execution/api.py
@@ -84,6 +84,30 @@ def write_relation(
) from exc


def build_cohort_relation(
    expression: CohortExpression,
    *,
    backend: IbisBackendLike,
    cdm_schema: str,
    cohort_id: int,
    results_schema: str | None = None,
    vocabulary_schema: str | None = None,
) -> Table:
    """Build cohort relation and project to OHDSI format without materializing.

    This returns a lazy Ibis expression ready to be written to a table.
    Useful for benchmarking or when you want to separate compilation from execution.
    """
    cohort_expr = build_cohort(
        expression,
        backend=backend,
        cdm_schema=cdm_schema,
        results_schema=results_schema,
        vocabulary_schema=vocabulary_schema,
    )
    return project_to_ohdsi_cohort_table(cohort_expr, cohort_id=cohort_id)


def write_cohort(
    expression: CohortExpression,
    *,
@@ -99,14 +123,14 @@ def write_cohort(
     if if_exists not in {"fail", "replace"}:
         raise ValueError("if_exists must be one of {'fail', 'replace'} for write_cohort.")

-    new_rows = build_cohort(
+    new_rows = build_cohort_relation(
         expression,
         backend=backend,
         cdm_schema=cdm_schema,
+        cohort_id=cohort_id,
         results_schema=results_schema,
         vocabulary_schema=vocabulary_schema,
     )
-    new_rows = project_to_ohdsi_cohort_table(new_rows, cohort_id=cohort_id)

     if not table_exists(backend, table_name=cohort_table, schema=results_schema):
         write_relation(
@@ -161,3 +185,107 @@ def write_cohort(
        target_schema=results_schema,
        if_exists="replace",
    )


def write_cohort_relation(
    cohort_relation: Table,
    *,
    backend: IbisBackendLike,
    cohort_table: str,
    cohort_id: int,
    results_schema: str | None = None,
    if_exists: Literal["fail", "replace", "append"] = "fail",
    skip_validation: bool = False,
) -> None:
    """Write a pre-built cohort relation to a table.

    This function takes a cohort relation (from build_cohort_relation) and writes it
    to the target table. Useful for benchmarking when you want to separate compilation
    from execution.

    Args:
        cohort_relation: Pre-built cohort relation (from build_cohort_relation)
        backend: Ibis backend connection
        cohort_table: Name of the target cohort table
        cohort_id: Cohort definition ID (for checking conflicts)
        results_schema: Schema containing the cohort table
        if_exists: What to do if cohort rows exist: "fail", "replace", or "append"
        skip_validation: Skip table existence and cohort row checks (faster, use for benchmarking)
    """
    if if_exists not in {"fail", "replace", "append"}:
        raise ValueError("if_exists must be one of {'fail', 'replace', 'append'} for write_cohort_relation.")

    # Fast path for benchmarking: skip all validation checks
    if skip_validation:
        insert_relation(
            cohort_relation,
            backend=backend,
            target_table=cohort_table,
            target_schema=results_schema,
        )
        return

    if not table_exists(backend, table_name=cohort_table, schema=results_schema):
        write_relation(
            cohort_relation,
            backend=backend,
            target_table=cohort_table,
            target_schema=results_schema,
            if_exists="fail",
        )
        return

    if if_exists == "append":
        # Just append without checking for existing rows
        insert_relation(
            cohort_relation,
            backend=backend,
            target_table=cohort_table,
            target_schema=results_schema,
        )
        return

    if if_exists == "fail":
        if cohort_rows_exist(
            backend,
            cohort_table=cohort_table,
            results_schema=results_schema,
            cohort_id=cohort_id,
        ):
            raise ExecutionError(
                "Ibis executor write error: cohort table "
                f"'{cohort_table}' already contains rows for cohort_id={cohort_id}."
            )
        insert_relation(
            cohort_relation,
            backend=backend,
            target_table=cohort_table,
            target_schema=results_schema,
        )
        return

    # if_exists == "replace"
    if supports_transactional_replace(backend):
        replace_cohort_rows_transactionally(
            cohort_relation,
            backend=backend,
            cohort_table=cohort_table,
            results_schema=results_schema,
            cohort_id=cohort_id,
        )
        return

    existing = read_table(
        backend,
        table_name=cohort_table,
        schema=results_schema,
    )
    filtered = exclude_cohort_rows(existing, cohort_id=cohort_id)
    relation = filtered.union(cohort_relation, distinct=False)
    write_relation(
        relation,
        backend=backend,
        target_table=cohort_table,
        target_schema=results_schema,
        if_exists="replace",
    )
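
The `if_exists` branches of `write_cohort_relation` can be summarized with a small in-memory model. This is a sketch under stated assumptions, not the library's implementation: tables are plain lists of dict rows, `write_rows` is a hypothetical helper, the `cohort_definition_id` column name is assumed from the OHDSI cohort table convention, and the `skip_validation` fast path and transactional replace are not modeled.

```python
def write_rows(table, new_rows, *, cohort_id, if_exists="fail"):
    """Return the table contents after writing `new_rows` for `cohort_id`.

    `table` is a list of dict rows, or None when the table does not exist yet.
    """
    if if_exists not in {"fail", "replace", "append"}:
        raise ValueError("if_exists must be one of 'fail', 'replace', 'append'")

    new_rows = list(new_rows)
    if table is None:
        # Table missing: create it from the new rows, whatever if_exists says.
        return new_rows
    if if_exists == "append":
        # Append without checking for existing rows of this cohort.
        return table + new_rows

    has_rows = any(row["cohort_definition_id"] == cohort_id for row in table)
    if if_exists == "fail":
        if has_rows:
            raise RuntimeError(f"table already contains rows for cohort_id={cohort_id}")
        return table + new_rows

    # replace: keep other cohorts' rows, then add the new rows (UNION ALL).
    kept = [row for row in table if row["cohort_definition_id"] != cohort_id]
    return kept + new_rows


existing = [
    {"cohort_definition_id": 1, "subject_id": 10},
    {"cohort_definition_id": 2, "subject_id": 20},
]
fresh = [{"cohort_definition_id": 1, "subject_id": 11}]

replaced = write_rows(existing, fresh, cohort_id=1, if_exists="replace")
appended = write_rows(existing, fresh, cohort_id=1, if_exists="append")
created = write_rows(None, fresh, cohort_id=1)
print(replaced)
```

The replace branch mirrors the non-transactional fallback in the diff: rows for other cohorts survive, and only the target cohort's rows are swapped out.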