Commit 9ece92f
Bug/string agg limit (#133)
* bug/string-agg-limit
* update changelog & regen docs
* Update int_jira__pivot_daily_field_history.sql
* update incremental compatible and regen docs
* regen docs
* update yml
* add tests
* update conversation disablement
* regen docs
* Update int_jira__issue_comments.sql
* update readme
* release review updates
* update readme
* Apply suggestions from code review

Co-authored-by: Jamie Rodriguez <[email protected]>
1 parent 9474332 · commit 9ece92f

16 files changed: +163 -25 lines

CHANGELOG.md (+15)

@@ -1,3 +1,18 @@
+# dbt_jira v0.19.0
+[PR #133](https://github.com/fivetran/dbt_jira/pull/133) contains the following updates:
+
+## Breaking Changes
+- This change is marked as breaking due to its impact on Redshift configurations.
+- For Redshift users, comment data aggregated under the `conversations` field in the `jira__issue_enhanced` table is now disabled by default to prevent consistent errors related to Redshift's varchar length limits.
+  - If you wish to re-enable `conversations` on Redshift, set the `jira_include_conversations` variable to `true` in your `dbt_project.yml`.
+
+## Under the Hood
+- Updated the `comment` seed data to ensure conversations are correctly disabled for Redshift by default.
+- Renamed the `jira_is_databricks_sql_warehouse` macro to `jira_is_incremental_compatible`, which was updated to return `true` if the Databricks runtime is an all-purpose cluster (previously it checked only for a SQL warehouse runtime) or if the target is any other non-Databricks-supported destination.
+  - This update addresses Databricks runtimes (e.g., endpoints and external runtimes) that do not support the `insert_overwrite` incremental strategy used in the `jira__daily_issue_field_history` and `int_jira__pivot_daily_field_history` models.
+  - For Databricks users, the `jira__daily_issue_field_history` and `int_jira__pivot_daily_field_history` models will now apply the incremental strategy only if running on an all-purpose cluster. All other Databricks runtimes will not utilize an incremental strategy.
+- Added consistency tests for the `jira__project_enhanced` and `jira__user_enhanced` models.
+
 # dbt_jira v0.18.0
 [PR #131](https://github.com/fivetran/dbt_jira/pull/131) contains the following updates:
 ## Breaking Changes
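For reference, re-enabling conversations on Redshift as described in the breaking change above would be a small addition to your root `dbt_project.yml`. A minimal sketch, using only the variable named in this changelog:

```yml
vars:
    jira_include_conversations: true # re-enables the `conversations` aggregation on Redshift (disabled there by default as of v0.19.0)
```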

README.md (+22 -6)

@@ -66,7 +66,7 @@ Include the following jira package version in your `packages.yml` file:
 ```yaml
 packages:
   - package: fivetran/jira
-    version: [">=0.18.0", "<0.19.0"]
+    version: [">=0.19.0", "<0.20.0"]
 
 ```
 ### Step 3: Define database and schema variables
@@ -82,14 +82,30 @@ vars:
 Your Jira connector may not sync every table that this package expects. If you do not have the `SPRINT`, `COMPONENT`, or `VERSION` tables synced, add the respective variables to your root `dbt_project.yml` file. Additionally, if you want to remove comment aggregations from your `jira__issue_enhanced` model, add the `jira_include_comments` variable to your root `dbt_project.yml`:
 ```yml
 vars:
-    jira_using_sprints: false # Disable if you do not have the sprint table or do not want sprint-related metrics reported
-    jira_using_components: false # Disable if you do not have the component table or do not want component-related metrics reported
-    jira_using_versions: false # Disable if you do not have the versions table or do not want versions-related metrics reported
-    jira_using_priorities: false # disable if you are not using priorities in Jira
-    jira_include_comments: false # This package aggregates issue comments so that you have a single view of all your comments in the jira__issue_enhanced table. This can cause limit errors if you have a large dataset. Disable to remove this functionality.
+    jira_using_sprints: false # Enabled by default. Disable if you do not have the sprint table or do not want sprint-related metrics reported.
+    jira_using_components: false # Enabled by default. Disable if you do not have the component table or do not want component-related metrics reported.
+    jira_using_versions: false # Enabled by default. Disable if you do not have the versions table or do not want versions-related metrics reported.
+    jira_using_priorities: false # Enabled by default. Disable if you are not using priorities in Jira.
+    jira_include_comments: false # Enabled by default. Disabling will remove the aggregation of comments via the `count_comments` and `conversations` columns in the `jira__issue_enhanced` table.
 ```
+
 ### (Optional) Step 5: Additional configurations
 
+#### Controlling conversation aggregations in `jira__issue_enhanced`
+
+The `dbt_jira` package offers variables to enable or disable conversation aggregations in the `jira__issue_enhanced` table. These settings allow you to manage the amount of data processed and avoid potential performance or limit issues with large datasets.
+
+- `jira_include_conversations`: Controls only the `conversation` [column](https://github.com/fivetran/dbt_jira/blob/main/models/jira.yml#L125-L127) in the `jira__issue_enhanced` table.
+  - Default: Disabled for Redshift due to string size constraints; enabled for other supported warehouses.
+  - Setting this to `false` removes the `conversation` column but retains the `count_comments` field if `jira_include_comments` is still enabled. This is useful if you want a comment count without the full conversation details.
+
+In your `dbt_project.yml` file:
+
+```yml
+vars:
+    jira_include_conversations: false/true # Disabled by default for Redshift; enabled for other supported warehouses.
+```
+
 #### Define daily issue field history columns
 The `jira__daily_issue_field_history` model generates historical data for the columns specified by the `issue_field_history_columns` variable. By default, the only columns tracked are `status`, `status_id`, and `sprint`, but all fields found in the Jira `FIELD` table's `field_name` column can be included in this model. The most recent value of any tracked column is also captured in `jira__issue_enhanced`.
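As a quick illustration of overriding this variable (a minimal sketch; the column list simply mirrors the one used in this package's integration tests later in this commit, and any `field_name` values from the Jira `FIELD` table may be substituted):

```yml
vars:
    issue_field_history_columns: ['summary', 'story points', 'components'] # any `field_name` values from the Jira FIELD table
```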

dbt_project.yml (+1 -1)

@@ -1,5 +1,5 @@
 name: 'jira'
-version: '0.18.0'
+version: '0.19.0'
 config-version: 2
 require-dbt-version: [">=1.3.0", "<2.0.0"]
 vars:

docs/catalog.json (+1 -1)

Large diffs are not rendered by default.

docs/manifest.json (+1 -1)

Large diffs are not rendered by default.

integration_tests/dbt_project.yml (+3 -4)

@@ -1,9 +1,11 @@
 name: 'jira_integration_tests'
-version: '0.18.0'
+version: '0.19.0'
 config-version: 2
 profile: 'integration_tests'
 
 vars:
+  # Comment out the below when generating docs
+  issue_field_history_columns: ['summary', 'story points', 'components']
   jira_source:
     jira_schema: jira_integrations_tests_41
     jira_comment_identifier: "comment"
@@ -28,9 +30,6 @@ vars:
     jira_user_identifier: "user"
     jira_version_identifier: "version"
 
-  # Comment out the below when generating docs
-  issue_field_history_columns: ['summary', 'story points', 'components']
-
 models:
   jira:
     +schema: "{{ 'jira_integrations_tests_sqlw' if target.name == 'databricks-sql' else 'jira' }}"

integration_tests/seeds/comment.csv (+1)

Large diffs are not rendered by default.

integration_tests/tests/consistency/consistency_issue_enhanced.sql (+4 -2)

@@ -4,13 +4,15 @@
         enabled=var('fivetran_validation_tests_enabled', false)
     ) }}
 
+{# Exclude columns that depend on calculations involving the current time in seconds or aggregate strings in a random order, as they will differ between runs. #}
+{% set exclude_columns = ['open_duration_seconds', 'any_assignment_duration_seconds', 'last_assignment_duration_seconds'] %}
 with prod as (
-    select *
+    select {{ dbt_utils.star(from=ref('jira__issue_enhanced'), except=exclude_columns) }}
     from {{ target.schema }}_jira_prod.jira__issue_enhanced
 ),
 
 dev as (
-    select *
+    select {{ dbt_utils.star(from=ref('jira__issue_enhanced'), except=exclude_columns) }}
     from {{ target.schema }}_jira_dev.jira__issue_enhanced
 ),

integration_tests/tests/consistency/consistency_project_enhanced.sql (new file, +49)

@@ -0,0 +1,49 @@
+
+{{ config(
+    tags="fivetran_validations",
+    enabled=var('fivetran_validation_tests_enabled', false)
+) }}
+
+{# Exclude columns that depend on calculations involving the current time in seconds or aggregate strings in a random order, as they will differ between runs. #}
+{% set exclude_columns = ['avg_age_currently_open_seconds', 'avg_age_currently_open_assigned_seconds', 'median_age_currently_open_seconds', 'median_age_currently_open_assigned_seconds', 'epics', 'components'] %}
+
+with prod as (
+    select {{ dbt_utils.star(from=ref('jira__project_enhanced'), except=exclude_columns) }}
+    from {{ target.schema }}_jira_prod.jira__project_enhanced
+),
+
+dev as (
+    select {{ dbt_utils.star(from=ref('jira__project_enhanced'), except=exclude_columns) }}
+    from {{ target.schema }}_jira_dev.jira__project_enhanced
+),
+
+prod_not_in_dev as (
+    -- rows from prod not found in dev
+    select * from prod
+    except distinct
+    select * from dev
+),
+
+dev_not_in_prod as (
+    -- rows from dev not found in prod
+    select * from dev
+    except distinct
+    select * from prod
+),
+
+final as (
+    select
+        *,
+        'from prod' as source
+    from prod_not_in_dev
+
+    union all -- union since we only care if rows are produced
+
+    select
+        *,
+        'from dev' as source
+    from dev_not_in_prod
+)
+
+select *
+from final

integration_tests/tests/consistency/consistency_user_enhanced.sql (new file, +49)

@@ -0,0 +1,49 @@
+
+{{ config(
+    tags="fivetran_validations",
+    enabled=var('fivetran_validation_tests_enabled', false)
+) }}
+
+{# Exclude columns that depend on calculations involving the current time in seconds or aggregate strings in a random order, as they will differ between runs. #}
+{% set exclude_columns = ['avg_age_currently_open_seconds', 'median_age_currently_open_seconds', 'projects'] %}
+
+with prod as (
+    select {{ dbt_utils.star(from=ref('jira__user_enhanced'), except=exclude_columns) }}
+    from {{ target.schema }}_jira_prod.jira__user_enhanced
+),
+
+dev as (
+    select {{ dbt_utils.star(from=ref('jira__user_enhanced'), except=exclude_columns) }}
+    from {{ target.schema }}_jira_dev.jira__user_enhanced
+),
+
+prod_not_in_dev as (
+    -- rows from prod not found in dev
+    select * from prod
+    except distinct
+    select * from dev
+),
+
+dev_not_in_prod as (
+    -- rows from dev not found in prod
+    select * from dev
+    except distinct
+    select * from prod
+),
+
+final as (
+    select
+        *,
+        'from prod' as source
+    from prod_not_in_dev
+
+    union all -- union since we only care if rows are produced
+
+    select
+        *,
+        'from dev' as source
+    from dev_not_in_prod
+)
+
+select *
+from final

macros/jira_is_databricks_sql_warehouse.sql → macros/jira_is_incremental_compatible.sql (renamed, +4 -2)

@@ -1,14 +1,16 @@
-{% macro jira_is_databricks_sql_warehouse() %}
+{% macro jira_is_incremental_compatible() %}
 {% if target.type in ('databricks') %}
     {% set re = modules.re %}
     {% set path_match = target.http_path %}
-    {% set regex_pattern = "sql/.+/warehouses/" %}
+    {% set regex_pattern = "sql/protocol" %}
     {% set match_result = re.search(regex_pattern, path_match) %}
     {% if match_result %}
        {{ return(True) }}
     {% else %}
        {{ return(False) }}
     {% endif %}
+{% elif target.type in ('bigquery','snowflake','postgres','redshift','sqlserver') %}
+    {{ return(True) }}
 {% else %}
     {{ return(False) }}
 {% endif %}
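For context, the new pattern keys off the Databricks target's `http_path`: HTTP paths for all-purpose (interactive) clusters typically start with `sql/protocolv1/...`, while SQL warehouse paths (`/sql/1.0/warehouses/...`) do not contain `sql/protocol`. A hypothetical `profiles.yml` target illustrating which runtimes the macro treats as incremental-compatible (placeholder host, token, and IDs; only `http_path` matters to the macro):

```yml
jira:
  target: databricks
  outputs:
    databricks:
      type: databricks
      schema: jira
      host: <workspace>.cloud.databricks.com
      token: <personal-access-token>
      # All-purpose cluster path matches "sql/protocol" -> macro returns true -> incremental materialization
      http_path: sql/protocolv1/o/1234567890123456/0123-456789-abc123
      # A SQL warehouse path would not match -> macro returns false -> table materialization
      # http_path: /sql/1.0/warehouses/abcdef1234567890
```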

models/intermediate/field_history/int_jira__pivot_daily_field_history.sql (+2 -2)

@@ -1,6 +1,6 @@
 {{
     config(
-        materialized='table' if jira.jira_is_databricks_sql_warehouse() else 'incremental',
+        materialized='incremental' if jira_is_incremental_compatible() else 'table',
         partition_by = {'field': 'valid_starting_at_week', 'data_type': 'date'}
             if target.type not in ['spark','databricks'] else ['valid_starting_at_week'],
         cluster_by = ['valid_starting_at_week'],
@@ -176,4 +176,4 @@ final as (
 )
 
 select *
-from final
+from final

models/intermediate/int_jira__issue_comments.sql (+8 -4)

@@ -20,14 +20,18 @@ agg_comments as (
 
     select
         comment.issue_id,
-        {{ fivetran_utils.string_agg( "comment.created_at || ' - ' || jira_user.user_display_name || ': ' || comment.body", "'\\n'" ) }} as conversation,
         count(comment.comment_id) as count_comments
 
-    from
-        comment
+    {%- if var('jira_include_conversations', False if target.type == 'redshift' else True) %}
+        ,{{ fivetran_utils.string_agg(
+            "comment.created_at || ' - ' || jira_user.user_display_name || ': ' || comment.body",
+            "'\\n'" ) }} as conversation
+    {% endif %}
+
+    from comment
     join jira_user on comment.author_user_id = jira_user.user_id
 
     group by 1
 )
 
-select * from agg_comments
+select * from agg_comments

models/intermediate/int_jira__issue_join.sql (+1 -1)

@@ -110,7 +110,7 @@ join_issue as (
     {% endif %}
 
     {% if var('jira_include_comments', True) %}
-    ,issue_comments.conversation
+    {{ ',issue_comments.conversation' if var('jira_include_conversations', False if target.type == 'redshift' else True) }}
     ,coalesce(issue_comments.count_comments, 0) as count_comments
     {% endif %}

models/jira.yml (+1)

@@ -125,6 +125,7 @@ models:
       - name: conversation
         description: >
           Line-separated list of comments made on this issue, including the timestamp and author name of each comment.
+          (Disabled by default for Redshift.)
       - name: count_comments
         description: The number of comments made on this issues.
       - name: first_assigned_at

models/jira__daily_issue_field_history.sql (+1 -1)

@@ -1,6 +1,6 @@
 {{
     config(
-        materialized='table' if jira.jira_is_databricks_sql_warehouse() else 'incremental',
+        materialized='incremental' if jira_is_incremental_compatible() else 'table',
         partition_by = {'field': 'date_week', 'data_type': 'date'}
             if target.type not in ['spark', 'databricks'] else ['date_week'],
         cluster_by = ['date_week'],
