Bug/string agg limit #133

fivetran-catfritz · 2024-11-08T23:48:48Z

PR Overview

This PR will address the following Issue/Feature:

[Bug] fivetran_utils.string_agg uses occasionally cause failures due to size limitations on redshift #132

This PR will result in the following new package version:

v0.19.0 since we'll be changing default enablement for Redshift.

Please provide the finalized CHANGELOG entry which details the relevant changes included in this PR:

Breaking Changes

This change is marked as breaking due to its impact on Redshift configurations.

For Redshift users, comment data aggregated under the conversations field in the jira__issue_enhanced table is now disabled by default to prevent consistent errors related to Redshift's varchar length limits.

If you wish to re-enable comments on Redshift, set the jira_include_comments variable to true in your dbt_project.yml.

Under the Hood

Updated the comment seed data to ensure comments are correctly disabled for Redshift by default.

Updated the jira_is_databricks_sql_warehouse macro to jira_is_incremental_compatible. This macro now returns true if the Databricks runtime is an all-purpose cluster (previously it checked only for a SQL warehouse runtime) or if the target is any other non-Databricks-supported destination.

This update addresses Databricks runtimes (e.g., endpoints and external runtimes) that do not support the insert_overwrite incremental strategy used in the jira__daily_issue_field_history and int_jira__pivot_daily_field_history models.

For Databricks users, the jira__daily_issue_field_history and int_jira__pivot_daily_field_history models will now apply the incremental strategy only if running on an all-purpose cluster. All other Databricks runtimes will not utilize an incremental strategy.
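A minimal sketch of how a model config might consume the renamed macro is shown here. The macro name comes from the notes above; the `jira.` namespace prefix, the `table` fallback, the placeholder upstream model, and the `date_day` column are assumptions for illustration, not the package's actual config.

```sql
-- Sketch only: build incrementally where the runtime supports it, otherwise rebuild as a table.
{{
    config(
        materialized='incremental' if jira.jira_is_incremental_compatible() else 'table'
    )
}}

select *
from {{ ref('hypothetical_upstream_model') }}
{% if is_incremental() %}
-- Only applied on incremental runs; date_day is an assumed column name.
where date_day >= (select max(date_day) from {{ this }})
{% endif %}
```

When the macro returns `false`, `is_incremental()` also evaluates to false, so the filter is skipped and the model is rebuilt in full.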

PR Checklist

Basic Validation

Please acknowledge that you have successfully performed the following commands locally:

dbt run --full-refresh && dbt test
dbt run (if incremental models are present) && dbt test

Before marking this PR as "ready for review" the following have been applied:

The appropriate issue has been linked, tagged, and properly assigned
All necessary documentation and version upgrades have been applied
docs were regenerated (unless this PR does not include any code or yml updates)
BuildKite integration tests are passing
Detailed validation steps have been provided below

Detailed Validation

Please share any and all of your validation steps:

Updated the comment seed data to have a 65534-character body field (Redshift's varchar max is 65535) (row 0 'AAA...' below).
jira_include_comments now defaults to false for Redshift, and the run does not fail.
Confirmed it is still enabled in another warehouse, like BQ, for example.
consistency tests pass

If you had to summarize this PR in an emoji, which would it be?

💃

integration_tests/tests/consistency/consistency_project_enhanced.sql

fivetran-joemarkiewicz

@fivetran-catfritz thanks for working through this PR! Overall this PR looks really good. I do have an open-ended question that I would like your thoughts on before moving forward.

Let me know if you would like to discuss in more detail.

fivetran-joemarkiewicz · 2024-11-11T16:45:04Z

models/intermediate/int_jira__issue_comments.sql

@@ -1,4 +1,4 @@
-{{ config(enabled=var('jira_include_comments', True)) }}
+{{ config(enabled=var('jira_include_comments', False if target.type == 'redshift' else True)) }}


Taking a closer look at this now: since you were able to identify that the string_agg was likely the only issue, I think we may be going overboard by turning off this entire model by default for Redshift users, especially when there is still some valuable information here (the count_comments field).

Additionally, if we disable these models by default then we don't end up using the stg_jira__comment staging model at all in the downstream transformations.

Can we instead adjust this model so that it is not disabled entirely, and just make the string_agg conditional on the jira_using_comments variable? This way we only skip the problematic field for Redshift while still keeping the other aggregate comment field, which could be useful in downstream models.

We would then need to make slight adjustments in the int_jira__issue_join model to account for this dynamic functionality.

@fivetran-joemarkiewicz Thanks for the explanation! I updated accordingly, and it's ready for re-review!
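For reference, the approach suggested in this thread could look roughly like the sketch below: the model stays enabled, and only the `string_agg` column is wrapped in a conditional. The `jira_include_conversations` variable name follows the README update later in this PR. The CTE names, the `stg_jira__user` reference, the `comment_id` column, the delimiter, and the exact `fivetran_utils.string_agg` arguments are assumptions, so treat this as an illustration rather than the merged code.

```sql
-- Sketch only: the model stays enabled; just the string_agg column is gated.
{{ config(enabled=var('jira_include_comments', True)) }}

with comments as (

    select *
    from {{ ref('stg_jira__comment') }}
),

jira_user as (

    -- Assumed staging model name for users.
    select *
    from {{ ref('stg_jira__user') }}
),

agg_comments as (

    select
        comments.issue_id,
        -- Always kept; comment_id is an assumed column name.
        count(comments.comment_id) as count_comments
        {% if var('jira_include_conversations', False if target.type == 'redshift' else True) %}
        -- Comment bodies are aggregated only when conversations are enabled (off by default on Redshift).
        , {{ fivetran_utils.string_agg("comments.body", "', '") }} as conversation
        {% endif %}
    from comments
    inner join jira_user
        on comments.author_user_id = jira_user.user_id
    group by 1
)

select *
from agg_comments
```

Downstream, int_jira__issue_join would then need to select `conversation` only when the same variable is enabled, as noted above.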

fivetran-joemarkiewicz

Thanks for making these updates @fivetran-catfritz! I have two small comments, but they are not blocking approval. Once those are addressed this will be good for release review!

fivetran-joemarkiewicz · 2024-11-11T21:29:05Z

README.md

+    jira_using_components: false # Enabled by default. Disable if you do not have the component table or do not want component-related metrics reported.
+    jira_using_versions: false   # Enabled by default. Disable if you do not have the versions table or do not want versions-related metrics reported.
+    jira_using_priorities: false # Enabled by default. Disable if you are not using priorities in Jira.
+    jira_include_comments: false # Disabled by default for Redshift, enabled by default for other supported warehouses. This package aggregates issue comments so that you have a single view of all your comments in the jira__issue_enhanced table. This can cause limit errors if you have a large dataset. Disable to remove this functionality.


Can we also document the new variable for conversations?

Ahh yes updated!

fivetran-joemarkiewicz · 2024-11-11T22:22:49Z

integration_tests/seeds/comment.csv

Small note, but I noticed the end model results in no conversation or count_comments because of this join. It seems that since we don't have any author_user_id or user_id combinations within the seed data the inner join always results in no records.

Would you be able to update the seed data to have at least one match so we can verify results in the future?

I had updated the seed data to include one match, and the counts are showing up on my end. Were you using the seed data from my branch?

Very interesting 🤔 thanks for showing the results! It must have been a mismatch on my end with using old seed data in place of your updates. This is all good to me!
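A quick way to check this kind of mismatch is to count the seed comments that survive the inner join. This is only a sketch: the `author_user_id` and `user_id` columns come from the comment above, while the `user` seed name and the idea of running this as an ad hoc compiled query are assumptions.

```sql
-- Sketch only: count seed comments whose author matches a row in the user seed.
select count(*) as matched_comments
from {{ ref('comment') }} as c
inner join {{ ref('user') }} as u
    on c.author_user_id = u.user_id
```

A result of zero would mean the inner join in the comments model drops every row, which matches the empty conversation and count_comments described above.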

fivetran-catfritz

@fivetran-joemarkiewicz thank you! I made the updates, and I think the seed data is ready to go.

fivetran-jamie

looks good, left some questions and doc suggestions

integration_tests/tests/consistency/consistency_project_enhanced.sql

integration_tests/tests/consistency/consistency_user_enhanced.sql

fivetran-jamie · 2024-11-12T19:42:20Z

README.md

+    jira_using_versions: false   # Enabled by default. Disable if you do not have the versions table or do not want versions-related metrics reported.
+    jira_using_priorities: false # Enabled by default. Disable if you are not using priorities in Jira.
+    jira_include_comments: false # Enabled by default. Disabling will remove the aggregation of comments via the `count_comments` and `conversations` columns in the `jira__issue_enhanced` table.
+    jira_include_conversations: false # Disabled by default for Redshift, enabled by default for other supported warehouses. Controls only the `conversation` column in the `jira__issue_enhanced` table.  Disabling removes `conversation` columns but keeps the `count_comments` field if `jira_include_comments` is still enabled, which can help avoid limit errors with large datasets.


I kinda wonder if this should be in its own section (or a subsection of this section), since there is no conversation source table. It's a little different from the other variables here, so I fear this format might be a lil confusing.

Also, if we do this, we can move the info about the default value out of the yml comment. Currently you need to scroll quite a bit to the right to see the end of it: https://github.com/fivetran/dbt_jira/tree/bug/string-agg-limit?tab=readme-ov-file#step-4-disable-models-for-non-existent-sources

@fivetran-catfritz

Good point. I moved it!

fivetran-jamie · 2024-11-12T19:51:12Z

CHANGELOG.md

+
+## Under the Hood
+- Updated the `comment` seed data to ensure conversations are correctly disabled for Redshift by default.
+- Updated the `jira_is_databricks_sql_warehouse` macro to `jira_is_incremental_compatible`. This macro now returns `true` if the Databricks runtime is an all-purpose cluster (previously it checked only for a SQL warehouse runtime) or if the target is any other non-Databricks-supported destination.


Just to clarify that it's renamed

Suggested change

- Updated the `jira_is_databricks_sql_warehouse` macro to `jira_is_incremental_compatible`. This macro now returns `true` if the Databricks runtime is an all-purpose cluster (previously it checked only for a SQL warehouse runtime) or if the target is any other non-Databricks-supported destination.

- Renamed the `jira_is_databricks_sql_warehouse` macro to `jira_is_incremental_compatible`. Updated this macro to now return `true` if the Databricks runtime is an all-purpose cluster (previously it checked only for a SQL warehouse runtime) or if the target is any other non-Databricks-supported destination.

fivetran-catfritz

@fivetran-jamie Thanks for reviewing! I have made the updates.

integration_tests/tests/consistency/consistency_project_enhanced.sql

integration_tests/tests/consistency/consistency_user_enhanced.sql

fivetran-jamie

looks great! just a few suggestions in the new readme section

README.md

Co-authored-by: Jamie Rodriguez <[email protected]>

fivetran-catfritz added 2 commits November 8, 2024 17:32

bug/string-agg-limit

f4dc4db

update changelog & regen docs

86dc929

fivetran-catfritz self-assigned this Nov 8, 2024

fivetran-catfritz added 6 commits November 8, 2024 17:52

Update int_jira__pivot_daily_field_history.sql

196dbb6

update incremental compatible and regen docs

88400a9

Merge branch 'bug/string-agg-limit' of https://github.com/fivetran/db…

cf7d12a

…t_jira into bug/string-agg-limit merge conflict

regen docs

c9fe823

update yml

655c552

add tests

6ec7632

fivetran-catfritz commented Nov 9, 2024

integration_tests/tests/consistency/consistency_project_enhanced.sql

fivetran-catfritz requested a review from fivetran-joemarkiewicz November 9, 2024 01:15

fivetran-joemarkiewicz requested changes Nov 11, 2024

fivetran-catfritz added 3 commits November 11, 2024 13:26

update conversation disablement

5097e52

regen docs

1f12065

Update int_jira__issue_comments.sql

a3c5468

fivetran-joemarkiewicz approved these changes Nov 11, 2024

update readme

e654e99

fivetran-catfritz commented Nov 11, 2024

fivetran-jamie reviewed Nov 12, 2024

fivetran-catfritz added 2 commits November 12, 2024 14:51

release review updates

d26b848

update readme

27c9c11

fivetran-catfritz commented Nov 12, 2024

fivetran-jamie approved these changes Nov 12, 2024

Apply suggestions from code review

a49fa75

Co-authored-by: Jamie Rodriguez <[email protected]>

fivetran-catfritz merged commit 9ece92f into main Nov 12, 2024
9 checks passed
