Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug/string agg limit #133

Merged
merged 15 commits into from
Nov 12, 2024
Merged

Bug/string agg limit #133

merged 15 commits into from
Nov 12, 2024

Conversation

fivetran-catfritz
Copy link
Contributor

@fivetran-catfritz fivetran-catfritz commented Nov 8, 2024

PR Overview

This PR will address the following Issue/Feature:

This PR will result in the following new package version:

  • v0.19.0 since we'll be changing default enablement for Redshift.

Please provide the finalized CHANGELOG entry which details the relevant changes included in this PR:

Breaking Changes

  • This change is marked as breaking due to its impact on Redshift configurations.
  • For Redshift users, comment data aggregated under the conversations field in the jira__issue_enhanced table is now disabled by default to prevent consistent errors related to Redshift's varchar length limits.
    • If you wish to re-enable comments on Redshift, set the jira_include_comments variable to true in your dbt_project.yml.

Under the Hood

  • Updated the comment seed data to ensure comments are correctly disabled for Redshift by default.
  • Updated the jira_is_databricks_sql_warehouse macro to jira_is_incremental_compatible. This macro now returns true if the Databricks runtime is an all-purpose cluster (previously it checked only for a SQL warehouse runtime) or if the target is any other non-Databricks-supported destination.
    • This update addresses Databricks runtimes (e.g., endpoints and external runtimes) that do not support the insert_overwrite incremental strategy used in the jira__daily_issue_field_history and int_jira__pivot_daily_field_history models.
  • For Databricks users, the jira__daily_issue_field_history and int_jira__pivot_daily_field_history models will now apply the incremental strategy only if running on an all-purpose cluster. All other Databricks runtimes will not utilize an incremental strategy.

PR Checklist

Basic Validation

Please acknowledge that you have successfully performed the following commands locally:

  • dbt run –full-refresh && dbt test
  • dbt run (if incremental models are present) && dbt test

Before marking this PR as "ready for review" the following have been applied:

  • The appropriate issue has been linked, tagged, and properly assigned
  • All necessary documentation and version upgrades have been applied
  • docs were regenerated (unless this PR does not include any code or yml updates)
  • BuildKite integration tests are passing
  • Detailed validation steps have been provided below

Detailed Validation

Please share any and all of your validation steps:

  • Updated the comment seed data to have a 65534 character long body field (varchar max is 65535) (row 0 'AAA...' below).

  • jira_include_comments not defaults to false for redshift and does not fail.

    • Screenshot 2024-11-08 at 6 38 50 PM
  • confirmed still enabled in another warehouse, like BQ for exampple

    • Screenshot 2024-11-08 at 6 49 06 PM
  • consistency tests pass

    • Screenshot 2024-11-08 at 7 11 34 PM

If you had to summarize this PR in an emoji, which would it be?

💃

@fivetran-catfritz fivetran-catfritz self-assigned this Nov 8, 2024
Copy link
Contributor

@fivetran-joemarkiewicz fivetran-joemarkiewicz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fivetran-catfritz thanks for working through this PR! Overall this PR looks really good, I do have an open ended question which I would like your thoughts around before moving forward.

Let me know if you would like to discuss in more detail.

@@ -1,4 +1,4 @@
{{ config(enabled=var('jira_include_comments', True)) }}
{{ config(enabled=var('jira_include_comments', False if target.type == 'redshift' else True)) }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking a closer look at this now, since you were able to identify that the string_agg was likely the only issue I think we may be going overboard with turning off the entirety of this model by default for Redshift users. Especially when there is still some valuable information here (with the count_comments field).

Additionally, if we disable these models by default then we don't end up using the stg_jira__comment staging model at all in the downstream transformations.

image

Can we instead adjust this model in particular to not disable it entirely, but just provide the conditional on the string_agg to include or not include based on the jira_using_comments variable. This way we only ignore this problematic field for Redshift while still keeping the other aggregate comment field which could be useful in downstream models.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would then need to make slight adjustments in the int_jira__issue_join model to account for this dynamic functionality.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fivetran-joemarkiewicz Thanks for the explanation! I updated accordingly, and it's ready for re-review!

Copy link
Contributor

@fivetran-joemarkiewicz fivetran-joemarkiewicz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making this updates @fivetran-catfritz! I have two small comments, but not necessary for blocking approval. Once those are updated this will be good for release review!

README.md Outdated
jira_using_components: false # Enabled by default. Disable if you do not have the component table or do not want component-related metrics reported.
jira_using_versions: false # Enabled by default. Disable if you do not have the versions table or do not want versions-related metrics reported.
jira_using_priorities: false # Enabled by default. Disable if you are not using priorities in Jira.
jira_include_comments: false # Disabled by default for Redshift, enabled by default for other supported warehouses. This package aggregates issue comments so that you have a single view of all your comments in the jira__issue_enhanced table. This can cause limit errors if you have a large dataset. Disable to remove this functionality.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also document the new variable for conversations

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh yes updated!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small note, but I noticed the end model results in no conversation or count_comments because of this join. It seems that since we don't have any author_user_id or user_id combinations within the seed data the inner join always results in no records.

Would you be able to update the seed data to have at least one match so we can verify results in the future.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had updated the seed data to match one, and I'm showing the counts are showing up. Were you using the seed data from my branch?

Screenshot 2024-11-11 at 5 33 42 PM

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very interesting 🤔 thanks for showing the results! It must have been a mismatch on my end with using old seed data in place of your updates. This is all good to me!

Copy link
Contributor Author

@fivetran-catfritz fivetran-catfritz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fivetran-joemarkiewicz thank you! I made the updates, and I think the seed data is ready to go.

README.md Outdated
jira_using_components: false # Enabled by default. Disable if you do not have the component table or do not want component-related metrics reported.
jira_using_versions: false # Enabled by default. Disable if you do not have the versions table or do not want versions-related metrics reported.
jira_using_priorities: false # Enabled by default. Disable if you are not using priorities in Jira.
jira_include_comments: false # Disabled by default for Redshift, enabled by default for other supported warehouses. This package aggregates issue comments so that you have a single view of all your comments in the jira__issue_enhanced table. This can cause limit errors if you have a large dataset. Disable to remove this functionality.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh yes updated!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had updated the seed data to match one, and I'm showing the counts are showing up. Were you using the seed data from my branch?

Screenshot 2024-11-11 at 5 33 42 PM

Copy link
Contributor

@fivetran-jamie fivetran-jamie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, left some questions and doc suggestions

README.md Outdated
jira_using_versions: false # Enabled by default. Disable if you do not have the versions table or do not want versions-related metrics reported.
jira_using_priorities: false # Enabled by default. Disable if you are not using priorities in Jira.
jira_include_comments: false # Enabled by default. Disabling will remove the aggregation of comments via the `count_comments` and `conversations` columns in the `jira__issue_enhanced` table.
jira_include_conversations: false # Disabled by default for Redshift, enabled by default for other supported warehouses. Controls only the `conversation` column in the `jira__issue_enhanced` table. Disabling removes `conversation` columns but keeps the `count_comments` field if `jira_include_comments` is still enabled, which can help avoid limit errors with large datasets.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kinda wonder if this should be in its own section (or subsection of this section) , since there is no conversation source table. It's a little different from the other variables here, so I fear this format might be a lil confusing.

Also, if we do this, we can move the info about the default value out from the yml comment, which you currently need to scroll quite a bit to the right to see the end of https://github.com/fivetran/dbt_jira/tree/bug/string-agg-limit?tab=readme-ov-file#step-4-disable-models-for-non-existent-sources

@fivetran-catfritz

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I moved it!

CHANGELOG.md Outdated

## Under the Hood
- Updated the `comment` seed data to ensure conversations are correctly disabled for Redshift by default.
- Updated the `jira_is_databricks_sql_warehouse` macro to `jira_is_incremental_compatible`. This macro now returns `true` if the Databricks runtime is an all-purpose cluster (previously it checked only for a SQL warehouse runtime) or if the target is any other non-Databricks-supported destination.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify that it's renamed

Suggested change
- Updated the `jira_is_databricks_sql_warehouse` macro to `jira_is_incremental_compatible`. This macro now returns `true` if the Databricks runtime is an all-purpose cluster (previously it checked only for a SQL warehouse runtime) or if the target is any other non-Databricks-supported destination.
- Renamed the `jira_is_databricks_sql_warehouse` macro to `jira_is_incremental_compatible`. Updated this macro to now return `true` if the Databricks runtime is an all-purpose cluster (previously it checked only for a SQL warehouse runtime) or if the target is any other non-Databricks-supported destination.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated!

Copy link
Contributor Author

@fivetran-catfritz fivetran-catfritz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fivetran-jamie Thanks for reviewing! I have made the updates.

README.md Outdated
jira_using_versions: false # Enabled by default. Disable if you do not have the versions table or do not want versions-related metrics reported.
jira_using_priorities: false # Enabled by default. Disable if you are not using priorities in Jira.
jira_include_comments: false # Enabled by default. Disabling will remove the aggregation of comments via the `count_comments` and `conversations` columns in the `jira__issue_enhanced` table.
jira_include_conversations: false # Disabled by default for Redshift, enabled by default for other supported warehouses. Controls only the `conversation` column in the `jira__issue_enhanced` table. Disabling removes `conversation` columns but keeps the `count_comments` field if `jira_include_comments` is still enabled, which can help avoid limit errors with large datasets.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I moved it!

CHANGELOG.md Outdated

## Under the Hood
- Updated the `comment` seed data to ensure conversations are correctly disabled for Redshift by default.
- Updated the `jira_is_databricks_sql_warehouse` macro to `jira_is_incremental_compatible`. This macro now returns `true` if the Databricks runtime is an all-purpose cluster (previously it checked only for a SQL warehouse runtime) or if the target is any other non-Databricks-supported destination.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated!

Copy link
Contributor

@fivetran-jamie fivetran-jamie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great! just a few suggestions in the new readme section

README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
@fivetran-catfritz fivetran-catfritz merged commit 9ece92f into main Nov 12, 2024
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants