Troubleshooting Query Plan Regressions guide #20893
@@ -0,0 +1,141 @@
---
title: Troubleshoot Query Plan Regressions
summary: Troubleshooting guide for when the cost-based optimizer chooses a new query plan that slows performance.
keywords: query plan, cost-based optimizer, troubleshooting
toc: true
docs_area: manage
---

This page provides guidance on identifying the source of [query plan]({% link {{page.version.version}}/cost-based-optimizer.md %}) regressions.

If the [cost-based optimizer]({% link {{page.version.version}}/cost-based-optimizer.md %}) chooses a new query plan that slows performance, you may observe an unexpected increase in query latency. There are several reasons that the optimizer might choose a plan that increases execution time. This guide will help you understand, identify, and diagnose query plan regressions using built-in CockroachDB tools.

## Before you begin

- [Understand how the cost-based optimizer chooses query plans]({% link {{page.version.version}}/cost-based-optimizer.md %}) based on table statistics, and how those statistics are refreshed.
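
For example, to see the statistics the optimizer is currently working from, and when they were last collected, you can run a query like the following. This is a minimal sketch; the table name `users` is illustrative.

```sql
-- Show the statistics (row counts, distinct counts, null counts, and
-- collection timestamps) that the optimizer currently has for the table.
SHOW STATISTICS FOR TABLE users;
```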

## Query plan regressions vs. suboptimal plans

The DB Console's [**Insights** page]({% link {{page.version.version}}/ui-insights-page.md %}) keeps track of [suboptimal plans]({% link {{page.version.version}}/ui-insights-page.md %}#suboptimal-plan). A suboptimal plan is a query plan whose execution time exceeds a certain threshold (configurable with the `sql.insights.latency_threshold` cluster setting) and whose slow execution has caused CockroachDB to generate an index recommendation for the table. Table statistics that were once valid, but are now stale, can lead to a suboptimal plan scenario. A suboptimal plan scenario does not imply that the query plan has changed; in fact, a failure to change the query plan is often the root problem. The **Insights** page identifies these scenarios and provides recommendations on how to fix them.
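
For reference, a minimal sketch of inspecting and adjusting that threshold; the `100ms` value below is only an example, not a recommendation.

```sql
-- View the current latency threshold used to flag insights.
SHOW CLUSTER SETTING sql.insights.latency_threshold;

-- Adjust the threshold; executions slower than this may generate insights.
SET CLUSTER SETTING sql.insights.latency_threshold = '100ms';
```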

A query plan regression occurs when the cost-based optimizer chooses an optimal query plan, but later changes that query plan to a less optimal one. It is not the same as a suboptimal plan, though it is possible that the conditions that triggered a suboptimal plan insight were caused by a query plan regression.

A suboptimal plan scenario that is not a regression will not increase query latency, because the query plan has not changed. A suboptimal plan scenario is instead the system's failure to decrease latency when it could have. A query plan regression, by contrast, will likely increase query latency.

Though these two scenarios are conceptually different, both will likely require an update to the problematic query plan.

## What to look out for

Query plan regressions only increase the execution time of SQL statements that use that plan. This means that the overall service latency of the cluster will only be affected during the execution of statements that are run with the problematic query plan.

Suggested change (for the paragraph above):
- Suggested: Query plan regressions only increase the execution time of SQL statements that use the affected plan. This means that the overall service latency of the cluster will only be affected during the execution of statements that are run with the problematic query plan.

Suggested change:
- Original: This might make those latency spikes harder to identify. For example, if the problematic plan only affects a query that's run on an infrequent, ad-hoc basis, it might be difficult to notice a pattern among the graphs on the [**Metrics** page]({% link {{page.version.version}}/ui-overview.md %}#metrics).
- Suggested: As a result, these latency spikes can be hard to identify. For example, if the problematic plan only affects a query that's run on an infrequent, ad-hoc basis, it might be difficult to notice a pattern among the graphs on the [Metrics page]({% link {{page.version.version}}/ui-overview.md %}#metrics).

Suggested change:
- Original: If you observe that your application is responding more slowly than usual, and this behavior hasn't been explained by recent changes to table schemas or data, or by changes to cluster workloads, it's worth considering a query plan regression.
- Suggested: If you observe that your application is responding more slowly than usual, and this behavior isn't explained by recent changes to table schemas, data, or cluster workloads, it's worth considering a query plan regression.

As I mentioned in another comment, my understanding of "slow execution" and "suboptimal plan" insights is that they cannot really be used to find or troubleshoot query plan regressions, so I'd remove the "Use workload insights" approach altogether. That said, it might be worth reaching out to TSEs / EEs to check whether their experience matches my understanding.

nit: capitalize IMPORT and perhaps link to the IMPORT docs page.

An increase in Average Rows Read could indicate a query plan regression, since it's possible that the bad query plan is scanning more rows than it should.

But as I think you're intending to show, an increase in Average Rows Read could also indicate that more data was added to the table. It's probably worth mentioning both possibilities here.

To me it seems more likely that a significant increase (like an order-of-magnitude growth) in Average Rows Read is actually due to a plan regression, rather than due to table size growth, since we're comparing two plans for the given query fingerprint that presumably were executed close to each other in time. I agree, though, that both are possibilities.
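
As a loose sketch of one way to compare rows read across plans for a single statement fingerprint outside the DB Console, the query below reads from `crdb_internal.statement_statistics`. The JSON paths and the `LIKE` filter are assumptions and may differ across versions.

```sql
-- Compare average rows read per plan for one statement fingerprint.
-- NOTE: the JSON layout of the statistics column is assumed and may vary.
SELECT plan_hash,
       metadata->>'query' AS query,
       statistics->'statistics'->'rowsRead'->>'mean' AS avg_rows_read
FROM crdb_internal.statement_statistics
WHERE metadata->>'query' LIKE 'SELECT%FROM orders%'  -- hypothetical filter
ORDER BY aggregated_ts DESC;
```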

Suggested change:
- Original: 1. In the **Explain Plans** tab, click on the Plan Gist of the more recent plan to see it in more detail.
- Suggested: 1. In the **Explain Plans** tab, click the **Plan Gist** of the more recent plan to view its details.
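
As an aside, a plan gist can also be expanded in SQL rather than in the DB Console; a minimal sketch, where the gist string is a placeholder rather than a real value:

```sql
-- Decode a plan gist into a readable plan outline.
-- 'AgHQAQIAHwAAAAMGBA==' is a placeholder, not a real gist.
SELECT crdb_internal.decode_plan_gist('AgHQAQIAHwAAAAMGBA==');
```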

Suggested change:
- Original: 2. Click on **All Plans** above to return to the list of plans.
- Suggested: 2. Click on **All Plans** above to return to the full list.

nit: the different portions -> different portions

Suggested change:
- Original: 1. Look at the **Used Indexes** column for the older and the newer query plans. If these aren't the same, it's likely that the creation or deletion of an index resulted in a change to the statement's query plan.
- Suggested: 1. Check the **Used Indexes** column for both the older and newer query plans. If these aren't the same, it's likely that the creation or deletion of an index resulted in a change to the statement's query plan.
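
To cross-check the **Used Indexes** column outside the DB Console, a minimal sketch, assuming a table named `orders` (the name is illustrative):

```sql
-- List the indexes currently defined on the table, to compare against
-- the indexes used by the older and newer plans.
SHOW INDEXES FROM orders;
```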

nit: s/table/tables/ - it's possible that we have initial scans of multiple tables.

Suggested change:
- Original: 1. In the **Explain Plans** tab, click on the Plan Gist of the more recent plan to see it in more detail. Identify the table used in the initial "scan" step of the plan.
- Suggested: 1. In the **Explain Plans** tab, click the Plan Gist of the more recent plan to view its details. Identify the table used in the initial "scan" step of the plan.

Suggested change:
- Original: If you suspect that the query plan change is the cause of the latency increase, and you suspect that the query plan changed due to stale statistics, you may want to [manually refresh the statistics for the table]({% link {{ page.version.version }}/create-statistics.md %}#examples).
- Suggested: If you suspect that stale statistics caused the plan change and resulting latency increase, consider [manually refreshing the table’s statistics]({% link {{ page.version.version }}/create-statistics.md %}#examples).
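
A minimal sketch of such a manual refresh, assuming a table named `orders` (the name is illustrative):

```sql
-- Recollect statistics for the table so the optimizer plans against
-- current row counts and value distributions.
CREATE STATISTICS orders_stats FROM orders;
```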

I'm not sure there is a good way to determine this without collecting a conditional statement bundle for a slow execution of the statement fingerprint (unless the DB operator happens to know that the application is using a new value for a particular placeholder). Maybe @yuzefovich has another idea?

Oof, yeah, this is a hard one. The tutorial so far assumes that there is a single good plan for a query fingerprint that might have regressed, but it's actually possible that multiple plans are good, depending on the values of placeholders ("literals").

Here is an example of two different optimal plans (although they do look similar):

```sql
CREATE TABLE small (k INT PRIMARY KEY, v INT);
CREATE TABLE large (k INT PRIMARY KEY, v INT, INDEX (v));
INSERT INTO small SELECT i, i FROM generate_series(1, 10) AS g(i);
INSERT INTO large SELECT i, 1 FROM generate_series(1, 10000) AS g(i);
ANALYZE small;
ANALYZE large;
-- this scans `large` on the _left_ side of the merge join
EXPLAIN SELECT * FROM small INNER JOIN large ON small.v = large.v AND small.v = 1;
-- this scans `large` on the _right_ side of the merge join
EXPLAIN SELECT * FROM small INNER JOIN large ON small.v = large.v AND small.v = 2;
```

Complicating things is that we deal with query fingerprints internally, so all such constants are removed from our observability tooling. If there was an escalation saying that a particular query fingerprint is occasionally slow, similar to Becca I'd have asked for a conditional statement bundle, and then I'd play around locally with different values of placeholders to see whether multiple plans could be chosen based on concrete placeholder values. But so far we've used statement bundles mostly as internal (to the Queries team in particular and Cockroach Labs support in general) tooling, so I'd probably not mention going down this route.

Instead, I'd consider suggesting looking into the application side to see whether the literal has changed or something like that.

> what should you do

The likely problem is that the query stats don't accurately reflect how this value is represented in the data. This can be fixed by running `ANALYZE <table>` to refresh the stats for the table. It's also possible that a good index isn't available, which could be fixed by checking the index recommendations displayed by EXPLAIN-ing the query or on the Insights page. If none of these options fixes the issue, a more drastic redesign of the schema/application may be needed.
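
A minimal sketch of those two checks; the table name `orders` and the predicate are illustrative only.

```sql
-- Refresh statistics so the optimizer sees current data distributions.
ANALYZE orders;

-- Re-plan the query; recent CockroachDB versions append index
-- recommendations to EXPLAIN output when a better index may help.
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
```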