New Time Slice SLO #20888
Merged
42 commits
9eedf24
Add time slice to left nav
estherk15 82c8767
Add time slice instructions and images
estherk15 7538e96
Add uptime calculations page
estherk15 1d64886
Add uptime calculations to left nav
estherk15 c7ea2a3
Merge branch 'master' into esther/docs-6808-time-slice-slo
estherk15 adf9578
Merge branch 'master' into esther/docs-6808-time-slice-slo
estherk15 27eb837
Standardize use of Time Slice SLO
estherk15 be1fe17
Remove duplicate file
estherk15 c38a9cd
Merge branch 'esther/docs-6808-time-slice-slo' of github.com:DataDog/…
estherk15 a0b5ba4
Merge uptime with time slice
estherk15 f4a67c0
Merge branch 'master' into esther/docs-6808-time-slice-slo
estherk15 1294f9d
Add SLO comparison chart
estherk15 d7a35d4
Merge branch 'master' into esther/docs-6808-time-slice-slo
estherk15 049744e
Apply code review suggestions
estherk15 7100145
Merge branch 'esther/docs-6808-time-slice-slo' of github.com:DataDog/…
estherk15 4d251f7
Merge branch 'esther/docs-6808-time-slice-slo' of github.com:DataDog/…
estherk15 79a973f
Merge branch 'esther/docs-6808-time-slice-slo' of github.com:DataDog/…
estherk15 95ab6d2
Update content/en/service_management/service_level_objectives/_index.md
estherk15 6389e6c
Apply suggestions from code review
estherk15 efc364f
Apply suggestions from code review, removed commented examples
estherk15 a47fdc7
Add time slice to left nav
estherk15 54d4b96
Add time slice instructions and images
estherk15 490229a
Add uptime calculations page
estherk15 35bc038
Add uptime calculations to left nav
estherk15 34abbfc
Standardize use of Time Slice SLO
estherk15 6a277e6
Remove duplicate file
estherk15 3daa809
Merge uptime with time slice
estherk15 15b3754
Add SLO comparison chart
estherk15 24aa777
Apply code review suggestions
estherk15 43033ad
Update content/en/service_management/service_level_objectives/_index.md
estherk15 b7548a8
Apply suggestions from code review
estherk15 62b88dc
Apply suggestions from code review, removed commented examples
estherk15 36c20d3
Add examples with images
estherk15 24d2e8e
Resolve merge conflicts
estherk15 ee5b98b
Merge branch 'master' into esther/docs-6808-time-slice-slo
estherk15 ecd395f
minor changes
roxanne-moslehi 99ed33a
API info comparison chart
roxanne-moslehi a9ccc35
update comparison chart
roxanne-moslehi 772793a
update comparison chart again
roxanne-moslehi eff235b
fix status correction info
roxanne-moslehi 1764a80
update SLO definitions
roxanne-moslehi 61b31a6
calendar view info
roxanne-moslehi
106 changes: 62 additions & 44 deletions
content/en/service_management/service_level_objectives/_index.md
Large diffs are not rendered by default.
53 changes: 53 additions & 0 deletions
...nt/en/service_management/service_level_objectives/guide/slo_types_comparison.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,53 @@ | ||
--- | ||
title: SLO Type Comparison | ||
kind: Guide | ||
further_reading: | ||
- link: "/service_management/service_level_objectives/" | ||
tag: "Documentation" | ||
text: "Overview of Service Level Objectives" | ||
- link: "/service_management/service_level_objectives/metric/" | ||
tag: "Documentation" | ||
text: "Metric-based SLOs" | ||
- link: "/service_management/service_level_objectives/monitor/" | ||
tag: "Documentation" | ||
text: "Monitor-based SLOs" | ||
- link: "/service_management/service_level_objectives/time_slice/" | ||
tag: "Documentation" | ||
text: "Time Slice SLOs" | ||
--- | ||
|
||
## Overview | ||
|
||
When creating SLOs, you can choose from the following types: | ||
- **Metric-based SLOs**: use these when you want a count-based SLI calculation. The SLI is calculated as the sum of good events divided by the sum of total events. | ||
- **Monitor-based SLOs**: use these when you want a time-based SLI calculation. The SLI is based on the monitor's uptime. Monitor-based SLOs must be based on a new or existing Datadog monitor, and any adjustments must be made to the underlying monitor (they cannot be made through SLO creation). | ||
- **Time Slice SLOs**: use these when you want a time-based SLI calculation. The SLI is based on your custom uptime definition (the amount of time your system exhibits good behavior divided by the total time). Time Slice SLOs do not require a Datadog monitor. You can try out different metric filters and thresholds and instantly explore downtime during SLO creation. | ||
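
To make the count-based versus time-based distinction concrete, here is a minimal Python sketch of the two SLI formulas described in the list. The function names and the sample numbers are hypothetical and only illustrate the arithmetic. They are not part of any Datadog API.

```python
def count_based_sli(good_events: int, total_events: int) -> float:
    """Count-based SLI (metric-based SLOs): good events / total events, in percent."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events * 100


def time_based_sli(uptime_minutes: float, total_minutes: float) -> float:
    """Time-based SLI (monitor-based and Time Slice SLOs): uptime / total time, in percent."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    return uptime_minutes / total_minutes * 100


# Hypothetical inputs: 9,990 good requests out of 10,000,
# and 715 minutes of uptime in a 720-minute window.
print(f"count-based SLI: {count_based_sli(9_990, 10_000):.2f}%")
print(f"time-based SLI:  {time_based_sli(715, 720):.2f}%")
```

The two formulas have the same shape. What differs is the unit being counted: events in one case, time in the other.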
|
||
## Comparison chart | ||
|
||
| | **Metric-based SLO** | **Monitor-based SLO** | **Time Slice SLO** | | ||
|-----------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------| | ||
| **Supported data types** | Metrics with type of count, rate, or distribution | Metric Monitor types, Synthetic Monitors, and Service Checks | All metric types (including gauge metrics) | | ||
| **Functionality for SLO with Groups** | SLO calculated based on all groups<br><br>Can view all groups in SLO side panel, up to 20 groups in SLO summary widget | Supported for SLOs with a single multi alert Monitor<br><br>**Option 1:** SLO calculated based on all groups (can view 5 worst groups in SLO side panel and SLO summary widget)<br>**Option 2:** SLO calculated based on up to 20 selected groups (can view all selected groups in SLO side panel and SLO summary widget) | SLO calculated based on all groups<br><br>Can view all groups in SLO side panel, up to 20 groups in SLO summary widget | | ||
| **SLO details side panel (up to 90 days of historical data)** | Can set custom time windows to view SLO info | Cannot set custom time windows to view SLO info (can view 7, 30, or 90 day history) | Can set custom time windows to view SLO info | | ||
| **SLO alerting ([Error Budget][1] or [Burn Rate][2] Alerts)** | Available | Available for SLOs based on Metric Monitor types only (not available for Synthetic Monitors or Service Checks) | Not available | | ||
| [**SLO Status Corrections**][3] | Correction periods are ignored from SLO status calculation | Correction periods are ignored from SLO status calculation | Correction periods are counted as uptime in SLO status calculation | | ||
| **[SLO Widgets][4] (up to 90 days of historical data)** | Available | Available | Available | | ||
| [**SLO Data Source**][5] | Available (with up to 15 months of historical data) | Not available | Not available | | ||
| **Handling missing data in the SLO calculation** | Missing data is ignored in SLO status and error budget calculations | Missing data is handled based on the [underlying Monitor's configuration][6] | Missing data is treated as uptime in SLO status and error budget calculations | | ||
| **Uptime Calculations** | N/A | Uptime calculations are based on the underlying Monitor <br><br>If groups are present, overall uptime requires *all* groups to have uptime| [Uptime][7] is calculated by looking at discrete time chunks, not rolling time windows<br><br>If groups are present, overall uptime requires *all* groups to have uptime | | ||
| **Calendar View on SLO Manage Page** | Available | Not available | Available | | ||
| **Public [APIs][8] and Terraform Support** | Available | Available | Not available | | ||
For time-slices can we say "Not yet available" or "Coming soon"?
Same as above
||
|
||
## Further Reading | ||
|
||
{{< partial name="whats-next/whats-next.html" >}} | ||
|
||
[1]: https://docs.datadoghq.com/service_management/service_level_objectives/error_budget/ | ||
[2]: https://docs.datadoghq.com/service_management/service_level_objectives/burn_rate/ | ||
[3]: https://docs.datadoghq.com/service_management/service_level_objectives/#slo-status-corrections | ||
[4]: https://docs.datadoghq.com/service_management/service_level_objectives/#slo-widgets | ||
[5]: https://docs.datadoghq.com/dashboards/guide/slo_data_source/ | ||
[6]: https://docs.datadoghq.com/service_management/service_level_objectives/monitor/#missing-data | ||
[7]: /service_management/service_level_objectives/time_slice/#uptime-calculations | ||
[8]: https://docs.datadoghq.com/api/latest/service-level-objectives/ |
95 changes: 95 additions & 0 deletions
content/en/service_management/service_level_objectives/time_slice.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,95 @@ | ||
--- | ||
title: Time Slice SLOs | ||
kind: documentation | ||
is_beta: true | ||
further_reading: | ||
- link: "/service_management/service_level_objectives/" | ||
tag: "Documentation" | ||
text: "Overview of Service Level Objectives" | ||
--- | ||
|
||
{{< jqmath-vanilla >}} | ||
|
||
## Overview | ||
|
||
Time Slice SLOs allow you to measure reliability using a custom definition of uptime. You define uptime as a condition over a metric timeseries. For example, you can create a latency SLO by defining uptime as whenever p95 latency is less than 1 second. | ||
|
||
Time Slice SLOs are a convenient alternative to Monitor-based SLOs. You can create an uptime SLO without going through a monitor, so you don't have to create and maintain both a monitor and an SLO. | ||
|
||
## Create a Time Slice SLO | ||
|
||
You can create a Time Slice SLO in the following ways: | ||
- [Create an SLO from the create page](#create-an-slo-from-the-create-page) | ||
- [Export an existing Monitor-based SLO](#export-an-existing-monitor-slo) | ||
- [Import from a monitor](#import-from-a-monitor) | ||
|
||
### Create an SLO from the create page | ||
|
||
{{< img src="service_management/service_level_objectives/time_slice/create_and_configuration.png" alt="Configuration options to create a Time Slice SLO" style="width:100%;" >}} | ||
|
||
1. Navigate to [**Service Management > SLOs**][1]. | ||
1. Click **+ New SLO** to open up the Create SLO page. | ||
1. Select **By Time Slices** to define your SLO measurement. | ||
1. Define your uptime condition by choosing a metric query, comparator, and threshold. For example, you can define uptime as whenever p95 latency is less than 1s. Alternatively, you can [import the uptime from a monitor](#import-from-a-monitor). | ||
1. Choose your timeframe and target. | ||
1. Name and tag your SLO. | ||
1. Click **Create**. | ||
|
||
### Export an existing monitor SLO | ||
|
||
<div class="alert alert-info">Only single metric monitor SLOs can be exported. Non-metric monitors or multi-monitor SLOs cannot be exported.</div> | ||
|
||
Create a Time Slice SLO by exporting an existing Monitor-based SLO. From a monitor SLO, click **Export to Time Slice SLO**. | ||
|
||
{{< img src="service_management/service_level_objectives/time_slice/export_monitor_slo.png" alt="On a Monitor-based SLO detail side panel, the button to Export to Time Slice is highlighted" style="width:90%;" >}} | ||
|
||
### Import from a monitor | ||
|
||
<div class="alert alert-info">Only metric monitor SLOs appear in the monitor selection for import. </div> | ||
|
||
From the **Create or Edit SLO** page, under **Define your SLI**, click **Import from Monitor** and select from the dropdown or search in the monitor selector. | ||
|
||
**Note**: Time Slice SLOs do not support rolling periods. Rolling periods do not transfer from a monitor query to a Time Slice query. | ||
|
||
{{< img src="service_management/service_level_objectives/time_slice/import_from_monitor.png" alt="Highlighted option to Import From Monitor in the Define your SLI section of an SLO configuration" style="width:90%;" >}} | ||
|
||
## Uptime calculations | ||
|
||
To calculate the uptime percentage for a Time Slice SLO, Datadog cuts the timeseries into equal-duration intervals, called "slices". The length of each interval is 5 minutes and is not configurable. The space and time aggregation are determined by the metric query. For more information on time and space aggregation, see the [metrics][2] documentation. | ||
|
||
Each slice has a single value for the timeseries, and the uptime condition (such as `value < 1`) is evaluated against that value. If the condition is met, the slice is considered uptime; otherwise, it is considered downtime. | ||
|
||
{{< img src="service_management/service_level_objectives/time_slice/uptime_latency.png" alt="Time Slice SLO detail panel showing application latency with one uptime violation" style="width:100%;" >}} | ||
|
||
For the above example, exactly one point in the timeseries violates the uptime condition (in this case, the condition is that the p95 latency is less than or equal to 2.5 seconds). Since the total time period shown is 12 hours (720 minutes), and 715 minutes are considered uptime (720 min total time - 5 min downtime), the uptime percentage is 715/720 * 100, or approximately 99.306%. | ||
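
The slice-based calculation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the per-slice values are made up, and `uptime_percentage` is a hypothetical helper, not a Datadog function. The 5-minute slice length and the `<= 2.5` condition come from the example in the text.

```python
from typing import Callable, Sequence

SLICE_MINUTES = 5  # slice length stated in the text; not configurable


def uptime_percentage(
    slice_values: Sequence[float],
    is_uptime: Callable[[float], bool],
) -> float:
    """Evaluate the uptime condition once per slice and return the uptime share in percent."""
    if not slice_values:
        raise ValueError("need at least one slice")
    good_slices = sum(1 for value in slice_values if is_uptime(value))
    return good_slices / len(slice_values) * 100


# 12 hours of 5-minute slices = 144 slices (hypothetical p95 latency in seconds).
values = [1.2] * (12 * 60 // SLICE_MINUTES)
values[40] = 3.1  # a single slice violates the condition

pct = uptime_percentage(values, lambda latency: latency <= 2.5)
downtime_minutes = sum(1 for v in values if v > 2.5) * SLICE_MINUTES
print(f"downtime: {downtime_minutes} min, uptime: {pct:.3f}%")
```

One bad slice out of 144 is the same ratio as 5 bad minutes out of 720, so the result matches the 715/720 figure in the example.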
|
||
### Groups and overall uptime | ||
|
||
Time Slice SLOs allow you to track uptime for individual groups, where groups are defined in the "group by" portion of the metric query. | ||
|
||
When groups are present, uptime is calculated for each individual group. However, overall uptime works differently. To match existing monitor SLO functionality, Time Slice SLOs use the same definition of overall uptime. When **all** groups have uptime, it is considered overall uptime. Conversely, if **any** group has downtime, it is considered overall downtime. As a result, overall uptime is never greater than the uptime of any individual group. | ||
|
||
{{< img src="service_management/service_level_objectives/time_slice/uptime_latency_groups.png" alt="Time Slice SLO detail panel of application latency uptime with groups" style="width:100%;" >}} | ||
|
||
In the example above, environment "prod" had 5 minutes of downtime over a 12-hour (720-minute) period, resulting in approximately 715/720 * 100 = 99.306% uptime. Environment "dev" also had 5 minutes of downtime, resulting in the same uptime. Because the two downtime periods do not overlap, overall downtime (when either prod or dev had downtime) was 10 minutes, resulting in approximately (720-10)/720 * 100 = 98.611% uptime. | ||
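
As a rough illustration of the "all groups must be up" rule, the following Python sketch computes per-group and overall uptime from per-slice flags. The helper names and the positions of the bad slices are hypothetical. Only the rule itself and the 12-hour, two-group setup come from the text.

```python
from typing import Mapping, Sequence


def group_uptime(flags: Sequence[bool]) -> float:
    """Uptime percentage for one group; each flag is True when that slice is uptime."""
    if not flags:
        raise ValueError("need at least one slice")
    return sum(flags) / len(flags) * 100


def overall_uptime(groups: Mapping[str, Sequence[bool]]) -> float:
    """A slice counts as overall uptime only when every group is up in that slice."""
    lengths = {len(flags) for flags in groups.values()}
    if len(lengths) != 1 or 0 in lengths:
        raise ValueError("all groups need the same non-zero number of slices")
    good_slices = sum(1 for per_slice in zip(*groups.values()) if all(per_slice))
    return good_slices / lengths.pop() * 100


# 144 five-minute slices per group; each group has one bad slice at a different time.
prod = [True] * 144
prod[10] = False
dev = [True] * 144
dev[90] = False
groups = {"prod": prod, "dev": dev}

for name, flags in groups.items():
    print(f"{name}: {group_uptime(flags):.3f}%")
print(f"overall: {overall_uptime(groups):.3f}%")
```

If both groups had their bad slice at the same time, overall uptime would equal the per-group uptime instead of dropping below it.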
|
||
### Corrections | ||
|
||
Time Slice SLOs count correction periods as uptime in all calculations. Since the total time remains constant, the error budget is always a fixed amount of time as well. This is a significant simplification and improvement over how corrections are handled for monitor-based SLOs. | ||
|
||
For monitor-based SLOs, corrections are periods that are removed from the calculation. If a one-day-long correction is added to a 7-day SLO, the total time shrinks from 10,080 minutes to 8,640 minutes, so 1 hour of downtime counts as approximately 0.7% of the total instead of 0.6%: | ||
|
||
$$ 60/8640 *100 = ~0.7% $$ | ||
|
||
The effects on error budget can be unusual. Removing time from an uptime SLO causes time dilation, where each minute of downtime represents a larger fraction of the total time. | ||
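
The difference between the two correction behaviors can be checked with a short Python sketch. The function names are hypothetical, and the sketch assumes the hour of downtime falls outside the corrected period.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_share_monitor_based(
    downtime_minutes: float, window_days: int, correction_days: int
) -> float:
    """Monitor-based behavior: corrected time is removed from the total time."""
    remaining_minutes = (window_days - correction_days) * MINUTES_PER_DAY
    if remaining_minutes <= 0:
        raise ValueError("correction must be shorter than the window")
    return downtime_minutes / remaining_minutes * 100


def downtime_share_time_slice(downtime_minutes: float, window_days: int) -> float:
    """Time Slice behavior: corrected time counts as uptime, so the total stays fixed."""
    return downtime_minutes / (window_days * MINUTES_PER_DAY) * 100


# 1 hour of downtime in a 7-day window with a 1-day correction.
print(f"monitor-based: {downtime_share_monitor_based(60, 7, 1):.2f}%")
print(f"time slice:    {downtime_share_time_slice(60, 7):.2f}%")
```

With the total fixed at 10,080 minutes, the same hour of downtime always consumes the same share of the error budget, regardless of how many corrections are added.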
|
||
### Missing data | ||
|
||
In Time Slice SLOs, missing data is always treated as uptime, although it appears gray on the timeline visualization. | ||
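
A minimal sketch of this rule, assuming a missing slice is represented as `None` and using a hypothetical `value < threshold` uptime condition:

```python
from typing import Optional, Sequence


def classify_slice(value: Optional[float], threshold: float) -> str:
    """Label a slice; a slice with no data is tracked separately but is not downtime."""
    if value is None:
        return "missing"
    return "uptime" if value < threshold else "downtime"


def uptime_percentage_with_gaps(
    values: Sequence[Optional[float]], threshold: float
) -> float:
    """Count both 'uptime' and 'missing' slices as uptime, as the text describes."""
    if not values:
        raise ValueError("need at least one slice")
    states = [classify_slice(value, threshold) for value in values]
    good_slices = sum(1 for state in states if state != "downtime")
    return good_slices / len(states) * 100


# Four hypothetical slices: one has no data, one violates the condition.
samples = [0.5, None, 1.5, 0.7]
print([classify_slice(v, 1.0) for v in samples])
print(uptime_percentage_with_gaps(samples, 1.0))
```

Keeping a separate "missing" label mirrors the visualization, where such slices are shown in gray even though they do not reduce the uptime percentage.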
|
||
## Further Reading | ||
|
||
{{< partial name="whats-next/whats-next.html" >}} | ||
|
||
[1]: https://app.datadoghq.com/slo/manage | ||
[2]: /metrics/#time-and-space-aggregation |
Binary file added
BIN
+619 KB
static/images/service_management/service_level_objectives/calendar-view-slo.png
Binary file added
BIN
+292 KB
...ice_management/service_level_objectives/time_slice/create_and_configuration.png
Binary file added
BIN
+129 KB
...s/service_management/service_level_objectives/time_slice/export_monitor_slo.png
Binary file added
BIN
+93.3 KB
.../service_management/service_level_objectives/time_slice/import_from_monitor.png
Binary file added
BIN
+504 KB
...anagement/service_level_objectives/time_slice/time_slice_detail_panel_group.png
Binary file added
BIN
+237 KB
...mages/service_management/service_level_objectives/time_slice/uptime_latency.png
Binary file added
BIN
+494 KB
...ervice_management/service_level_objectives/time_slice/uptime_latency_groups.png
For time slices can we say "Coming soon".
We try to stay away from future promises in docs. We can update the docs as soon as features are available!