Production - [Alerting] DotNetEng Status Failed Requests/Hour alert #4920

Open
dotnet-eng-status bot opened this issue Feb 5, 2025 · 6 comments
Labels: Critical, Grafana Alert (Issues opened by Grafana), Inactive Alert (Issues from Grafana alerts that are now "OK"), Ops - First Responder, Production (Tied to the Production environment (as opposed to Staging))

dotnet-eng-status bot commented Feb 5, 2025

💔 Metric state changed to alerting

The number of failed DotNetEng Status requests per hour is above 20. This may indicate a systemic problem that needs to be investigated.
To initially investigate prod, run the following query in DotNetEng-Status-Prod; to investigate staging, run the same query in DotNetEng-Status-Staging:

union exceptions, traces
| project timestamp, operation_Name, customDimensions, message, problemId, details
| order by timestamp asc
  • failuresCount 191.2

Go to rule
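
If the alert fires again, a quick way to narrow down which class of requests is failing is to group the same union by problemId and operation_Name. This is only a sketch: the one-hour lookback window is an assumption, not part of the alert definition.

union exceptions, traces
| where timestamp > ago(1h)
// group failures by exception problem and operation to find the failing class of requests
| summarize count() by problemId, operation_Name
| order by count_ desc

Rows coming from traces will have an empty problemId; the exceptions rows point at the operations that are actually failing.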

@dotnet/dnceng, @dotnet/prodconsvcs, please investigate

Automation information below, do not change

Grafana-Automated-Alert-Id-d2dd705a6c724ed68fcf6955561c06dd

dotnet-eng-status bot added the Active Alert (Issues from Grafana alerts that are now active), Critical, Grafana Alert (Issues opened by Grafana), Ops - First Responder, and Production (Tied to the Production environment (as opposed to Staging)) labels on Feb 5, 2025
dotnet-eng-status bot (Author)

💔 Metric state changed to alerting

The number of failed DotNetEng Status requests per hour is above 20. This may indicate a systemic problem that needs to be investigated.
To initially investigate prod, run the following query in DotNetEng-Status-Prod; to investigate staging, run the same query in DotNetEng-Status-Staging:

union exceptions, traces
| project timestamp, operation_Name, customDimensions, message, problemId, details
| order by timestamp asc
  • failuresCount 191.2

Go to rule

dotnet-eng-status bot added the Inactive Alert (Issues from Grafana alerts that are now "OK") label and removed the Active Alert (Issues from Grafana alerts that are now active) label on Feb 5, 2025
dotnet-eng-status bot (Author)

💚 Metric state changed to ok

The number of failed DotNetEng Status requests per hour is above 20. This may indicate a systemic problem that needs to be investigated.
To initially investigate prod, run the following query in DotNetEng-Status-Prod; to investigate staging, run the same query in DotNetEng-Status-Staging:

union exceptions, traces
| project timestamp, operation_Name, customDimensions, message, problemId, details
| order by timestamp asc

Go to rule

dotnet-eng-status bot (Author)

💚 Metric state changed to ok

The number of failed DotNetEng Status requests per hour is above 20. This may indicate a systemic problem that needs to be investigated.
To initially investigate prod, run the following query in DotNetEng-Status-Prod; to investigate staging, run the same query in DotNetEng-Status-Staging:

union exceptions, traces
| project timestamp, operation_Name, customDimensions, message, problemId, details
| order by timestamp asc

Go to rule

haruna99 self-assigned this on Feb 11, 2025
haruna99 (Contributor) commented Feb 11, 2025

The number of failed DotNetEng Status requests per hour is below 20. Closing issue.

garath (Member) commented Feb 11, 2025

> The number of failed DotNetEng Status requests per hour is below 20. Closing issue.

Did the failed requests drop because the overall request count also dropped? Or is there a class of requests that spiked in that window and continues to fail? It would be good to dig in a bit to understand what was failing.

haruna99 (Contributor) commented Feb 11, 2025

> The number of failed DotNetEng Status requests per hour is below 20. Closing issue.

> Did the failed requests drop because the overall request count also dropped? Or is there a class of requests that spiked in that window and continues to fail? It would be good to dig in a bit to understand what was failing.

I will conduct further investigation to better understand the cause of the failure.
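
A sketch of one such follow-up, assuming the classic Application Insights requests schema (where success is a string column; adjust the comparison if the workspace-based AppRequests table is used instead) and an arbitrary seven-day window, that separates "fewer failures" from "fewer requests overall" by charting failed versus total requests per hour:

requests
| where timestamp > ago(7d)
// total and failed request counts per hour; failureRate distinguishes a drop in failures from a drop in traffic
| summarize total = count(), failed = countif(success == "False") by bin(timestamp, 1h)
| extend failureRate = todouble(failed) / total
| order by timestamp asc

If failureRate stays elevated while total drops, a specific class of requests is still failing and is worth breaking down further (for example by operation_Name or resultCode); if failureRate falls along with total, the alert was likely tracking overall load.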

haruna99 reopened this on Feb 11, 2025