Alerts feature bugs and feedback #1186

Open
R4PH1 opened this issue Jan 21, 2025 · 10 comments

Comments

@R4PH1

R4PH1 commented Jan 21, 2025

Hi, thanks for the alerts feature in 3.17

First issue:

We use an internal mail relay, mail.domain.local, on port 25.
When I tried to send the test mail, I did not receive anything. I tried 3 different configurations.
Could it be that anonymous authentication is not working with dbadash?

Second issue:

I configured a rule for DriveSpace. I somehow ended up with an alert counter of 8, but only 4 alerts were visible.
I created the rule with the default values and then changed the threshold to 5000 (5GB, I assumed) without the percent checkbox.
I then tried to delete the rule and got a foreign key reference conflict. I saw there were 4 entries in Alert.ActiveAlerts which I didn't see in the GUI.

Third issue:

When I edited the rule the "Apply To (Tag)" {ALL} was gone/empty.

I hope this is reproducible on your side; otherwise I can try to provide some more details.

@DavidWiseman
Collaborator

Hi, thanks for providing feedback.

Issue 1:

I believe a fix is required for anonymous authentication to work. I think it's just a case of removing the call to AuthenticateAsync when there is no username/password supplied. I'll get this fixed.

The call in question is this one: await client.AuthenticateAsync(UserName, Password);
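
The idea behind the fix can be sketched as follows. This is only an illustration in Python with the standard smtplib module, not the actual C# code in DBA Dash. The function name send_test_mail and its parameters are hypothetical, and it assumes the relay accepts unauthenticated mail. The point is that login is only called when credentials are supplied.

```python
import smtplib
from email.message import EmailMessage


def send_test_mail(client, sender, recipient, username=None, password=None):
    """Send a test message, authenticating only when credentials are supplied.

    `client` is any object exposing smtplib.SMTP's `login` and `send_message`
    methods. All names here are illustrative, not DBA Dash identifiers.
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Test notification"
    msg.set_content("Test mail from the alert notification channel.")

    # Anonymous relay: skip authentication when no username/password is set.
    if username and password:
        client.login(username, password)
    client.send_message(msg)
    return msg


# Usage against an anonymous relay on port 25 (host taken from the report above):
# with smtplib.SMTP("mail.domain.local", 25) as client:
#     send_test_mail(client, "dbadash@domain.local", "dba@domain.local")
```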

Note: If you click the notification count link you should be able to see the error message associated with the failed notifications.

Issue 2:

Where did you see the alert counter of 8 - on the menu bar on the top right? This calls Alert.AlertCount_Get which is just doing a COUNT(*) from Alert.ActiveAlerts.
The alerts should display in the GUI if they are in the ActiveAlerts table. The tab does filter the list of instances though so if you are not at root level you might not see all the alerts. The tab might also need a refresh.

For deleting the rule - you will get the FK error if there are any active alerts referencing the rule. I think this can be improved by automatically closing any associated alerts.

Issue 3:

This is just a UI issue - the rule will still apply to all tags. I'll see if I can fix this in the next build though.

Thanks!

DavidWiseman added a commit to DavidWiseman/dba-dash that referenced this issue Jan 21, 2025
The Apply To (Tag) is set to {All} when creating a rule but when editing a rule it's blank. Fixed display issue so it shows as {All} when editing.
trimble-oss#1186
DavidWiseman added a commit to DavidWiseman/dba-dash that referenced this issue Jan 21, 2025
Close any existing alerts that reference a rule before deleting it.
trimble-oss#1186
@DavidWiseman
Collaborator

Hi, an updated build is available here.

This should fix the email issue. If you can, please test and let me know if it fixes the issue.

The tag issue and the FK error are also fixed, but the FK fix won't get deployed unless Deploy/Update Database is run from the service config tool. (The deploy only runs automatically if there is a version change.)

@R4PH1
Author

R4PH1 commented Jan 22, 2025

Thanks a lot for the quick fixes.

The email is now working as expected.

I now see why the counter on the top right has a different value from what is shown in the alerts menu. The counter also includes the instances that only appear with the "show hidden" option, but the GUI does not display them (as expected). I don't know what the perfect solution would be for this. For me, hidden instances should not report alerts, but they might be relevant for other people. Honestly, I had forgotten about them, as the hidden ones are mostly test servers / test instances. I could also work around that with tags. A more reliable solution could be a separate checkbox in the rule configuration to enable or disable the rule on hidden instances.

The foreign key conflict is now also resolved.

The UI issue is also resolved.

I noticed the history did not populate once an alert resolved by itself, but I guess I gave it too little time when I tested it.

I need to play around a bit more to fully understand the feature and find out what alerts make sense for me.

DavidWiseman added a commit that referenced this issue Jan 22, 2025
The Apply To (Tag) is set to {All} when creating a rule but when editing a rule it's blank. Fixed display issue so it shows as {All} when editing.
#1186
DavidWiseman added a commit that referenced this issue Jan 22, 2025
Close any existing alerts that reference a rule before deleting it.
#1186
@DavidWiseman
Collaborator

Hi, thanks for confirming that the email is working. 🎉

I think in most cases it would make sense to exclude hidden instances from generating alerts. I usually hide an instance when I'm in the process of provisioning it or when decommissioning an old one (where I want quick access to the old instance for a period of time but no longer want it to appear on the summary page).

An alert will only show up in the history after it's closed rather than resolved. Closing an alert moves it from the Alert.ActiveAlerts table to Alert.ClosedAlerts.

Keeping a resolved alert in the Alert.ActiveAlerts table allows you to see recent issues that have resolved but might require investigation. Also if an alert keeps going from resolved to active, this prevents the alert from sending too many notifications. Alerts will close automatically after 24hrs by default (configurable in Options\Repository Settings). If an alert is closed automatically (or manually) it should be visible in the history.
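
A rough model of this lifecycle, for illustration only: the Alert class and auto_close function below are hypothetical, not DBA Dash code. The sketch also assumes the 24 hours are measured from the time the alert resolved, which is not stated above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

AUTO_CLOSE_AFTER = timedelta(hours=24)  # default mentioned above; configurable


@dataclass
class Alert:
    key: str
    resolved_at: Optional[datetime] = None  # None means the alert is still firing


def auto_close(
    active: List[Alert],
    closed: List[Alert],
    now: datetime,
    after: timedelta = AUTO_CLOSE_AFTER,
) -> Tuple[List[Alert], List[Alert]]:
    """Move alerts resolved for at least `after` from the active list to closed.

    Firing alerts and recently resolved alerts stay active, so they remain
    visible on the alerts tab and are absent from the history.
    """
    still_active: List[Alert] = []
    closed = list(closed)  # do not mutate the caller's list
    for alert in active:
        expired = alert.resolved_at is not None and now - alert.resolved_at >= after
        (closed if expired else still_active).append(alert)
    return still_active, closed
```

In this model, an alert that resolved an hour ago is still in the active list, which matches the observation that the history stayed empty shortly after an alert resolved by itself.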

DavidWiseman added a commit to DavidWiseman/dba-dash that referenced this issue Jan 22, 2025
Add Apply To Hidden option to alert rules.  By default hidden instances will no longer generate alerts.  trimble-oss#1186
@R4PH1
Author

R4PH1 commented Jan 22, 2025

Thanks for the explanation.

I found another issue which I think is somewhere in [Alert].[DriveSpaceAlert_Upd].
When I try to set up an alert for DriveSpace with "Threshold Is Percentage?" set to False and a value like 5000, no alerts are generated. (Sadly, I definitely have drives with less than 5GB free in the Storage/Drives tab.)

Once I enable the percentage option and set a value of 20, new alerts come in.

Currently I am not sure if it is smart to edit existing rules at all, or whether I should just delete them and create new ones instead. I haven't had enough time to check the source code yet.

I think I messed around a bit too much as well. What would be the correct way to reset everything configured in alerts? Truncate all the tables in the Alert schema (while the collector is stopped)?

@DavidWiseman
Collaborator

I see the bug on the drive free space - I'll get it fixed. It's looking for drives with more than 5GB free instead of less. The status of the drive might prevent alerts for drives with >5GB free if the use critical status option is checked.

I have a fix that will exclude hidden instances by default. Rules will have an option to apply to hidden instances but I don't expect many people will need this.

If you want to reset everything, just delete the rules created in the GUI. Or just delete or edit the rules you no longer want. Deleting the rules will now clear the active alerts if you deployed the DB. If you just want to close the alerts that are resolved, use the option in the Actions menu.

DavidWiseman added a commit that referenced this issue Jan 22, 2025
Add Apply To Hidden option to alert rules.  By default hidden instances will no longer generate alerts.  #1186
DavidWiseman added a commit to DavidWiseman/dba-dash that referenced this issue Jan 22, 2025
Fix issue alerting on drive space when using MB threshold instead of %.
trimble-oss#1186
DavidWiseman added a commit that referenced this issue Jan 22, 2025
Fix issue alerting on drive space when using MB threshold instead of %.
#1186
@DavidWiseman
Collaborator

3.17.1 is now available which should fix the reported issues. As there is a version bump, the DB will be upgraded automatically when the service starts.

@R4PH1
Author

R4PH1 commented Jan 23, 2025

Thank you David! All those fixes now seem to be working really well.

Another issue I encountered, but am not entirely sure about:

If a limit is specified for the Drive Space alert, e.g. 20000 (percentage false), and Use critical status is enabled, previously shown alerts are somehow resolved if the configuration in Storage/Configure Root Thresholds is set to percentage.
Initially I had percentage values in the drive root thresholds, not in the alert rule. Once I switched the root thresholds to GB (10 alert / 20 warning), the alerts became active. I have not tried it with individual configuration of those thresholds per drive, which would also be possible. It might be a logic error in the code.

I hope you are still interested in my feedback as I can see more potential for the alerts feature.

The alert history can get full pretty quickly.
Some things which would be nice to have for it:

  • Automatic cleanup via your retention logic (sorry if this is already in place, I didn't see it in the data retention tab)
  • Pagination for the alert history window

I have not yet played around with the blackouts, but we have weekly maintenance windows, mostly on the weekends. Is it possible to configure a blackout without an end date for "infinite" repeats? As a workaround, I can still set the end date to the year 2099.
It would also be nice to be able to copy blackouts like the rules. (Copy would also be nice for the notification channels.)

Alerts for blocking are not optimal, or I haven't found the proper way to set them up yet.
This is what my workaround alert looks like at the moment:

[Image: screenshot of the workaround alert, not included]

What I would like to use is the blocked queries counter from the running queries summary, but evaluated over a timespan of about 2-5 minutes.

Another useful alert would be a check against the longest running query. It would be nice to see if someone runs queries for over half an hour, especially on production critical instances. I didn't find anything that quite fits in the counters or waits.

R4PH1 changed the title from "Alerts feature bugs" to "Alerts feature bugs and feedback" Jan 24, 2025
@DavidWiseman
Collaborator

If use critical status is true, an alert will only be generated if the status of the drive is critical based on the drive threshold configuration on the Drives tab. In addition to that critical status, the threshold on the alert must also be met. If use critical status is false, only the threshold on the alert is considered. e.g.

	WHERE (DS.Status = 1 OR T.UseCriticalStatus=0)
	AND (DS.PctFreeSpace <= (T.Threshold/100.0) OR T.Threshold IS NULL OR T.IsThresholdPercentage=0)
	AND (DS.FreeGB <= T.Threshold/1024.0 OR T.IsThresholdPercentage=1 OR T.Threshold IS NULL)

The use critical status option might be useful if you have a critical status set to 5% on the alert. You might have a 16TB drive with 800GB free space that falls just under the 5%. This drive might no longer be growing and you adjusted the critical drive threshold to 500GB. In this case, the drive won't alert if use critical status is true - even though it's under the 5%.
If a rule changes or you clear space on a drive so it's no longer in an alert status, the alert will automatically be resolved.
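
To make the predicate above easier to follow, here is a line-by-line translation into Python with the 16TB example worked through. This is a sketch under assumptions: the Drive and Rule classes are hypothetical stand-ins for the DS and T aliases, Status = 1 is taken to mean critical as the explanation implies, and NOT_CRITICAL is just an illustrative placeholder value. Per the query, PctFreeSpace is a fraction (0.05 for 5%) and a non-percentage threshold is in MB.

```python
from dataclasses import dataclass
from typing import Optional

CRITICAL = 1      # DS.Status = 1 in the excerpt above
NOT_CRITICAL = 0  # illustrative placeholder for any other status


@dataclass
class Drive:
    status: int      # status derived from the drive threshold configuration
    pct_free: float  # free space as a fraction, e.g. 0.05 for 5%
    free_gb: float


@dataclass
class Rule:
    threshold: Optional[float]  # percent, or MB when is_percentage is False
    is_percentage: bool
    use_critical_status: bool


def should_alert(d: Drive, r: Rule) -> bool:
    """Mirror the three AND-ed conditions of the WHERE clause."""
    status_ok = d.status == CRITICAL or not r.use_critical_status
    pct_ok = r.threshold is None or not r.is_percentage or d.pct_free <= r.threshold / 100.0
    gb_ok = r.threshold is None or r.is_percentage or d.free_gb <= r.threshold / 1024.0
    return status_ok and pct_ok and gb_ok


# 16TB drive with 800GB free: about 4.9% free, so under a 5% alert threshold.
# Its own critical threshold was lowered to 500GB, so its status is not critical.
big = Drive(status=NOT_CRITICAL, pct_free=800 / (16 * 1024), free_gb=800)
print(should_alert(big, Rule(5, True, use_critical_status=True)))   # False
print(should_alert(big, Rule(5, True, use_critical_status=False)))  # True
```

With a non-percentage rule, a threshold of 5000 is compared as 5000/1024, roughly 4.88GB, so a drive with 4GB free meets it and a drive with 6GB free does not.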

Ideally, alerts should only be generated when an urgent issue requires your attention. The alert history can fill up though - particularly as you are figuring out which alerts/thresholds work for your environment.

The alert history display is currently limited to 1000 rows in the GUI. The alerts tab at instance level will be filtered for that instance, so you will get a more complete history for an instance at this level. I might improve this with a configurable row limit & maybe some paging.

The alert history currently has no retention options. I added a notes feature to alerts which can be useful. If an alert comes up, you could check the history to see if the alert has occurred previously and what the RCA was. This is potentially valuable data that you might not want to purge. I'll probably add some retention settings at some point. It might default to never or some large number of days. It might have an option to exclude alerts with notes.
Note: Retention works for most things by truncating old partitions which is efficient but doesn't provide any fine grained control. The ClosedAlerts table currently isn't partitioned.

For blackout periods with infinite repeats, just set the date to something in the distant future as you suggested. I might consider extending the copy feature.

For blocking, you could also consider an alert based on wait type, using LCK% as the wait type to alert on. I did consider adding a blocking alert based off Running Queries - it would make sense to look at the most recent snapshot (if the blocking is not in the most recent snapshot, the issue is resolved). For lots of small blocking events, Waits would be more suitable. Long running queries could also be useful to alert on.

Thanks

DavidWiseman added a commit to DavidWiseman/dba-dash that referenced this issue Jan 28, 2025
Allow blackout period start/end dates to be NULL.  This makes it easier to configure recurring blackout periods.
trimble-oss#1186
DavidWiseman added a commit that referenced this issue Jan 28, 2025
Allow blackout period start/end dates to be NULL.  This makes it easier to configure recurring blackout periods.
#1186
DavidWiseman added a commit to DavidWiseman/dba-dash that referenced this issue Jan 28, 2025
Add options for data retention to Alert.ClosedAlert table.  Option to exclude alerts with notes from data retention.

trimble-oss#1186
DavidWiseman added a commit that referenced this issue Jan 29, 2025
Add options for data retention to Alert.ClosedAlert table.  Option to exclude alerts with notes from data retention.

#1186
@DavidWiseman
Collaborator

Some of the suggestions have been implemented in 3.17.2.

  • Null blackout period start/end dates
  • Data retention for ClosedAlerts table
