Feature Request: Cron monitoring #1805

Open
MasterPuffin opened this issue Feb 24, 2025 · 7 comments

Comments

@MasterPuffin

Hi,
unfortunately I couldn't find any mention of cron monitoring on the website, in the README, or in the issues, so I would like to ask whether cron monitoring or request monitoring (meaning Checkmate provides a REST endpoint that other tools can query) is possible or planned.

@gorkem-bwl
Contributor

@MasterPuffin can you please provide more information about how it works for you and others, by giving a very simple workflow together with a few use cases? Many thanks.

@ajhollid
Collaborator

@MasterPuffin indeed this is an interesting request, please let us know what you're thinking!

@MasterPuffin
Author

This is an example workflow, which would fit my use case:
I have a cron script that starts running every day at 4:00. In Checkmate I would create a new monitor for it and configure an expected time for the check-in. In return I would get a URL like https://mycheckmateinstance.com/endpoint/endpointid/

Then I would configure my cronjob as follows:
When the cronjob starts I call https://mycheckmateinstance.com/endpoint/endpointid/start/. This will return an id which I store in a variable for later use.
During the cronjob I call https://mycheckmateinstance.com/endpoint/endpointid/cronid/update/, for example when a section of my cronjob finishes.
Once the cronjob finishes I call https://mycheckmateinstance.com/endpoint/endpointid/cronid/end/ to let Checkmate know the cronjob finished successfully. Alternatively, I can call https://mycheckmateinstance.com/endpoint/endpointid/cronid/error/ from my cron job's error handler to report an issue.
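
A minimal sketch of how a job could wrap itself in these calls, written in Python. Everything here is an assumption for illustration: the endpoints are only proposed in this issue and do not exist in Checkmate, the start call is assumed to return the run id as its plain-text response body, and `monitored_run`, `CronRun`, and `http_get` are hypothetical names.

```python
from contextlib import contextmanager
from typing import Callable, Iterator
from urllib.request import urlopen

# Example URL from the workflow above; the endpoint itself is only a proposal.
BASE_URL = "https://mycheckmateinstance.com/endpoint/endpointid"


def http_get(url: str) -> str:
    """Send a GET request and return the response body as stripped text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8").strip()


class CronRun:
    """Handle for one run of the job, used to report intermediate progress."""

    def __init__(self, base_url: str, cron_id: str, send: Callable[[str], str]):
        self.base_url = base_url
        self.cron_id = cron_id
        self._send = send

    def update(self) -> None:
        """Report that a section of the job has finished."""
        self._send(f"{self.base_url}/{self.cron_id}/update/")


@contextmanager
def monitored_run(
    base_url: str = BASE_URL,
    send: Callable[[str], str] = http_get,
) -> Iterator[CronRun]:
    """Call start/ on entry, then end/ on success or error/ on an exception."""
    # Assumption: the body of the start response is the id of this run.
    cron_id = send(f"{base_url}/start/")
    run = CronRun(base_url, cron_id, send)
    try:
        yield run
    except Exception:
        send(f"{base_url}/{cron_id}/error/")
        raise
    else:
        send(f"{base_url}/{cron_id}/end/")
```

The job body would then sit inside `with monitored_run() as run:` and call `run.update()` after each section. The `send` parameter is injectable so the wrapper can be exercised without a real server.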

In Checkmate there would be different states for the cronjob:

  • Cronjob started on time and completed successfully
  • Cronjob started but timed out after the set timeout
  • Cronjob started but reported an error
  • Cronjob didn't start in the configured timeframe

The last three states would be considered an outage and would trigger a notification for example by email.
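
As a rough sketch, the four states could be derived from the recorded calls like this. The parameter names (`start_window`, `timeout`), the extra `PENDING` state for runs that have no verdict yet, and the choice to treat a late start as "didn't start in the configured timeframe" are my assumptions, not an existing Checkmate feature.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class RunState(Enum):
    OK = "started on time and completed successfully"
    TIMED_OUT = "started but timed out after the set timeout"
    ERRORED = "started but reported an error"
    NOT_STARTED = "didn't start in the configured timeframe"
    PENDING = "no verdict yet"  # assumption: not one of the four states above


def classify_run(
    now: datetime,
    expected_start: datetime,
    start_window: timedelta,
    timeout: timedelta,
    started_at: Optional[datetime] = None,
    ended_at: Optional[datetime] = None,
    errored: bool = False,
) -> RunState:
    """Classify one scheduled run from the calls received so far."""
    start_deadline = expected_start + start_window
    if started_at is None:
        # No start call yet: an outage only once the window has passed.
        return RunState.NOT_STARTED if now > start_deadline else RunState.PENDING
    if started_at > start_deadline:
        # Assumption: a start call after the window counts as not started.
        return RunState.NOT_STARTED
    if errored:
        return RunState.ERRORED
    run_deadline = started_at + timeout
    if ended_at is not None:
        return RunState.OK if ended_at <= run_deadline else RunState.TIMED_OUT
    return RunState.TIMED_OUT if now > run_deadline else RunState.PENDING


def is_outage(state: RunState) -> bool:
    """The last three states in the list count as an outage."""
    return state in {RunState.TIMED_OUT, RunState.ERRORED, RunState.NOT_STARTED}
```

For the 4:00 job, `expected_start` would be today at 4:00, and a notification would be sent whenever `is_outage(classify_run(...))` becomes true.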

Some kind of statistics would be interesting as well, for example past check-ins with timestamps, so I could examine whether one check-in stage takes an abnormal amount of time.
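
To illustrate the kind of statistic I mean, here is a small sketch that turns a list of check-in timestamps into per-stage durations and flags slow stages. The check-in labels, the `baseline` of typical durations, and the `factor` of 2 are all hypothetical.

```python
from datetime import datetime, timedelta


def stage_durations(checkins: list[tuple[str, datetime]]) -> dict[str, timedelta]:
    """Map each check-in label to the time elapsed since the previous check-in."""
    durations: dict[str, timedelta] = {}
    for (_, previous_time), (label, time) in zip(checkins, checkins[1:]):
        durations[label] = time - previous_time
    return durations


def slow_stages(
    durations: dict[str, timedelta],
    baseline: dict[str, timedelta],
    factor: float = 2.0,
) -> list[str]:
    """Return stages that took more than `factor` times their baseline duration."""
    return [
        label
        for label, duration in durations.items()
        if label in baseline and duration > baseline[label] * factor
    ]
```

The first check-in (the start call) has no previous check-in, so it gets no duration of its own.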

@MasterPuffin
Author

For use cases I have a few examples:

  • Clear logs every 2 hours. It would be important to know if this cron fails, e.g. because of missing permissions, as the storage could then fill up
  • Aggregate data from a database and send it to a user via email, e.g. for unread notifications. The job might fail at times because the send limit of the email provider has been reached
  • Query an endpoint, e.g. Stripe, for missed payments. The job might fail because API credentials are no longer valid or rate limits have been reached

@gorkem-bwl
Contributor

A few questions to understand the functionality better:

The last three states would be considered an outage and would trigger a notification for example by email.

  • In the "cronjob started and sent a started call, but didn't send a finished call" case, there would be a configuration that states how long the system has to wait before sending out a notification, right?

  • If a job doesn’t send a heartbeat within the expected timeframe the system should be alerting you. This should also be configurable, right?

  • The system has to provide logs of all job executions, including timestamps, durations, and exit codes, to help with debugging and auditing, right?

@MasterPuffin
Author

In the "cronjob started and sent a started call, but didn't send a finished call" case, there would be a configuration that states how long the system has to wait before sending out a notification, right?

Correct. However, I'm not sure whether there should be a difference between 'The cronjob has made an initial call but no subsequent calls' and 'The cronjob has made no call at all'.

If a job doesn’t send a heartbeat within the expected timeframe the system should be alerting you. This should also be configurable, right?

Correct. It would be nice to configure this on a monitor-by-monitor basis, e.g. for one monitor only notify when 7 calls have been missed, and for another monitor notify immediately.
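
A small sketch of such a per-monitor setting. The name `MissPolicy` is hypothetical, and I am assuming that the misses have to be consecutive and that the notification is sent once, at the moment the threshold is reached.

```python
from dataclasses import dataclass


@dataclass
class MissPolicy:
    """Per-monitor rule: notify after this many consecutive missed calls."""

    notify_after_misses: int = 1  # 1 means "notify immediately"
    consecutive_misses: int = 0

    def record(self, checked_in: bool) -> bool:
        """Record one expected call; return True if a notification is due now."""
        if checked_in:
            # Assumption: any successful call resets the counter.
            self.consecutive_misses = 0
            return False
        self.consecutive_misses += 1
        # Fire exactly once, when the threshold is reached.
        return self.consecutive_misses == self.notify_after_misses
```

With `MissPolicy(notify_after_misses=7)` the first six misses stay silent and the seventh triggers the notification, while the default policy notifies on the first miss.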

The system has to provide logs of all job executions, including timestamps, durations, and exit codes, to help with debugging and auditing, right?

In an ideal case yes, however simple monitoring without full history would be a very good first step.

@gorkem-bwl
Contributor

In an ideal case yes, however simple monitoring without full history would be a very good first step.

In fact, keeping a history of everything is much easier than creating settings for each cron monitor, such as a configuration for what to do if a particular cron job's call is expected but not received, or received but late.

For example, to provide a good solution for this, we need to classify each check as follows. For every state other than New and Up, the system should be able to send a notification to the admin:

  • New: A check that has been created but hasn't received any pings yet.
  • Up: The check is considered active as long as the time since the last ping remains within the defined "Period".
  • Late: The time since the last ping has exceeded Period, but it is still within the additional "Grace".
  • Down: The time since the last ping has surpassed both Period and Grace, marking the check as failed. When this happens, Healthchecks.io sends a notification.
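
A minimal sketch of this classification, assuming a check stores only the time of its last ping plus its Period and Grace. The function names and the lowercase status strings are hypothetical, not an existing Checkmate API.

```python
from datetime import datetime, timedelta
from typing import Optional


def check_status(
    now: datetime,
    last_ping: Optional[datetime],
    period: timedelta,
    grace: timedelta,
) -> str:
    """Classify a check as 'new', 'up', 'late', or 'down'."""
    if last_ping is None:
        return "new"  # created, but no ping received yet
    elapsed = now - last_ping
    if elapsed <= period:
        return "up"
    if elapsed <= period + grace:
        return "late"  # past Period, still within Grace
    return "down"  # past both Period and Grace


def should_notify(status: str) -> bool:
    """Every state other than New and Up should notify the admin."""
    return status in {"late", "down"}
```

For a job that pings once a day with a one-hour grace, a last ping 24.5 hours ago would be classified as late, and one 26 hours ago as down.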

Let's think about this and keep this issue open for now. It may require some changes in the backend (apart from cron monitoring related configs/API endpoint creation).

By the way, I'm starting to feel that calling it "cron monitoring" is too Linux crontab-specific. It can actually be configured to retrieve data from any source and check for errors or inconsistencies.

3 participants