@alchemydc

Basic Prometheus / Grafana Monitoring & Alerting for Z3

Summary

This PR adds baseline monitoring for the Z3 stack: Prometheus scrapes key Zebra metrics, and Grafana provides a dashboard for visualizing them, basic alerting for critical issues, and accompanying documentation.

The setup is easily extensible to support Zaino and Zallet metrics.

Key Changes

📊 Enhanced Grafana Dashboard

  • Metrics:
    • Network Traffic: Added Inbound/Outbound data rates (bytes/sec) and Message rates.
    • Mempool: Added Transaction count and Total size (bytes).
    • Verification: Added Proof verification rate.
    • Peers: Added Peer Count Gauge and User Agent distribution pie chart.

🚨 Alerting

  • Provisioned Alerts: Added alerting.yaml with two critical alerts:
    • Low Peer Count: Triggers if peers < 1 for 5 minutes.
    • Block Height Stalled: Triggers if block height doesn't change for 15 minutes.
  • Robustness: Configured alerts to trigger on No Data, ensuring notifications are sent even if the node goes down completely and metrics stop reporting.
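The two alert conditions and the No Data behaviour can be illustrated with a small model. This is a simplified sketch of the semantics described above, not Grafana's evaluation engine and not the contents of alerting.yaml. The function names, the `(timestamp, value)` sample shape, and the choice to treat a single sample as "stalled" are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# Hypothetical sample shape: (unix timestamp in seconds, metric value).
Sample = Tuple[float, float]


def _recent(samples: Sequence[Sample], now: float, window: float) -> List[float]:
    """Values of the samples that fall inside the last `window` seconds."""
    return [value for ts, value in samples if now - window <= ts <= now]


def low_peer_count(samples: Sequence[Sample], now: float,
                   window: float = 300.0, threshold: float = 1.0) -> bool:
    """Alert if every peer-count sample in the window is below the threshold.

    An empty window models the No Data case and also alerts.
    """
    recent = _recent(samples, now, window)
    if not recent:
        return True  # No Data: the node may be down entirely
    return all(value < threshold for value in recent)


def block_height_stalled(samples: Sequence[Sample], now: float,
                         window: float = 900.0) -> bool:
    """Alert if the block height did not change inside the window.

    An empty window models the No Data case and also alerts.
    """
    recent = _recent(samples, now, window)
    if not recent:
        return True  # No Data: the node may be down entirely
    return max(recent) == min(recent)
```

The defaults mirror the thresholds listed above (5 minutes for peers, 15 minutes for block height). Returning True on an empty window is what keeps the alert from staying silent when the node stops reporting.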

🔒 Security

  • Secure Defaults: Grafana defaults to admin/admin and forces a password change on first login. The new password is persisted in the grafana_data volume.

📚 Documentation

  • Updated README.md:
    • Added Monitoring section with dashboard access and metric details.
    • Added Alerting section with configuration guides for Contact Points.
    • Added Local Directory Support guide for prometheus_data and grafana_data.
    • Updated Docker Images and Volumes tables.
    • Added instructions for disabling monitoring.

How to Test

  1. Start the Stack: docker compose up -d
  2. Access Grafana: Go to http://localhost:3000. Log in with admin/admin and set a new password.
  3. Verify Dashboard: Check the Zebra Status dashboard. Confirm all panels (Network, Mempool, Peers) are populating with data.
  4. Verify Alerts:
    • Go to Alerting -> Alert rules.
    • Stop the Zebra container: docker compose stop zebra.
    • Wait ~5 minutes. Verify that the alerts transition from Normal to Alerting (or "Firing").
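Part of the manual check above could be scripted. The sketch below reads Grafana's `/api/health` endpoint and runs the standard Prometheus `up` query through the HTTP API to list scrape targets that are down. It assumes Prometheus is published on the host at its default port 9090, which this PR description does not state. The helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

GRAFANA_URL = "http://localhost:3000"     # from the test steps above
PROMETHEUS_URL = "http://localhost:9090"  # assumption: default port, published on the host


def grafana_db_ok(health: dict) -> bool:
    """Grafana's /api/health payload reports "database": "ok" when healthy."""
    return health.get("database") == "ok"


def targets_down(query_response: dict) -> list:
    """Instance labels whose `up` sample is 0 in an instant-query response."""
    results = query_response.get("data", {}).get("result", [])
    return [
        r["metric"].get("instance", "<unknown>")
        for r in results
        if float(r["value"][1]) == 0.0
    ]


def fetch_json(url: str, timeout: float = 5.0) -> dict:
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def check_stack() -> None:
    """Print a short health summary; requires the stack to be running."""
    health = fetch_json(f"{GRAFANA_URL}/api/health")
    print("Grafana database ok:", grafana_db_ok(health))

    query = urllib.parse.urlencode({"query": "up"})
    response = fetch_json(f"{PROMETHEUS_URL}/api/v1/query?{query}")
    print("Scrape targets down:", targets_down(response) or "none")
```

Calling `check_stack()` after `docker compose up -d` should report no targets down. After `docker compose stop zebra`, the Zebra target would be expected to appear in the list once Prometheus records a failed scrape.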

Pics or GTFO

[Two screenshots attached to the PR; images not preserved in this text.]

…ring prometheus and grafana data in custom paths rather than (default) docker volumes.
…ock height not increasing over time, update Grafana configuration and documentation