Monitoring & Observability #201

@jmgilman

Monitoring & Observability

Overview

Implement comprehensive monitoring and observability features including Prometheus metrics, health check endpoints, structured logging, and audit logging to ensure operational visibility and system reliability.

Objective

Create a robust monitoring system that provides real-time insights into service health, performance metrics, and security events while maintaining detailed audit trails for compliance and troubleshooting.

Canonical Scope

  • This document is the canonical source for:
    • Structured logging approach and helpers
    • Request logging middleware behavior
    • Health and readiness endpoints, semantics, and status criteria
    • Metrics definitions and exposure
  • For audit storage and retention, see 05 Database Layer. For validation and error schema, see 07 Security & Validation.

Tasks

Prometheus Metrics Implementation

  • Set up Prometheus metrics in internal/monitoring/metrics.go
  • Implement request metrics:
    var (
      RequestsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
          Name: "certificate_api_requests_total",
          Help: "Total number of API requests",
        },
        []string{"method", "status"},
      )
    
      RequestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
          Name: "certificate_api_request_duration_seconds",
          Help: "Request duration in seconds",
          Buckets: prometheus.DefBuckets,
        },
        // Percentiles come from the buckets at query time; they are not a label.
        []string{"method", "status"},
      )
    )
  • Implement certificate metrics:
    ActiveCertificates = prometheus.NewGaugeVec(
      prometheus.GaugeOpts{
        Name: "certificate_api_active_certificates",
        Help: "Number of active certificates",
      },
      []string{"ca"},
    )
    
    CertificatesExpiringSoon = prometheus.NewGaugeVec(
      prometheus.GaugeOpts{
        Name: "certificate_api_certificates_expiring_soon",
        Help: "Number of certificates expiring soon",
      },
      []string{"days", "ca"},
    )
    
    CertificatesIssuedTotal = prometheus.NewCounterVec(
      prometheus.CounterOpts{
        Name: "certificate_api_certificates_issued_total",
        Help: "Total number of certificates issued",
      },
      []string{"profile", "ca"},
    )
    
    CertificatesRenewedTotal = prometheus.NewCounterVec(
      prometheus.CounterOpts{
        Name: "certificate_api_certificates_renewed_total",
        Help: "Total number of certificates renewed",
      },
      []string{"profile", "ca"},
    )
  • Implement system health metrics:
    ServiceUp = prometheus.NewGauge(
      prometheus.GaugeOpts{
        Name: "certificate_api_up",
        Help: "Service availability (1 = up, 0 = down)",
      },
    )
    
    DatabaseConnectionsActive = prometheus.NewGauge(
      prometheus.GaugeOpts{
        Name: "certificate_api_database_connections_active",
        Help: "Number of active database connections",
      },
    )
    
    PCAAPICalls = prometheus.NewCounterVec(
      prometheus.CounterOpts{
        Name: "certificate_api_pca_api_calls_total",
        Help: "Total number of AWS PCA API calls",
      },
      []string{"operation", "status"},
    )
    
    PCAAPILatency = prometheus.NewHistogramVec(
      prometheus.HistogramOpts{
        Name: "certificate_api_pca_api_latency_seconds",
        Help: "AWS PCA API call latency",
        Buckets: prometheus.DefBuckets,
      },
      // Percentiles come from the buckets at query time; they are not a label.
      []string{"operation", "status"},
    )
  • Implement security metrics:
    AuthenticationFailures = prometheus.NewCounterVec(
      prometheus.CounterOpts{
        Name: "certificate_api_authentication_failures_total",
        Help: "Total number of authentication failures",
      },
      []string{"reason"},
    )
  • Register all metrics with Prometheus registry
  • Create metrics endpoint on port 9090 as specified

Health Check Implementation

  • Implement health check endpoint in internal/monitoring/health.go
  • Create health check handler at /health:
    type HealthCheck struct {
      Status    string                 `json:"status"`
      Checks    map[string]CheckResult `json:"checks"`
      Timestamp time.Time              `json:"timestamp"`
    }
    
    type CheckResult struct {
      Status  string `json:"status"`
      Message string `json:"message,omitempty"`
    }
  • Implement health check criteria:
    • Database connectivity: Connection pool has available connections
    • AWS PCA connectivity: Can list CAs successfully
    • JWKS cache status: Cache populated and not expired
    • Certificate expiry: No CA certificates expiring within 30 days
  • Return appropriate response codes:
    • 200: All checks healthy
    • 503: Any check unhealthy
  • Implement readiness check endpoint at /ready
  • Ensure health checks complete in <100ms as specified
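
The criteria above can be sketched with the standard library alone. This is a minimal sketch, not the final design: the `Checker` signature, the `RunChecks` and `Handler` names, and the "timed out" message are assumptions made for illustration. Every probe runs concurrently under one shared 100ms deadline, and any probe that fails or does not answer in time turns the response into a 503.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"time"
)

// CheckResult and HealthCheck mirror the structs in the task list.
type CheckResult struct {
	Status  string `json:"status"`
	Message string `json:"message,omitempty"`
}

type HealthCheck struct {
	Status    string                 `json:"status"`
	Checks    map[string]CheckResult `json:"checks"`
	Timestamp time.Time              `json:"timestamp"`
}

// Checker is a hypothetical signature for one dependency probe.
type Checker func(ctx context.Context) error

// checkBudget is the overall deadline, taken from the <100ms target.
const checkBudget = 100 * time.Millisecond

// RunChecks runs all probes concurrently and returns the report together
// with the HTTP status code to send (200 or 503).
func RunChecks(parent context.Context, checks map[string]Checker) (HealthCheck, int) {
	ctx, cancel := context.WithTimeout(parent, checkBudget)
	defer cancel()

	type named struct {
		name string
		res  CheckResult
	}
	results := make(chan named, len(checks)) // buffered: late probes never block
	for name, check := range checks {
		go func(name string, check Checker) {
			res := CheckResult{Status: "healthy"}
			if err := check(ctx); err != nil {
				res = CheckResult{Status: "unhealthy", Message: err.Error()}
			}
			results <- named{name, res}
		}(name, check)
	}

	report := HealthCheck{Status: "healthy", Checks: map[string]CheckResult{}, Timestamp: time.Now().UTC()}
	for range checks {
		select {
		case r := <-results:
			report.Checks[r.name] = r.res
		case <-ctx.Done(): // budget spent; stop waiting
		}
	}

	code := http.StatusOK
	for name := range checks {
		r, ok := report.Checks[name]
		if !ok { // probe did not answer within the budget
			r = CheckResult{Status: "unhealthy", Message: "timed out"}
			report.Checks[name] = r
		}
		if r.Status != "healthy" {
			report.Status, code = "unhealthy", http.StatusServiceUnavailable
		}
	}
	return report, code
}

// Handler exposes RunChecks as an HTTP endpoint such as /health.
func Handler(checks map[string]Checker) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		report, code := RunChecks(r.Context(), checks)
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(code)
		_ = json.NewEncoder(w).Encode(report)
	}
}

// demoCodes runs one all-healthy set and one set with a failing probe.
func demoCodes() (healthy, degraded int) {
	ok := func(context.Context) error { return nil }
	bad := func(context.Context) error { return errors.New("connection refused") }
	_, healthy = RunChecks(context.Background(), map[string]Checker{"database": ok, "jwks": ok})
	_, degraded = RunChecks(context.Background(), map[string]Checker{"database": ok, "pca": bad})
	return healthy, degraded
}

func main() {
	h, d := demoCodes()
	fmt.Println(h, d)
}
```

Collecting results through a buffered channel, rather than waiting on every goroutine, keeps the handler inside its budget even when a probe ignores the context and keeps running.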

Structured Logging with slog

  • Implement structured JSON logging using standard library slog
  • Configure slog in internal/logging/logger.go:
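    // NewLogger accepts any slog.Leveler, so a fixed slog.Level and a
    // runtime-adjustable *slog.LevelVar both work.
    func NewLogger(level slog.Leveler) *slog.Logger {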
      opts := &slog.HandlerOptions{
        Level: level,
        AddSource: false,
      }
      handler := slog.NewJSONHandler(os.Stdout, opts)
      return slog.New(handler)
    }
  • Create context-aware logging helpers:
    func LoggerWithRequestID(logger *slog.Logger, requestID string) *slog.Logger {
      return logger.With("request_id", requestID)
    }
    
    func LoggerWithComponent(logger *slog.Logger, component string) *slog.Logger {
      return logger.With("component", component)
    }
  • Include standard attributes in log entries:
    • Timestamp (automatic with slog)
    • Level (debug, info, warn, error)
    • Request ID (via context)
    • Component/module
    • Message
  • Implement request logging middleware using slog:
    • Log request start with method, path, client IP
    • Log request completion with status, duration
    • Include request ID for correlation
  • Configure log levels via configuration (default: info):
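    // LevelVar defaults to Info; pass it as HandlerOptions.Level so Set takes effect at runtime.
    var logLevel = new(slog.LevelVar)
    // Call from a function, e.g. after loading configuration:
    logLevel.Set(slog.LevelInfo)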
  • Ensure sensitive data is not logged (tokens, keys, etc.)
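
The request logging and redaction items above could look roughly like the following. It is a sketch under stated assumptions: it uses plain `net/http` instead of Gin to stay self-contained, reads the request ID from an `X-Request-ID` header, and treats the keys in `sensitiveKeys` as the sensitive set. None of these choices come from the issue itself. Redaction is done with the `ReplaceAttr` hook of `slog.HandlerOptions`, so it applies to every log call that goes through the handler.

```go
package main

import (
	"bytes"
	"fmt"
	"log/slog"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// sensitiveKeys is an illustrative list of attribute names to redact.
var sensitiveKeys = map[string]bool{"token": true, "authorization": true, "private_key": true}

// redact replaces the value of any sensitive attribute before it is written.
func redact(_ []string, a slog.Attr) slog.Attr {
	if sensitiveKeys[strings.ToLower(a.Key)] {
		return slog.String(a.Key, "[REDACTED]")
	}
	return a
}

// statusRecorder captures the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// RequestLogger logs request start and completion, tagging both entries
// with the same request_id so they can be correlated.
func RequestLogger(logger *slog.Logger, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		reqLog := logger.With("request_id", r.Header.Get("X-Request-ID"))
		start := time.Now()
		reqLog.Info("request started", "method", r.Method, "path", r.URL.Path, "client_ip", r.RemoteAddr)

		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, r)

		reqLog.Info("request completed", "status", rec.status, "duration_ms", time.Since(start).Milliseconds())
	})
}

// demo sends one request through the middleware and inspects the log output.
func demo() (hasRequestID, leakedSecret bool) {
	var buf bytes.Buffer
	opts := &slog.HandlerOptions{Level: slog.LevelInfo, ReplaceAttr: redact}
	logger := slog.New(slog.NewJSONHandler(&buf, opts))

	h := RequestLogger(logger, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		logger.Info("issuing certificate", "token", "secret-value") // value is redacted
		w.WriteHeader(http.StatusCreated)
	}))
	req := httptest.NewRequest(http.MethodPost, "/certificates", nil)
	req.Header.Set("X-Request-ID", "req-123")
	h.ServeHTTP(httptest.NewRecorder(), req)

	out := buf.String()
	return strings.Contains(out, `"request_id":"req-123"`), strings.Contains(out, "secret-value")
}

func main() {
	hasID, leaked := demo()
	fmt.Println("request id logged:", hasID, "secret leaked:", leaked)
}
```

Key-based redaction only catches attributes logged under a known name. A secret embedded in a message string or logged under an unexpected key would still get through, so it complements code review rather than replacing it.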

Audit Logging System

  • Implement audit logging in internal/monitoring/audit.go
  • Create audit logger that writes to database:
    type AuditLogger struct {
      repository AuditRepository
      logger     *slog.Logger
    }
  • Log the following audit events:
    • Certificate issued/renewed/requested
    • Authentication failures
    • Authorization failures
    • Admin operations
    • System errors
    • CA certificate refresh operations
    • CA certificate expiry warnings
  • Ensure audit log format includes:
    • Timestamp
    • Actor identity
    • Actor IP address
    • Resource type and identifier
    • Action/event type
    • Outcome (success/failure)
    • Additional details in JSONB
  • Implement audit log retention (1 year minimum)
  • Ensure audit logs are append-only (no updates or deletes)
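
One way to make the no-update, no-delete rule hard to violate is to give the repository interface a single insert method. The sketch below assumes a hypothetical `AuditRepository` with only `Append`, an illustrative `AuditEvent` shape covering the fields listed above, and an in-memory stand-in for the database. The real interface and storage are defined elsewhere (see 05 Database Layer).

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log/slog"
	"os"
	"sync"
	"time"
)

// AuditEvent carries the fields required by the audit log format.
// Field names are illustrative.
type AuditEvent struct {
	Timestamp    time.Time
	ActorID      string
	ActorIP      string
	ResourceType string
	ResourceID   string
	Action       string
	Outcome      string         // "success" or "failure"
	Details      map[string]any // would map to a JSONB column
}

// AuditRepository is a hypothetical interface that deliberately offers
// no Update or Delete method.
type AuditRepository interface {
	Append(ctx context.Context, e AuditEvent) error
}

type AuditLogger struct {
	repository AuditRepository
	logger     *slog.Logger
}

// Record validates the event, stamps it if needed, and appends it.
func (a *AuditLogger) Record(ctx context.Context, e AuditEvent) error {
	if e.Action == "" || e.Outcome == "" {
		return errors.New("audit event requires action and outcome")
	}
	if e.Timestamp.IsZero() {
		e.Timestamp = time.Now().UTC()
	}
	if err := a.repository.Append(ctx, e); err != nil {
		// Surface storage failures in the application log as well.
		a.logger.Error("audit write failed", "action", e.Action, "error", err)
		return fmt.Errorf("append audit event: %w", err)
	}
	return nil
}

// memoryRepo is an in-memory stand-in for the database-backed repository.
type memoryRepo struct {
	mu     sync.Mutex
	events []AuditEvent
}

func (m *memoryRepo) Append(_ context.Context, e AuditEvent) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.events = append(m.events, e)
	return nil
}

// demo records one valid event and one incomplete event.
func demo() (stored int, rejected bool) {
	repo := &memoryRepo{}
	audit := &AuditLogger{repository: repo, logger: slog.New(slog.NewJSONHandler(os.Stderr, nil))}
	ctx := context.Background()

	_ = audit.Record(ctx, AuditEvent{
		ActorID: "svc-a", ActorIP: "10.0.0.5",
		ResourceType: "certificate", ResourceID: "cert-1",
		Action: "certificate.issued", Outcome: "success",
	})
	err := audit.Record(ctx, AuditEvent{ActorID: "svc-a"}) // missing action and outcome
	return len(repo.events), err != nil
}

func main() {
	stored, rejected := demo()
	fmt.Println("stored:", stored, "rejected incomplete event:", rejected)
}
```

An interface restriction only protects against application code. Database-level permissions would still be needed to enforce the rule against direct access.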

Metrics Collection Jobs

  • Create background job to collect certificate metrics:
    func CollectCertificateMetrics(repo CertificateRepository) {
      // Run every 5 minutes
      // Count active certificates per CA
      // Count expiring certificates (7, 3, 1 days)
      // Update Prometheus gauges
    }
  • Monitor database connection pool metrics
  • Track AWS PCA API call patterns
  • Monitor CA certificate expiration dates
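
The collection job outlined above can be split into a pure counting step and a ticker loop, which keeps the counting logic testable with a fixed clock. This is a sketch with assumed names: `Certificate`, `Snapshot`, `Summarize`, and `Run` are hypothetical, and the `publish` callback stands in for updating the Prometheus gauges. The expiry windows are treated as cumulative here, so a certificate expiring in 12 hours counts toward the 7, 3, and 1 day windows.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Certificate is a hypothetical minimal view of a stored certificate.
type Certificate struct {
	CA       string
	NotAfter time.Time
}

// Snapshot holds the values that would be written to the gauges.
type Snapshot struct {
	Active   map[string]int         // CA -> active certificates
	Expiring map[string]map[int]int // CA -> window in days -> count
}

// windows are the expiry horizons from the task list, in days.
var windows = []int{7, 3, 1}

// Summarize counts active and soon-to-expire certificates per CA.
func Summarize(certs []Certificate, now time.Time) Snapshot {
	s := Snapshot{Active: map[string]int{}, Expiring: map[string]map[int]int{}}
	for _, c := range certs {
		if !c.NotAfter.After(now) {
			continue // already expired, so not active
		}
		s.Active[c.CA]++
		for _, d := range windows {
			if c.NotAfter.Sub(now) <= time.Duration(d)*24*time.Hour {
				if s.Expiring[c.CA] == nil {
					s.Expiring[c.CA] = map[int]int{}
				}
				s.Expiring[c.CA][d]++
			}
		}
	}
	return s
}

// Run publishes a snapshot immediately and then once per interval
// (5 minutes in the task list) until the context is cancelled.
func Run(ctx context.Context, interval time.Duration, load func() []Certificate, publish func(Snapshot)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		publish(Summarize(load(), time.Now()))
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}

// demo summarizes a fixed data set against a fixed clock.
func demo() (active, within7, within1 int) {
	now := time.Date(2025, 1, 1, 0, 0, 0, 0, time.UTC)
	certs := []Certificate{
		{CA: "root-a", NotAfter: now.Add(12 * time.Hour)},
		{CA: "root-a", NotAfter: now.Add(5 * 24 * time.Hour)},
		{CA: "root-a", NotAfter: now.Add(60 * 24 * time.Hour)},
		{CA: "root-b", NotAfter: now.Add(-time.Hour)}, // already expired
	}
	s := Summarize(certs, now)
	return s.Active["root-a"], s.Expiring["root-a"][7], s.Expiring["root-a"][1]
}

func main() {
	active, within7, within1 := demo()
	fmt.Println(active, within7, within1)
}
```

Running the loop in its own goroutine with a cancellable context keeps collection off the request path and lets it stop cleanly on shutdown.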

Performance Monitoring

  • Add request timing to all API endpoints
  • Track database query performance
  • Monitor AWS PCA API latency
  • Implement performance targets verification:
    • Certificate issuance: <2 seconds (95th percentile)
    • Certificate status lookup: <500ms (95th percentile)
    • Health check: <100ms
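
For load tests, the percentile targets above can be checked directly against raw latency samples. The sketch below uses the nearest-rank definition of a percentile. The `Target` type and the `Percentile` and `Meets` helpers are hypothetical names. In production the same figures would more likely be read from the Prometheus histograms with a `histogram_quantile` query, which estimates the value from bucket boundaries rather than from exact samples.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Target describes one performance objective, e.g. p95 under 2 seconds.
type Target struct {
	Name       string
	Percentile int // 1..100
	Limit      time.Duration
}

// Percentile returns the nearest-rank percentile of the samples.
// It returns 0 for empty input or an out-of-range percentile.
func Percentile(samples []time.Duration, p int) time.Duration {
	if len(samples) == 0 || p <= 0 || p > 100 {
		return 0
	}
	sorted := append([]time.Duration(nil), samples...) // leave the input untouched
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	rank := (p*len(sorted) + 99) / 100 // integer form of ceil(p*N/100)
	return sorted[rank-1]
}

// Meets reports whether the samples satisfy the target.
func Meets(samples []time.Duration, t Target) bool {
	return len(samples) > 0 && Percentile(samples, t.Percentile) < t.Limit
}

// demo builds 20 samples: 18 fast requests, one at 1.9s, one at 5s.
func demo() (p95Millis int64, ok bool) {
	samples := make([]time.Duration, 0, 20)
	for i := 0; i < 18; i++ {
		samples = append(samples, 100*time.Millisecond)
	}
	samples = append(samples, 1900*time.Millisecond, 5*time.Second)

	target := Target{Name: "certificate issuance", Percentile: 95, Limit: 2 * time.Second}
	return Percentile(samples, target.Percentile).Milliseconds(), Meets(samples, target)
}

func main() {
	p95, ok := demo()
	fmt.Printf("p95=%dms meets target: %v\n", p95, ok)
}
```

With 20 samples the 95th percentile is the 19th value in sorted order, so the single 5 second outlier does not break the target while two such outliers would.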

Alerting Configuration

  • Define alert rules for Prometheus:
    • Service down (certificate_api_up == 0)
    • High error rate (>1% of requests failing)
    • CA certificate expiring soon (<30 days)
    • Database connection pool exhausted
    • Authentication failure spike
    • AWS PCA API errors
  • Document alert thresholds and escalation paths

Gin Middleware Integration

  • Create Prometheus middleware for Gin:
    func PrometheusMiddleware() gin.HandlerFunc {
      return func(c *gin.Context) {
        start := time.Now()
    
        c.Next()
    
        duration := time.Since(start)
        status := strconv.Itoa(c.Writer.Status())
    
        RequestsTotal.WithLabelValues(c.Request.Method, status).Inc()
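        // Label the observation with the response status; percentiles are computed from the buckets at query time.
        RequestDuration.WithLabelValues(c.Request.Method, status).Observe(duration.Seconds())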
      }
    }
  • Integrate with existing Gin router
  • Ensure metrics don't impact request performance

Acceptance Criteria

  • All Prometheus metrics implemented and exposed
  • Health check endpoint returning correct status
  • Health check completes in <100ms
  • Structured JSON logging working
  • Audit events logged to database
  • Metrics endpoint accessible on port 9090
  • Request tracing with request IDs
  • No sensitive data in logs
  • Performance metrics tracking accurately
  • Background metrics collection running

Technical Considerations

  • Use official Prometheus Go client library
  • Implement efficient metrics collection (avoid blocking)
  • Use appropriate metric types (Counter, Gauge, Histogram)
  • Ensure metric cardinality is controlled
  • Use context for request-scoped logging
  • Consider log aggregation requirements
  • Implement graceful shutdown for metrics server
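
Controlling cardinality mostly comes down to mapping unbounded inputs onto a small, fixed set of label values before they reach a metric. The helpers below are a sketch of that idea. `StatusClass`, `BoundedLabel`, and the example reason names are assumptions for illustration, not part of the metric definitions above.

```go
package main

import "fmt"

// StatusClass collapses an HTTP status code into "1xx".."5xx",
// or "unknown" for anything outside the valid range.
func StatusClass(code int) string {
	if code < 100 || code > 599 {
		return "unknown"
	}
	return fmt.Sprintf("%dxx", code/100)
}

// BoundedLabel returns value if it is in the allowed set and "other"
// otherwise, so callers cannot create new time series by accident.
func BoundedLabel(value string, allowed map[string]bool) string {
	if allowed[value] {
		return value
	}
	return "other"
}

func main() {
	// Hypothetical set of authentication failure reasons.
	reasons := map[string]bool{"expired_token": true, "invalid_signature": true}

	fmt.Println(StatusClass(404))
	fmt.Println(BoundedLabel("expired_token", reasons))
	fmt.Println(BoundedLabel("user-supplied-text", reasons))
}
```

The same approach applies to request paths: labelling by route template rather than by raw URL keeps identifiers such as certificate serial numbers out of label values.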

Dependencies

Testing Requirements

  • Unit tests for metrics collection
  • Unit tests for health checks
  • Unit tests for audit logging
  • Integration tests for metrics endpoint
  • Test metric accuracy under load
  • Test health check failure scenarios
  • Verify audit log completeness
  • Test log output format
  • Performance impact testing

Definition of Done

  • Code reviewed and approved
  • All tests passing
  • Metrics documented
  • Grafana dashboards created (if applicable)
  • Alert rules configured
  • Logging standards documented
  • No performance regression
  • Audit trail verified complete
