Solace Cache Helm Chart - Implementation Context

Chart Version: 0.1.0
App Version: 1.0.11
Created: April 2026
Purpose: Production-ready Helm chart for deploying Solace Cache instances (Linux C API application)


Architecture Overview

Design Philosophy: "Pets, Not Cattle"

  • These cache instances are stateful "pets" that must maintain maximum uptime
  • Each pod has a unique identity and specific configuration
  • Focus on HA protection and graceful handling of disruptions
  • Use StatefulSet for stable pod identities and ordered deployment

Key Components

  1. StatefulSet - Main workload (not Deployment)
  2. Headless Service - DNS resolution for pods (no inbound ports needed)
  3. ConfigMap - Template config with placeholders for per-pod substitution
  4. Secret - Broker credentials (username/password)
  5. PodDisruptionBudget - Prevents voluntary disruptions from affecting availability
  6. Init Container - Generates unique config per pod before main container starts

Critical Implementation Details

1. Per-Pod Configuration Strategy

Problem: Each cache instance needs a unique CACHE_INSTANCE_NAME that matches the broker's distributed cache configuration.

Solution: A busybox init container extracts the pod ordinal and selects from the instanceNames list (passed in by Helm as a space-separated string), then substitutes placeholders in the config template. Credentials are read per-ordinal from a mounted secret. No yq dependency — plain POSIX shell:

# In init container (see templates/statefulset.yaml)
POD_ORDINAL=$(echo "$HOSTNAME" | grep -o '[0-9]*$')
INSTANCE_NAMES="{{ join " " .Values.solaceCache.instanceNames }}"
INSTANCE_NAME=$(echo $INSTANCE_NAMES | cut -d' ' -f$((POD_ORDINAL + 1)))
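The lookup above returns an empty string when the ordinal exceeds the number of entries in instanceNames. A minimal sketch of the same selection with a bounds check follows; pick_instance_name is a hypothetical helper for illustration, not a function from the chart.

```shell
#!/bin/sh
# Sketch only: pick_instance_name is a hypothetical helper, not chart code.
# Usage: pick_instance_name HOSTNAME "name0 name1 ..."
pick_instance_name() {
  hostname=$1
  names=$2                      # space-separated, as rendered by Helm
  ordinal=${hostname##*-}       # text after the last '-', e.g. "1"
  case $ordinal in
    ''|*[!0-9]*)
      echo "no ordinal in hostname: $hostname" >&2
      return 1 ;;
  esac
  set -- $names                 # split the list into positional parameters
  if [ "$ordinal" -ge "$#" ]; then
    echo "ordinal $ordinal has no entry in instanceNames ($# given)" >&2
    return 1
  fi
  shift "$ordinal"              # drop the entries before this pod's entry
  printf '%s\n' "$1"
}
```

Failing the init container here, rather than writing an empty CACHE_INSTANCE_NAME, would surface a replica count that outgrew instanceNames as an init error instead of a misconfigured cache.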

Substitution is done with awk, not sed: the values (instance name plus __SESSION_USERNAME__ / __SESSION_PASSWORD__ read per-ordinal from the mounted secret) are exported into the environment and replaced with a literal substring subst() function. This avoids sed mangling credentials that contain its delimiter, the & replacement metachar, or a backslash:

# values passed via ENVIRON, replaced literally (no regex/delimiter interpretation)
$0 = subst($0, "__CACHE_INSTANCE_NAME__", ENVIRON["INSTANCE_NAME"])
$0 = subst($0, "__SESSION_USERNAME__",    ENVIRON["USERNAME"])
$0 = subst($0, "__SESSION_PASSWORD__",    ENVIRON["PASSWORD"])
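The subst() function itself is not shown above. One way to write it, as a sketch under the assumption that it is built from index() and substr() (the chart's actual implementation may differ), is the following; render_config is a hypothetical wrapper name.

```shell
#!/bin/sh
# Sketch only: render_config is a hypothetical name, and this subst() is one
# possible implementation of the literal replacement described above.
# Reads the template on stdin; values come from exported environment variables.
render_config() {
  awk '
    # Replace every literal occurrence of "from" in "s" with "to".
    # index() and substr() treat both strings as plain text, so "&", "/"
    # and "\" in the replacement are copied through unchanged.
    function subst(s, from, to,    out, i) {
      out = ""
      while ((i = index(s, from)) > 0) {
        out = out substr(s, 1, i - 1) to
        s = substr(s, i + length(from))
      }
      return out s
    }
    {
      $0 = subst($0, "__CACHE_INSTANCE_NAME__", ENVIRON["INSTANCE_NAME"])
      $0 = subst($0, "__SESSION_USERNAME__",    ENVIRON["USERNAME"])
      $0 = subst($0, "__SESSION_PASSWORD__",    ENVIRON["PASSWORD"])
      print
    }
  '
}
```

Passing the values through ENVIRON rather than awk -v matters for the same reason: -v processes backslash escape sequences in the assigned value, while ENVIRON delivers the bytes as they are.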

2. Signal Handling for Fast Shutdown

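Problem: The container was taking 30 seconds to shut down. SIGTERM was not being acted on, so the pod waited out the termination grace period (30 seconds is the Kubernetes default) before being killed.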

Solution: The container runs a wrapper script (templates/wrapper-script-configmap.yaml, mounted at /scripts/cache-wrapper.sh) that launches SolaceCache in the background and installs a trap on TERM/INT to forward SIGTERM to the cache process and wait for it to exit cleanly:

trap "kill -TERM ${CACHE_PID}; wait ${CACHE_PID}; ...; exit 0" TERM INT

The wrapper also tails the cache logs to drive the readiness probe (see below). Result: Shutdown time reduced from 30s to ~2s.

Note: an earlier approach used exec so SolaceCache became PID 1 directly. That was replaced by the wrapper once we needed log-watching for readiness; the wrapper now owns PID 1 and is responsible for signal forwarding.
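The forwarding pattern can be sketched in isolation as below. This is not the chart's cache-wrapper.sh: run_forwarding is a hypothetical function, and it omits the log watching and whatever cleanup the real trap performs.

```shell
#!/bin/sh
# Sketch only: run_forwarding is a hypothetical helper illustrating the
# background-child + trap + wait pattern, not the chart's wrapper script.
# Usage: run_forwarding COMMAND [ARGS...]
run_forwarding() {
  "$@" &                        # start the real process in the background
  child=$!
  # Single quotes defer expansion of $child until the signal arrives.
  trap 'kill -TERM "$child" 2>/dev/null; wait "$child"; exit 0' TERM INT
  # wait returns early when a trapped signal is delivered; the trap then
  # forwards the signal, waits for the child to exit, and ends the wrapper.
  wait "$child"
}
```

For a quick local check, run_forwarding sleep 300 in one terminal and kill -TERM on the wrapper's PID from another should end both processes promptly. The child has to run in the background for this to work: a shell blocked on a foreground command does not run its trap until that command exits.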

3. Config Change Detection

Problem: Kubernetes doesn't restart pods when a referenced ConfigMap changes.

Solution: Add a checksum annotation to the pod template:

annotations:
  checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}

Result: A config change advances the StatefulSet revision. Under RollingUpdate this auto-restarts pods; under OnDelete (this chart's default, not the Kubernetes default) it marks pods as "needs update" so the operator can apply it on their own schedule (see Gotcha #5). Either way, pods only pick up new config when they restart.
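Under OnDelete it helps to see which pods are still on the old revision. Kubernetes labels each StatefulSet pod with controller-revision-hash and reports the target revision in the StatefulSet's status.updateRevision. The sketch below compares the two; stale_pods is a hypothetical helper, and the resource names in the comment are assumptions that depend on the release name.

```shell
#!/bin/sh
# Sketch only: stale_pods is a hypothetical helper.
# Usage: stale_pods UPDATE_REVISION
# Reads "kubectl get pods -L controller-revision-hash --no-headers" output on
# stdin (pod name in the first column, revision hash in the last) and prints
# the pods whose hash differs from UPDATE_REVISION.
stale_pods() {
  awk -v want="$1" '$NF != want { print $1 }'
}

# Against a live cluster (names below are assumptions):
#   want=$(kubectl get statefulset solace-cache \
#            -o jsonpath='{.status.updateRevision}')
#   kubectl get pods -l app.kubernetes.io/name=solace-cache \
#     -L controller-revision-hash --no-headers | stale_pods "$want"
```

An empty result means every pod already runs the current revision.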

4. Health Checks: Liveness vs Readiness

Liveness uses a process check. The process name is SolaceCache with capital S and C:

livenessProbe:
  exec:
    command: ["pgrep", "-f", "SolaceCache"]

NOT: "solcache" or "solaceCache" - must match exactly.

Readiness is driven by the wrapper script, which watches the cache logs and toggles two marker files. The probe is ready only when both exist:

readinessProbe:
  exec:
    command: ["sh", "-c", "[ -f /tmp/cache-state-up ] && [ -f /tmp/lost-msg-clear ]"]

  • /tmp/cache-state-up: created on the log line "State changed to: UP", removed on any other state change.
  • /tmp/lost-msg-clear: created on LOST_MSG_STATE_CLEAR, removed on LOST_MSG_STATE_SET.

This requires INFO-level CACHE_LOG_LEVEL (the wrapper greps the log lines), which is why debugCacheLogLevel: false yields INFO rather than a quieter level.
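The marker-file logic can be sketched as a small line-by-line loop. This is an illustration, not the wrapper's actual code: watch_markers and is_ready are hypothetical names, and the sketch assumes each log line carries at most one of the events.

```shell
#!/bin/sh
# Sketch only: watch_markers and is_ready are hypothetical helpers.
# Usage: tail -F <cache log> | watch_markers /tmp
watch_markers() {
  dir=$1
  while IFS= read -r line; do
    case $line in
      # Order matters: the UP pattern must be tested before the generic one.
      *"State changed to: UP"*) : > "$dir/cache-state-up" ;;
      *"State changed to:"*)    rm -f "$dir/cache-state-up" ;;
      *LOST_MSG_STATE_CLEAR*)   : > "$dir/lost-msg-clear" ;;
      *LOST_MSG_STATE_SET*)     rm -f "$dir/lost-msg-clear" ;;
    esac
  done
}

# Same condition as the readiness probe, parameterised on the directory.
is_ready() {
  [ -f "$1/cache-state-up" ] && [ -f "$1/lost-msg-clear" ]
}
```

Because both markers start out absent, a pod stays unready until the log has shown both the UP transition and a LOST_MSG_STATE_CLEAR event.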


File Structure

Configuration Files

  • values.yaml - Base configuration with sensible defaults

    • 2 replicas (HA by default)
    • INFO log levels
    • PodDisruptionBudget enabled
    • Preferred pod anti-affinity
  • values-standalone.yaml - Single instance for dev/UAT

    • 1 replica but with PDB protection
    • DEBUG log levels
    • Higher probe failure thresholds
  • values-prod-ha.yaml - Production overrides only

    • Custom registry and image pull secrets
    • Higher resource limits (4Gi/4000m)
    • SSL enabled for broker connection
    • Required (strict) pod anti-affinity

Templates

  • statefulset.yaml - Main workload with init container and wrapper-script container
  • service.yaml - Headless service (clusterIP: None)
  • configmap.yaml - Config template with placeholders
  • wrapper-script-configmap.yaml - cache-wrapper.sh: launches SolaceCache, forwards signals, drives readiness markers
  • secret.yaml - Broker credentials, keyed username-<ordinal> / password-<ordinal>
  • poddisruptionbudget.yaml - HA protection (minAvailable: 1)
  • serviceaccount.yaml - Optional RBAC

(No Helm test is shipped: the liveness probe already verifies the SolaceCache process continuously, which is what a one-shot test would have checked. The cache image is Ubuntu + the binary only - no kubectl - so an in-cluster exec-based test would also need a kubectl image plus exec RBAC.)

Operator Tooling (not part of the packaged chart)

  • scripts/kubectl-backup-cache - SEMP-driven cache backup plugin
  • scripts/kubectl-restore-cache - SEMP-driven cache restore plugin
  • scripts/copy-cache-contents.sh - simpler standalone backup helper

Documentation

  • README.md - Complete usage guide
  • QUICKSTART.md - Fast deployment instructions
  • NOTES.txt - Post-install instructions displayed to user
  • PROJECT_CONTEXT.md - This file

Configuration Parameters

Key Settings in values.yaml

Instance Identity

solaceCache:
  instanceNames:
    - "cache-instance-0"  # Must match broker config
    - "cache-instance-1"
  distributedCacheName: "my-distributed-cache"

Broker Connection

  broker:
    host: "tcp://solace-broker:55555"
    vpn: "default"
    usernames:                  # one per replica; single entry reused for all
      - "cache-user"
    passwords:                  # one per replica; or use existingSecret
      - "cache-password"
    existingSecret: ""          # Recommended for production (keys username-N/password-N)

Logging

  settings:
    sdkLogLevel: "NOTICE"       # Solace API/SDK logging
    debugCacheLogLevel: false   # false => CACHE_LOG_LEVEL INFO (required by readiness wrapper); true => DEBUG

HA Protection

podDisruptionBudget:
  enabled: true
  minAvailable: 1  # At least 1 pod must remain during disruptions

Deployment Patterns

Development/UAT (Single Instance with Protection)

helm install my-cache . -f values-standalone.yaml

  • 1 replica with PDB enabled (maximum uptime)
  • DEBUG logging for troubleshooting
  • Suitable for non-production environments

Production HA

helm install prod-cache . -f values-prod-ha.yaml

  • 2 replicas on separate nodes (required anti-affinity)
  • SSL-enabled broker connection
  • Higher resource allocations
  • Secrets-based credentials

Upgrading Configuration

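helm upgrade my-cache . -f my-values.yaml

  • Default strategy is OnDelete: the upgrade stages the change but does NOT restart pods. Apply it manually, one ordinal at a time: kubectl delete pod <name>-1, verify, then kubectl delete pod <name>-0.
  • With updateStrategy.type: RollingUpdate, the upgrade auto-restarts pods in reverse ordinal order (1, then 0). The StatefulSet controller replaces one pod at a time and waits for it to be Running and Ready before moving to the next, which keeps one of the two pods available. The PDB does not gate this; it applies only to evictions such as node drains.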
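The manual OnDelete procedure can be scripted roughly as follows. This is a sketch under stated assumptions: roll_pods and the KUBECTL override are hypothetical, pod names are assumed to be <name>-<ordinal>, and the Ready condition is only a proxy for the "verify" step, which an operator may prefer to do by hand.

```shell
#!/bin/sh
# Sketch only: roll_pods is a hypothetical helper, not shipped with the chart.
# Usage: roll_pods STATEFULSET_NAME REPLICAS
# Set KUBECTL=echo for a dry run that prints the commands instead.
roll_pods() {
  name=$1
  i=$(( $2 - 1 ))
  while [ "$i" -ge 0 ]; do      # highest ordinal first
    ${KUBECTL:-kubectl} delete pod "$name-$i" --wait=true || return 1
    # The controller recreates the pod; kubectl wait fails while the new pod
    # does not exist yet, so retry a bounded number of times.
    tries=0
    until ${KUBECTL:-kubectl} wait --for=condition=Ready "pod/$name-$i" --timeout=120s; do
      tries=$(( tries + 1 ))
      [ "$tries" -lt 60 ] || return 1
      sleep 5
    done
    i=$(( i - 1 ))
  done
}
```

Stopping at the first pod that fails to become Ready leaves the remaining pod on the old revision and still serving.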

Testing and Validation

Pre-Install Validation

helm lint .
helm template test . --debug

Post-Install Verification

# Check pod status
kubectl get pods -l app.kubernetes.io/name=solace-cache

# View logs
kubectl logs -f solace-cache-0
kubectl logs -f solace-cache-1

# Check config generated correctly
kubectl exec solace-cache-0 -- cat /home/solace/config/config.txt

Health Check

kubectl exec solace-cache-0 -- pgrep -f SolaceCache

This should print the PID of the SolaceCache process. If the output is empty, the cache process is not running.


Known Issues and Gotchas

1. Process Name Must Be Exact

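  • The liveness probe uses pgrep -f SolaceCache (readiness uses marker files instead)
  • The pattern is case-sensitive: capital S and capital C
  • The manual health check under Testing and Validation uses the same pattern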

2. Init Container Uses Plain Shell (busybox)

  • The init container is busybox and parses with POSIX shell (grep/cut/awk) — no yq
  • It extracts the pod ordinal from $HOSTNAME and picks the matching instance name
  • The instanceNames list is injected by Helm at render time as a space-separated string

3. PDB and Single Replica

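  • A PDB with minAvailable: 1 on a single replica allows zero disruptions, so every eviction is refused
  • This is intentional for the "pets" philosophy
  • Node drains will be blocked until the PDB is deleted or the pod is deleted directly (a plain pod delete does not go through the eviction API, so the PDB does not apply)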

4. Image Repository Placeholder

  • Default values.yaml has placeholder: your-registry/solace-cache
  • Must be updated before deployment
  • Production values already override this

5. StatefulSet Update Strategy (manual restart by default)

  • Configurable via updateStrategy in values; defaults to OnDelete
  • OnDelete: helm upgrade stages changes but does NOT restart pods — the operator applies them by deleting pods (kubectl delete pod <name>), one ordinal at a time, at a chosen time. Chosen as the default because cache instances are "pets" whose restart timing should be operator-controlled.
  • RollingUpdate: Kubernetes auto-restarts pods (highest ordinal first) when the pod template changes; supports staged rollouts via rollingUpdate.partition. values-standalone.yaml uses this for dev convenience.
  • Note: each pod generates its config once at startup (init container → emptyDir), so a changed ConfigMap has no effect on a running pod until it restarts under either strategy. The checksum/config pod annotation advances the StatefulSet revision on config change, which (a) auto-triggers restart under RollingUpdate and (b) marks pods as "needs update" (a staged-change signal) under OnDelete.

Future Enhancements

Potential Improvements

  1. Monitoring - Add Prometheus metrics if SolaceCache exposes them
  2. Backup Strategy - Document or automate cache state backup
  3. Multi-Region - Add topology spread constraints for zone awareness
  4. Readiness Gates - External validation before marking pod ready
  5. Config Validation - Pre-flight checks in init container
  6. Syslog Integration - Enable syslog forwarding for centralized logging

Not Implemented (By Design)

  • HPA - Autoscaling disabled; replicas should be manually controlled
  • Ingress - No inbound traffic; cache connects to broker only
  • Persistence - Uses emptyDir; cache state is ephemeral per pod lifecycle

Troubleshooting Guide

Pods Not Starting

kubectl describe pod solace-cache-0
kubectl logs solace-cache-0 -c init-config  # Check init container

  • Verify instanceNames array has enough entries for replica count
  • Check secret exists if using existingSecret
  • Ensure image is pullable from registry

Slow Shutdown

  • Verify the wrapper script (/scripts/cache-wrapper.sh) is the container command in the StatefulSet
  • Check that the wrapper is PID 1 and SolaceCache runs as its child: kubectl exec <pod> -- ps aux

Config Not Updating

  • Check if checksum annotation is in statefulset.yaml pod template
  • Verify ConfigMap was actually changed
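  • Force restart under RollingUpdate: kubectl rollout restart statefulset/solace-cache. Under the default OnDelete this only stages a new revision, so delete the pods manually instead: kubectl delete pod <name>, one ordinal at a time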

Health Probes Failing

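  • Liveness: verify the process name: kubectl exec <pod> -- pgrep -f SolaceCache
  • Readiness: check that both marker files exist: kubectl exec <pod> -- ls /tmp/cache-state-up /tmp/lost-msg-clear. If they never appear, confirm CACHE_LOG_LEVEL is INFO or DEBUG, since the wrapper depends on those log lines
  • Check startup time; may need longer initialDelaySeconds
  • Review logs for crash loops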

PDB Blocking Node Drain

kubectl get pdb
kubectl describe pdb solace-cache

  • Expected behavior for "pets" with single replica
  • Temporarily disable: kubectl delete pdb solace-cache
  • Re-enable after drain: helm upgrade <release> . --reuse-values

Packaging and Distribution

Create Chart Archive

cd /path/to/solace-cache
helm package .

Produces: solace-cache-0.1.0.tgz

Install from Package

helm install my-cache solace-cache-0.1.0.tgz -f my-values.yaml

Version Management

Update Chart.yaml version field, then repackage:

version: 0.2.0  # Chart version
appVersion: 1.0.12  # Application version

Repository Information

  • GitHub: Posted by user (April 2026)
  • Chart Type: Application
  • License: Not specified
  • Maintainer: Add to Chart.yaml if needed

Summary

This Helm chart implements a production-ready deployment for Solace Cache with:

  • ✅ Unique per-pod configuration via init container
  • ✅ Fast graceful shutdown via signal-forwarding wrapper script
  • ✅ Operator-controlled (manual) restarts by default; RollingUpdate optional
  • ✅ HA protection with PodDisruptionBudget
  • ✅ Node distribution via pod anti-affinity
  • ✅ Multiple deployment profiles (dev/prod)
  • ✅ Comprehensive documentation and validation steps

Philosophy: Treat cache instances as stateful "pets" requiring maximum uptime and careful handling during disruptions. The chart prioritizes availability and correctness over scalability.