Solace Cache Helm Chart - Implementation Context

Chart Version: 0.1.0
App Version: 1.0.11
Created: April 2026
Purpose: Production-ready Helm chart for deploying Solace Cache instances (Linux C API application)

Architecture Overview

Design Philosophy: "Pets, Not Cattle"

These cache instances are stateful "pets" that must maintain maximum uptime
Each pod has a unique identity and specific configuration
Focus on HA protection and graceful handling of disruptions
Use StatefulSet for stable pod identities and ordered deployment

Key Components

StatefulSet - Main workload (not Deployment)
Headless Service - DNS resolution for pods (no inbound ports needed)
ConfigMap - Template config with placeholders for per-pod substitution
Secret - Broker credentials (username/password)
PodDisruptionBudget - Prevents voluntary disruptions from affecting availability
Init Container - Generates unique config per pod before main container starts

Critical Implementation Details

1. Per-Pod Configuration Strategy

Problem: Each cache instance needs a unique CACHE_INSTANCE_NAME that matches the broker's distributed cache configuration.

Solution: A busybox init container extracts the pod ordinal and selects from the instanceNames list (passed in by Helm as a space-separated string), then substitutes placeholders in the config template. Credentials are read per-ordinal from a mounted secret. No yq dependency — plain POSIX shell:

# In init container (see templates/statefulset.yaml)
POD_ORDINAL=$(echo "$HOSTNAME" | grep -o '[0-9]*$')
INSTANCE_NAMES="{{ join " " .Values.solaceCache.instanceNames }}"
INSTANCE_NAME=$(echo "$INSTANCE_NAMES" | cut -d' ' -f"$((POD_ORDINAL + 1))")
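
A more defensive variant of the selection step is sketched below. It is not the chart's code: the function name pick_instance_name and the error messages are illustrative. One reason for a bounds check is that cut passes a line through unchanged when it contains no delimiter, so with a single-entry list every ordinal would resolve to the same name.

```shell
# Hypothetical helper: print the instance name for a pod hostname.
# $1 = pod hostname (e.g. solace-cache-1)
# $2 = space-separated instance names, as rendered by Helm
pick_instance_name() {
  host=$1
  ordinal=${host##*-}                 # text after the last '-'
  case "$ordinal" in
    ''|*[!0-9]*) echo "no ordinal in hostname: $host" >&2; return 1 ;;
  esac
  set -- $2                           # split the list into $1, $2, ...
  if [ "$ordinal" -ge "$#" ]; then
    echo "no instanceNames entry for ordinal $ordinal" >&2
    return 1
  fi
  shift "$ordinal"                    # ordinal 0 keeps the first entry
  printf '%s\n' "$1"
}
```

Calling pick_instance_name "$HOSTNAME" "$INSTANCE_NAMES" in the init container would then fail the pod early, instead of starting a cache with a duplicate or empty name.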

Substitution is done with awk, not sed: the values (instance name plus the values for __SESSION_USERNAME__ / __SESSION_PASSWORD__, read per-ordinal from the mounted secret) are exported into the environment and replaced by a subst() function that does literal substring replacement. This avoids sed mangling credentials that contain its delimiter, the & replacement metacharacter, or a backslash:

# values passed via ENVIRON, replaced literally (no regex/delimiter interpretation)
$0 = subst($0, "__CACHE_INSTANCE_NAME__", ENVIRON["INSTANCE_NAME"])
$0 = subst($0, "__SESSION_USERNAME__",    ENVIRON["USERNAME"])
$0 = subst($0, "__SESSION_PASSWORD__",    ENVIRON["PASSWORD"])
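
The body of subst() is not shown above. One way such a function could be written, as a sketch only, uses index() and substr() so that neither the placeholder nor the value is interpreted as a pattern. The render_config wrapper is a hypothetical name; the template is read from stdin.

```shell
# Hypothetical wrapper: render a config template from stdin, taking the
# values from the shell variables INSTANCE_NAME, USERNAME and PASSWORD.
render_config() {
  INSTANCE_NAME="$INSTANCE_NAME" USERNAME="$USERNAME" PASSWORD="$PASSWORD" awk '
    # Replace every occurrence of "from" in "s" with "to", treating both
    # as plain strings (index/substr only, no regex, no & handling).
    function subst(s, from, to,    out, i) {
      out = ""
      while ((i = index(s, from)) > 0) {
        out = out substr(s, 1, i - 1) to
        s = substr(s, i + length(from))
      }
      return out s
    }
    {
      $0 = subst($0, "__CACHE_INSTANCE_NAME__", ENVIRON["INSTANCE_NAME"])
      $0 = subst($0, "__SESSION_USERNAME__",    ENVIRON["USERNAME"])
      $0 = subst($0, "__SESSION_PASSWORD__",    ENVIRON["PASSWORD"])
      print
    }
  '
}
```

Reading the values through ENVIRON rather than awk -v matters here: -v assignments go through escape-sequence processing, which would alter a password containing a backslash.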

2. Signal Handling for Fast Shutdown

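Problem: The container was taking about 30 seconds to shut down. It did not exit on SIGTERM, so termination ran until the grace period expired (30 seconds is the Kubernetes default).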

Solution: The container runs a wrapper script (templates/wrapper-script-configmap.yaml, mounted at /scripts/cache-wrapper.sh) that launches SolaceCache in the background and installs a trap on TERM/INT to forward SIGTERM to the cache process and wait for it to exit cleanly:

trap "kill -TERM ${CACHE_PID}; wait ${CACHE_PID}; ...; exit 0" TERM INT

The wrapper also tails the cache logs to drive the readiness probe (see below). Result: Shutdown time reduced from 30s to ~2s.

Note: an earlier approach used exec so SolaceCache became PID 1 directly. That was replaced by the wrapper once we needed log-watching for readiness; the wrapper now owns PID 1 and is responsible for signal forwarding.
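
The forwarding pattern can be sketched in isolation as follows. This is not the shipped cache-wrapper.sh: run_with_forwarding is a hypothetical name, and the log-watching part is left out.

```shell
# Hypothetical sketch: start "$@" in the background, forward TERM/INT to
# it as SIGTERM, and return once the child has exited.
run_with_forwarding() {
  "$@" &
  child=$!
  # Single quotes: $child is expanded when the trap fires.
  trap 'kill -TERM "$child" 2>/dev/null' TERM INT
  # wait returns early when a trapped signal arrives, so keep waiting
  # until the child is really gone.
  while kill -0 "$child" 2>/dev/null; do
    wait "$child"
  done
  trap - TERM INT
}
```

The loop around wait is the part that is easy to miss: without it the wrapper could exit while the child is still shutting down.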

3. Config Change Detection

Problem: Kubernetes doesn't restart pods when a referenced ConfigMap changes.

Solution: Add a checksum annotation to the pod template:

annotations:
  checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}

Result: A config change advances the StatefulSet revision. Under RollingUpdate this auto-restarts pods; under the default OnDelete it marks pods as "needs update" so the operator can apply it on their own schedule (see Gotcha #5). Either way, pods only pick up new config when they restart.

4. Health Checks: Liveness vs Readiness

Liveness uses a process check. The process name is SolaceCache with capital S and C:

livenessProbe:
  exec:
    command: ["pgrep", "-f", "SolaceCache"]

Variants such as "solcache" or "solaceCache" will not match; the pattern is case-sensitive and must match the process name exactly.

Readiness is driven by the wrapper script, which watches the cache logs and toggles two marker files. The probe is ready only when both exist:

readinessProbe:
  exec:
    command: ["sh", "-c", "[ -f /tmp/cache-state-up ] && [ -f /tmp/lost-msg-clear ]"]

/tmp/cache-state-up — created on State changed to: UP, removed on any other state change.
/tmp/lost-msg-clear — created on LOST_MSG_STATE_CLEAR, removed on LOST_MSG_STATE_SET.

This requires CACHE_LOG_LEVEL to be at least INFO, because the wrapper greps the log lines for these messages. That is why debugCacheLogLevel: false yields INFO rather than a quieter level.
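
The marker rules above can be expressed as a small state update per log line. The sketch below is illustrative rather than the wrapper's actual code: update_markers and is_ready are hypothetical names, and the log format is an assumption beyond the quoted strings.

```shell
# Hypothetical helper: apply one log line to the marker files.
# $1 = log line, $2 = marker directory (the chart uses /tmp)
update_markers() {
  line=$1
  dir=${2:-/tmp}
  case "$line" in
    *"State changed to: UP"*) : > "$dir/cache-state-up" ;;   # create
    *"State changed to: "*)   rm -f "$dir/cache-state-up" ;; # any other state
  esac
  case "$line" in
    *LOST_MSG_STATE_CLEAR*) : > "$dir/lost-msg-clear" ;;
    *LOST_MSG_STATE_SET*)   rm -f "$dir/lost-msg-clear" ;;
  esac
}

# Same condition as the readiness probe, parameterised on the directory.
is_ready() {
  [ -f "${1:-/tmp}/cache-state-up" ] && [ -f "${1:-/tmp}/lost-msg-clear" ]
}
```

In a wrapper, update_markers would be called once per line read from the cache log, so the probe itself stays a cheap file test.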

File Structure

Configuration Files

values.yaml - Base configuration with sensible defaults
- 2 replicas (HA by default)
- INFO log levels
- PodDisruptionBudget enabled
- Preferred pod anti-affinity
values-standalone.yaml - Single instance for dev/UAT
- 1 replica but with PDB protection
- DEBUG log levels
- Higher probe failure thresholds
values-prod-ha.yaml - Production overrides only
- Custom registry and image pull secrets
- Higher resource limits (4Gi/4000m)
- SSL enabled for broker connection
- Required (strict) pod anti-affinity

Templates

statefulset.yaml - Main workload with init container and wrapper-script container
service.yaml - Headless service (clusterIP: None)
configmap.yaml - Config template with placeholders
wrapper-script-configmap.yaml - cache-wrapper.sh: launches SolaceCache, forwards signals, drives readiness markers
secret.yaml - Broker credentials, keyed username-<ordinal> / password-<ordinal>
poddisruptionbudget.yaml - HA protection (minAvailable: 1)
serviceaccount.yaml - Optional RBAC

(No Helm test is shipped: the liveness probe already verifies the SolaceCache process continuously, which is what a one-shot test would have checked. The cache image is Ubuntu + the binary only - no kubectl - so an in-cluster exec-based test would also need a kubectl image plus exec RBAC.)

Operator Tooling (not part of the packaged chart)

scripts/kubectl-backup-cache - SEMP-driven cache backup plugin
scripts/kubectl-restore-cache - SEMP-driven cache restore plugin
scripts/copy-cache-contents.sh - simpler standalone backup helper

Documentation

README.md - Complete usage guide
QUICKSTART.md - Fast deployment instructions
NOTES.txt - Post-install instructions displayed to user
PROJECT_CONTEXT.md - This file

Configuration Parameters

Key Settings in values.yaml

Instance Identity

solaceCache:
  instanceNames:
    - "cache-instance-0"  # Must match broker config
    - "cache-instance-1"
  distributedCacheName: "my-distributed-cache"

Broker Connection

  broker:
    host: "tcp://solace-broker:55555"
    vpn: "default"
    usernames:                  # one per replica; single entry reused for all
      - "cache-user"
    passwords:                  # one per replica; or use existingSecret
      - "cache-password"
    existingSecret: ""          # Recommended for production (keys username-N/password-N)
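
For the existingSecret path, the Secret needs one username-N / password-N pair per ordinal. As a rough sketch, the helper below prints a kubectl create secret generic command built from per-ordinal credential files, so it can be reviewed before running. The helper name secret_create_cmd and the file layout are assumptions, not part of the chart.

```shell
# Hypothetical helper: print (not run) a kubectl command that creates the
# Secret from files named username-N / password-N.
# $1 = secret name, $2 = directory holding the files, $3 = replica count
secret_create_cmd() {
  name=$1
  dir=$2
  replicas=$3
  cmd="kubectl create secret generic $name"
  i=0
  while [ "$i" -lt "$replicas" ]; do
    cmd="$cmd --from-file=username-$i=$dir/username-$i"
    cmd="$cmd --from-file=password-$i=$dir/password-$i"
    i=$((i + 1))
  done
  printf '%s\n' "$cmd"
}
```

For example, secret_create_cmd cache-creds ./creds 2 prints a command with four --from-file flags. The files should not end with a trailing newline, because --from-file stores their content verbatim.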

Logging

  settings:
    sdkLogLevel: "NOTICE"       # Solace API/SDK logging
    debugCacheLogLevel: false   # false => CACHE_LOG_LEVEL INFO (required by readiness wrapper); true => DEBUG

HA Protection

podDisruptionBudget:
  enabled: true
  minAvailable: 1  # At least 1 pod must remain during disruptions

Deployment Patterns

Development/UAT (Single Instance with Protection)

helm install my-cache . -f values-standalone.yaml

1 replica with PDB enabled (maximum uptime)
DEBUG logging for troubleshooting
Suitable for non-production environments

Production HA

helm install prod-cache . -f values-prod-ha.yaml

2 replicas on separate nodes (required anti-affinity)
SSL-enabled broker connection
Higher resource allocations
Secrets-based credentials

Upgrading Configuration

helm upgrade my-cache . -f my-values.yaml

Default strategy is OnDelete: this stages the change but does NOT restart pods. Apply it manually, one ordinal at a time: kubectl delete pod <name>-1, verify, then <name>-0.
With updateStrategy.type: RollingUpdate, the upgrade auto-restarts pods in reverse ordinal order (1, then 0); PDB keeps ≥1 pod available during the roll.
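
Under OnDelete it helps to see which pods still run the old revision. The sketch below compares each pod's controller-revision-hash label with the StatefulSet's status.updateRevision, both of which Kubernetes maintains for StatefulSets. The function name list_stale_pods is hypothetical, and the StatefulSet name and label selector are placeholders for whatever the release actually uses.

```shell
# Hypothetical helper: print pods whose revision differs from the
# StatefulSet's current update revision.
# $1 = StatefulSet name, $2 = pod label selector
list_stale_pods() {
  target=$(kubectl get statefulset "$1" \
    -o jsonpath='{.status.updateRevision}') || return 1
  kubectl get pods -l "$2" --no-headers \
    -o custom-columns=NAME:.metadata.name,REV:.metadata.labels.controller-revision-hash |
  while read -r pod rev; do
    [ "$rev" = "$target" ] || printf '%s\n' "$pod"
  done
}
```

Running list_stale_pods solace-cache app.kubernetes.io/name=solace-cache after a helm upgrade would list the pods that still need a manual delete. An empty result means every pod already runs the staged revision.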

Testing and Validation

Pre-Install Validation

helm lint .
helm template test . --debug

Post-Install Verification

# Check pod status
kubectl get pods -l app.kubernetes.io/name=solace-cache

# View logs
kubectl logs -f solace-cache-0
kubectl logs -f solace-cache-1

# Check config generated correctly
kubectl exec solace-cache-0 -- cat /home/solace/config/config.txt

Health Check

kubectl exec solace-cache-0 -- pgrep -f SolaceCache

Should return a PID. If the output is empty, the SolaceCache process is not running in the container.

Known Issues and Gotchas

1. Process Name Must Be Exact

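The liveness probe uses pgrep -f SolaceCache (readiness uses marker files instead)
Must match capital S and C
The manual health check (kubectl exec ... pgrep -f SolaceCache) uses the same pattern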

2. Init Container Uses Plain Shell (busybox)

The init container is busybox and parses with POSIX shell (grep/cut/awk) — no yq
It extracts the pod ordinal from $HOSTNAME and picks the matching instance name
The instanceNames list is injected by Helm at render time as a space-separated string

3. PDB and Single Replica

PDB with minAvailable: 1 on single replica prevents all voluntary disruptions
This is intentional for "pets" philosophy
Node drains will be blocked until PDB is deleted or pod is force-evicted
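
The arithmetic behind this is simple: evictions are allowed only while the number of healthy pods exceeds minAvailable. A minimal sketch follows, with allowed_disruptions as an illustrative name; the real value is reported in the ALLOWED DISRUPTIONS column of kubectl get pdb.

```shell
# Hypothetical helper: disruptions a minAvailable PDB would permit.
# $1 = currently healthy pods, $2 = minAvailable
allowed_disruptions() {
  n=$(( $1 - $2 ))
  [ "$n" -gt 0 ] || n=0     # never negative
  printf '%s\n' "$n"
}
```

With one healthy pod and minAvailable: 1 this gives 0, which is why a drain blocks. With two healthy pods it gives 1, so one pod at a time can be evicted.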

4. Image Repository Placeholder

Default values.yaml has placeholder: your-registry/solace-cache
Must be updated before deployment
Production values already override this

5. StatefulSet Update Strategy (manual restart by default)

Configurable via updateStrategy in values; defaults to OnDelete
OnDelete: helm upgrade stages changes but does NOT restart pods — the operator applies them by deleting pods (kubectl delete pod <name>), one ordinal at a time, at a chosen time. Chosen as the default because cache instances are "pets" whose restart timing should be operator-controlled.
RollingUpdate: Kubernetes auto-restarts pods (highest ordinal first) when the pod template changes; supports staged rollouts via rollingUpdate.partition. values-standalone.yaml uses this for dev convenience.
Note: each pod generates its config once at startup (init container → emptyDir), so a changed ConfigMap has no effect on a running pod until it restarts under either strategy. The checksum/config pod annotation advances the StatefulSet revision on config change, which (a) auto-triggers restart under RollingUpdate and (b) marks pods as "needs update" (a staged-change signal) under OnDelete.

Future Enhancements

Potential Improvements

Monitoring - Add Prometheus metrics if SolaceCache exposes them
Backup Strategy - Document or automate cache state backup
Multi-Region - Add topology spread constraints for zone awareness
Readiness Gates - External validation before marking pod ready
Config Validation - Pre-flight checks in init container
Syslog Integration - Enable syslog forwarding for centralized logging

Not Implemented (By Design)

HPA - Autoscaling disabled; replicas should be manually controlled
Ingress - No inbound traffic; cache connects to broker only
Persistence - Uses emptyDir; cache state is ephemeral per pod lifecycle

Troubleshooting Guide

Pods Not Starting

kubectl describe pod solace-cache-0
kubectl logs solace-cache-0 -c init-config  # Check init container

Verify instanceNames array has enough entries for replica count
Check secret exists if using existingSecret
Ensure image is pullable from registry

Slow Shutdown

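Verify the container command in the StatefulSet is the wrapper script (/scripts/cache-wrapper.sh) and that its TERM/INT trap is intact
Check that the wrapper is PID 1 and SolaceCache runs as its child: kubectl exec pod -- ps aux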

Config Not Updating

Check if checksum annotation is in statefulset.yaml pod template
Verify ConfigMap was actually changed
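Force restart: kubectl rollout restart statefulset/solace-cache (restarts pods only under RollingUpdate; under the default OnDelete, delete the pods manually, one ordinal at a time)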

Health Probes Failing

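Liveness: verify process name: kubectl exec pod -- pgrep -f SolaceCache
Readiness: verify both marker files exist: kubectl exec pod -- ls /tmp/cache-state-up /tmp/lost-msg-clear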
Check startup time; may need longer initialDelaySeconds
Review logs for crash loops

PDB Blocking Node Drain

kubectl get pdb
kubectl describe pdb solace-cache

Expected behavior for "pets" with single replica
Temporarily disable: kubectl delete pdb solace-cache
Re-enable after drain: helm upgrade <release> . --reuse-values

Packaging and Distribution

Create Chart Archive

cd /path/to/solace-cache
helm package .

Produces: solace-cache-0.1.0.tgz

Install from Package

helm install my-cache solace-cache-0.1.0.tgz -f my-values.yaml

Version Management

Update Chart.yaml version field, then repackage:

version: 0.2.0  # Chart version
appVersion: 1.0.12  # Application version

Repository Information

GitHub: Posted by user (April 2026)
Chart Type: Application
License: Not specified
Maintainer: Add to Chart.yaml if needed

Summary

This Helm chart implements a production-ready deployment for Solace Cache with:

✅ Unique per-pod configuration via init container
✅ Fast graceful shutdown via signal-forwarding wrapper script
✅ Operator-controlled (manual) restarts by default; RollingUpdate optional
✅ HA protection with PodDisruptionBudget
✅ Node distribution via pod anti-affinity
✅ Multiple deployment profiles (dev/prod)
✅ Comprehensive documentation and validation steps

Philosophy: Treat cache instances as stateful "pets" requiring maximum uptime and careful handling during disruptions. The chart prioritizes availability and correctness over scalability.
