Problem Statement

When using ECS Anywhere to manage on-premises services, a critical issue occurs when network connectivity is restored after an outage:

Current behavior: Containers that restarted during the connectivity loss are immediately stopped and removed when connectivity returns, causing unexpected downtime of critical services.

This is unacceptable for production environments running critical services that require high availability.

Solution

This PR implements four configurable restart strategies that allow operators to choose how eINS (the ECS external instance network sentry) handles restarted containers when connectivity is restored, including options for zero-downtime transitions.

New Restart Strategies

1. cleanup (default) - Backward Compatible

  • Original behavior: Stops and removes restarted containers
  • Use case: Non-critical services where brief downtime is acceptable
  • Maintains backward compatibility with existing deployments

2. preserve - Simple Zero-Downtime

  • Behavior: Keeps all restarted containers running indefinitely
  • Use case: Critical services requiring zero downtime, accepting manual cleanup later
  • Simply resets restart policy and unpauses agent
  • May result in orphaned containers

3. graceful-cutover - Recommended for Critical Services ⭐

  • Behavior: Waits for ECS to launch replacement containers before stopping old ones
  • Use case: Critical services needing zero downtime with eventual ECS management
  • Keeps agent paused until replacements are detected and running
  • Performs seamless cutover from old to new containers
  • Configurable timeout (default 5 minutes via --cutover-timeout)
  • Zero unexpected downtime while maintaining ECS control

4. manual - Full Operator Control

  • Behavior: Requires manual intervention before any changes
  • Use case: High-security or highly-critical environments
  • Agent stays paused and logs a warning
  • Operator reviews and manually unpauses agent when ready
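
Taken together, the four strategies map to roughly the following behavior when connectivity returns. The sketch below is minimal and illustrative: the helper names (stop_and_remove, unpause_agent, wait_for_replacements) are placeholders rather than the script's actual identifiers, and the timeout fallback for graceful-cutover is an assumption.

import logging

def on_connectivity_restored(strategy, restarted_containers,
                             stop_and_remove, unpause_agent, wait_for_replacements):
    if strategy == "cleanup":
        # Original behavior: drop the restarted containers and let ECS redeploy.
        stop_and_remove(restarted_containers)
        unpause_agent()
    elif strategy == "preserve":
        # Keep the restarted containers running and hand control back to ECS
        # (the real script also resets the Docker restart policy here).
        unpause_agent()
    elif strategy == "graceful-cutover":
        # Stay paused until replacement tasks are running, then cut over.
        replacements = wait_for_replacements(restarted_containers)
        if replacements:
            stop_and_remove(restarted_containers)
        else:
            # Assumption: on timeout the old containers are left running
            # and a warning is logged, rather than forcing downtime.
            logging.warning("Cutover timed out; keeping restarted containers")
        unpause_agent()
    elif strategy == "manual":
        # Leave everything paused; an operator decides when to resume.
        logging.warning("restart-strategy=manual: agent remains paused until "
                        "an operator unpauses it")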

Usage

Basic Examples

# Default behavior (backward compatible)
python3 ecs-external-instance-network-sentry.py --region us-east-1

# Preserve strategy - keep containers running
python3 ecs-external-instance-network-sentry.py --region us-east-1 --restart-strategy preserve

# Graceful cutover - zero downtime transition (RECOMMENDED for critical services)
python3 ecs-external-instance-network-sentry.py --region us-east-1 --restart-strategy graceful-cutover

# Graceful cutover with custom timeout
python3 ecs-external-instance-network-sentry.py --region us-east-1 --restart-strategy graceful-cutover --cutover-timeout 600

# Manual control
python3 ecs-external-instance-network-sentry.py --region us-east-1 --restart-strategy manual

Systemd Service Configuration

Update /lib/systemd/system/ecs-external-instance-network-sentry.service:

[Service]
Type=simple
Restart=on-failure
RestartSec=10s
ExecStart=python3 /usr/bin/ecs-external-instance-network-sentry.py --region us-east-1 --restart-strategy graceful-cutover --cutover-timeout 300
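
After updating the unit file, reload systemd and restart the service so the new flags take effect, e.g. systemctl daemon-reload followed by systemctl restart ecs-external-instance-network-sentry.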

How Graceful-Cutover Works

  1. Network outage detected: Agent paused, containers set to on-failure restart policy
  2. Container restarts during outage: Docker automatically restarts failed containers
  3. Connectivity restored: eINS detects the restored connection and identifies restarted containers
  4. Waiting phase: Agent remains paused, restarted containers continue running
  5. ECS launches replacements: Control plane deploys fresh task instances
  6. Cutover triggered: Once replacements are detected, the old containers are stopped and removed
  7. Complete: Agent unpaused, ECS resumes normal management

Result: Critical services experience zero unexpected downtime!
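
For reference, the waiting phase (steps 4-6) can be pictured as a polling loop like the one below. This is a minimal sketch assuming the Docker SDK for Python (the docker package) and the standard ECS container labels; the function name wait_for_replacements and the 5-second poll interval are illustrative, not the script's actual implementation.

import time
import docker

def wait_for_replacements(restarted_container_ids, timeout=300):
    """Poll the local Docker daemon until ECS launches replacement containers,
    or the --cutover-timeout window elapses."""
    client = docker.from_env()
    deadline = time.time() + timeout
    while time.time() < deadline:
        running = client.containers.list(filters={"status": "running"})
        # ECS-managed containers carry labels such as com.amazonaws.ecs.task-arn;
        # a replacement is any running ECS container that is not one of the
        # restarted containers being protected.
        replacements = [
            c for c in running
            if "com.amazonaws.ecs.task-arn" in (c.labels or {})
            and c.id not in restarted_container_ids
        ]
        if replacements:
            return replacements
        time.sleep(5)
    return []  # timed out; the caller logs a warning and decides what to do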

Implementation Details

New Configuration Parameters

  • --restart-strategy (choices: cleanup, preserve, graceful-cutover, manual)

    • Default: cleanup
    • Controls behavior when connectivity returns
  • --cutover-timeout (integer, seconds)

    • Default: 300 (5 minutes)
    • Only applies to graceful-cutover strategy
    • Max time to wait for replacement containers
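
As a sketch, the two options could be declared with argparse as below; the option names, choices, and defaults match the PR, while the parser description and the --region declaration are illustrative.

import argparse

parser = argparse.ArgumentParser(
    description="ECS external instance network sentry (eINS)")
parser.add_argument("--region", required=True,
                    help="AWS region of the ECS cluster")
parser.add_argument("--restart-strategy",
                    choices=["cleanup", "preserve", "graceful-cutover", "manual"],
                    default="cleanup",
                    help="How to handle containers that restarted while "
                         "connectivity was lost (default: cleanup)")
parser.add_argument("--cutover-timeout", type=int, default=300,
                    help="Maximum seconds to wait for replacement containers; "
                         "only used with --restart-strategy graceful-cutover")
args = parser.parse_args()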

Enhanced Logging

All strategies include comprehensive logging:

  • Strategy selection and configuration
  • Container restart detection
  • Cutover progress (for graceful-cutover)
  • Warnings when a timeout occurs or manual intervention is required

State Tracking

Added state variables for graceful-cutover:

  • cutover_in_progress: Tracks if cutover is active
  • cutover_start_time: Timestamp when cutover began
  • restarted_containers: Dictionary mapping container IDs to metadata
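
A minimal sketch of that state, with the field names taken from the PR description; the per-container metadata shown is an assumption, and the real script may track different fields.

import time
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class CutoverState:
    cutover_in_progress: bool = False           # True while waiting for replacements
    cutover_start_time: Optional[float] = None  # time.time() when the cutover began
    # container ID -> metadata captured when the restart was detected
    restarted_containers: Dict[str, dict] = field(default_factory=dict)

def record_restarted_container(state, container):
    # Illustrative metadata; cutover_in_progress and cutover_start_time are
    # set later, when connectivity returns and the graceful-cutover wait begins.
    state.restarted_containers[container.id] = {
        "name": container.name,
        "task_arn": container.labels.get("com.amazonaws.ecs.task-arn"),
        "detected_at": time.time(),
    }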

Files Changed

  • python/ecs-external-instance-network-sentry.py: Core implementation
  • README.md: Documentation for new parameters and strategies
  • .gitignore: Added to exclude Python cache files

Testing Recommendations

  1. Test in non-production first: Deploy with graceful-cutover strategy
  2. Simulate outage: Block connectivity to ECS endpoint
  3. Trigger container restart: Kill a container process during outage
  4. Monitor logs: Watch cutover process when connectivity returns
  5. Verify zero downtime: Ensure services remain available throughout

Backward Compatibility

This change is fully backward compatible:

  • Default behavior unchanged (cleanup strategy)
  • Existing deployments continue working without modification
  • New parameters are optional

Benefits

✅ Prevents unexpected downtime of critical services
✅ Provides multiple strategies for different use cases
✅ Zero-downtime cutover option with graceful-cutover
✅ Full operator control with manual mode
✅ Comprehensive logging for visibility
✅ Configurable timeouts for flexibility
✅ Backward compatible with existing deployments

🤖 Generated with Claude Code
