Problem Statement

When using ECS Anywhere to manage on-premises services, a critical issue occurs when network connectivity is restored after an outage:

Current behavior: Containers that restarted during the connectivity loss are immediately stopped and removed when connectivity returns, causing unexpected downtime of critical services.

This is unacceptable for production environments running critical services that require high availability.

Solution

This PR implements four configurable restart strategies that allow operators to choose how eINS (the ECS external instance network sentry) handles restarted containers when connectivity is restored, including options for zero-downtime transitions.

New Restart Strategies

1. cleanup (default) - Backward Compatible

  • Original behavior: Stops and removes restarted containers
  • Use case: Non-critical services where brief downtime is acceptable
  • Maintains backward compatibility with existing deployments

2. preserve - Simple Zero-Downtime

  • Behavior: Keeps all restarted containers running indefinitely
  • Use case: Critical services requiring zero downtime, accepting manual cleanup later
  • Simply resets restart policy and unpauses agent
  • May result in orphaned containers

3. graceful-cutover - Recommended for Critical Services ⭐

  • Behavior: Waits for ECS to launch replacement containers before stopping old ones
  • Use case: Critical services needing zero downtime with eventual ECS management
  • Keeps agent paused until replacements are detected and running
  • Performs seamless cutover from old to new containers
  • Configurable timeout (default 5 minutes via --cutover-timeout)
  • Zero unexpected downtime while maintaining ECS control

4. manual - Full Operator Control

  • Behavior: Requires manual intervention before any changes
  • Use case: High-security or highly-critical environments
  • Agent stays paused and logs a warning
  • Operator reviews and manually unpauses agent when ready
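
Taken together, the four strategies map to roughly the following behavior when connectivity returns. The sketch below is minimal and illustrative: the helper names (stop_and_remove, unpause_agent, wait_for_replacements) are placeholders rather than the script's actual identifiers, and the timeout fallback for graceful-cutover is an assumption.

import logging

def on_connectivity_restored(strategy, restarted_containers,
                             stop_and_remove, unpause_agent, wait_for_replacements):
    if strategy == "cleanup":
        # Original behavior: drop the restarted containers and let ECS redeploy.
        stop_and_remove(restarted_containers)
        unpause_agent()
    elif strategy == "preserve":
        # Keep the restarted containers running and hand control back to ECS
        # (the real script also resets the Docker restart policy here).
        unpause_agent()
    elif strategy == "graceful-cutover":
        # Stay paused until replacement tasks are running, then cut over.
        replacements = wait_for_replacements(restarted_containers)
        if replacements:
            stop_and_remove(restarted_containers)
        else:
            # Assumption: on timeout the old containers are left running
            # and a warning is logged, rather than forcing downtime.
            logging.warning("Cutover timed out; keeping restarted containers")
        unpause_agent()
    elif strategy == "manual":
        # Leave everything paused; an operator decides when to resume.
        logging.warning("restart-strategy=manual: agent remains paused until "
                        "an operator unpauses it")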

Usage

Basic Examples

# Default behavior (backward compatible)
python3 ecs-external-instance-network-sentry.py --region us-east-1

# Preserve strategy - keep containers running
python3 ecs-external-instance-network-sentry.py --region us-east-1 --restart-strategy preserve

# Graceful cutover - zero downtime transition (RECOMMENDED for critical services)
python3 ecs-external-instance-network-sentry.py --region us-east-1 --restart-strategy graceful-cutover

# Graceful cutover with custom timeout
python3 ecs-external-instance-network-sentry.py --region us-east-1 --restart-strategy graceful-cutover --cutover-timeout 600

# Manual control
python3 ecs-external-instance-network-sentry.py --region us-east-1 --restart-strategy manual

Systemd Service Configuration

Update /lib/systemd/system/ecs-external-instance-network-sentry.service:

[Service]
Type=simple
Restart=on-failure
RestartSec=10s
ExecStart=python3 /usr/bin/ecs-external-instance-network-sentry.py --region us-east-1 --restart-strategy graceful-cutover --cutover-timeout 300
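
After updating the unit file, reload systemd and restart the service so the new flags take effect, e.g. systemctl daemon-reload followed by systemctl restart ecs-external-instance-network-sentry.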

How Graceful-Cutover Works

  1. Network outage detected: Agent paused, containers set to on-failure restart policy
  2. Container restarts during outage: Docker automatically restarts failed containers
  3. Connectivity restored: eINS detects the restored connection and identifies restarted containers
  4. Waiting phase: Agent remains paused, restarted containers continue running
  5. ECS launches replacements: Control plane deploys fresh task instances
  6. Cutover triggered: Once replacements are detected, the old containers are stopped and removed
  7. Complete: Agent unpaused, ECS resumes normal management

Result: Critical services experience zero unexpected downtime!
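
For reference, the waiting phase (steps 4-6) can be pictured as a polling loop like the one below. This is a minimal sketch assuming the Docker SDK for Python (the docker package) and the standard ECS container labels; the function name wait_for_replacements and the 5-second poll interval are illustrative, not the script's actual implementation.

import time
import docker

def wait_for_replacements(restarted_container_ids, timeout=300):
    """Poll the local Docker daemon until ECS launches replacement containers,
    or the --cutover-timeout window elapses."""
    client = docker.from_env()
    deadline = time.time() + timeout
    while time.time() < deadline:
        running = client.containers.list(filters={"status": "running"})
        # ECS-managed containers carry labels such as com.amazonaws.ecs.task-arn;
        # a replacement is any running ECS container that is not one of the
        # restarted containers being protected.
        replacements = [
            c for c in running
            if "com.amazonaws.ecs.task-arn" in (c.labels or {})
            and c.id not in restarted_container_ids
        ]
        if replacements:
            return replacements
        time.sleep(5)
    return []  # timed out; the caller logs a warning and decides what to do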

Implementation Details

New Configuration Parameters

  • --restart-strategy (choices: cleanup, preserve, graceful-cutover, manual)

    • Default: cleanup
    • Controls behavior when connectivity returns
  • --cutover-timeout (integer, seconds)

    • Default: 300 (5 minutes)
    • Only applies to graceful-cutover strategy
    • Max time to wait for replacement containers
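
As a sketch, the two options could be declared with argparse as below; the option names, choices, and defaults match the PR, while the parser description and the --region declaration are illustrative.

import argparse

parser = argparse.ArgumentParser(
    description="ECS external instance network sentry (eINS)")
parser.add_argument("--region", required=True,
                    help="AWS region of the ECS cluster")
parser.add_argument("--restart-strategy",
                    choices=["cleanup", "preserve", "graceful-cutover", "manual"],
                    default="cleanup",
                    help="How to handle containers that restarted while "
                         "connectivity was lost (default: cleanup)")
parser.add_argument("--cutover-timeout", type=int, default=300,
                    help="Maximum seconds to wait for replacement containers; "
                         "only used with --restart-strategy graceful-cutover")
args = parser.parse_args()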

Enhanced Logging

All strategies include comprehensive logging:

  • Strategy selection and configuration
  • Container restart detection
  • Cutover progress (for graceful-cutover)
  • Warnings when a timeout occurs or manual intervention is required

State Tracking

Added state variables for graceful-cutover:

  • cutover_in_progress: Tracks if cutover is active
  • cutover_start_time: Timestamp when cutover began
  • restarted_containers: Dictionary mapping container IDs to metadata
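
A minimal sketch of that state, with the field names taken from the PR description; the per-container metadata shown is an assumption, and the real script may track different fields.

import time
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class CutoverState:
    cutover_in_progress: bool = False           # True while waiting for replacements
    cutover_start_time: Optional[float] = None  # time.time() when the cutover began
    # container ID -> metadata captured when the restart was detected
    restarted_containers: Dict[str, dict] = field(default_factory=dict)

def record_restarted_container(state, container):
    # Illustrative metadata; cutover_in_progress and cutover_start_time are
    # set later, when connectivity returns and the graceful-cutover wait begins.
    state.restarted_containers[container.id] = {
        "name": container.name,
        "task_arn": container.labels.get("com.amazonaws.ecs.task-arn"),
        "detected_at": time.time(),
    }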

Files Changed

  • python/ecs-external-instance-network-sentry.py: Core implementation
  • README.md: Documentation for new parameters and strategies
  • .gitignore: Added to exclude Python cache files

Testing Recommendations

  1. Test in non-production first: Deploy with graceful-cutover strategy
  2. Simulate outage: Block connectivity to ECS endpoint
  3. Trigger container restart: Kill a container process during outage
  4. Monitor logs: Watch cutover process when connectivity returns
  5. Verify zero downtime: Ensure services remain available throughout

Backward Compatibility

This change is fully backward compatible:

  • Default behavior unchanged (cleanup strategy)
  • Existing deployments continue working without modification
  • New parameters are optional

Benefits

✅ Prevents unexpected downtime of critical services
✅ Provides multiple strategies for different use cases
✅ Zero-downtime cutover option with graceful-cutover
✅ Full operator control with manual mode
✅ Comprehensive logging for visibility
✅ Configurable timeouts for flexibility
✅ Backward compatible with existing deployments

🤖 Generated with Claude Code
