2 changes: 1 addition & 1 deletion VERSION
Original file line number Diff line number Diff line change
@@ -1 +1 @@
1.2.0
1.2.0-surge
@@ -0,0 +1,56 @@
# Argo Rollouts: maxSurge/maxUnavailable Ignored with Traffic Shifting

## Issue Summary

**Problem**: `maxSurge` and `maxUnavailable` settings are ignored in traffic-routed canary deployments.

**Root Cause**: An intentional design decision that prioritizes traffic control over scaling limits.

## Documented Issues

| Issue # | Title | Status |
|---------|-------|--------|
| #2239 | Traffic Routing and maxSurge/maxUnavailable | Open/Design Discussion |
| #3284 | [Request] Support maxSurge/maxUnavailable with traffic routing | Open |
| #3539 | maxSurge and maxUnavailable ignored when using traffic routing | Open |
| #3397 | maxSurge/maxUnavailable not respected with traffic routing | Open |

## Technical Details

**Code Location**: `utils/replicaset/canary.go`

**Basic Canary**: `CalculateReplicaCountsForBasicCanary()` respects maxSurge/maxUnavailable.

**Traffic-Routed Canary**: `CalculateReplicaCountsForTrafficRoutedCanary()` ignores these settings.

**Maintainer Rationale** (from CNCF Slack): Traffic routing prioritizes traffic control over pod scaling limits.

## Functionality Matrix

| Feature | Basic Canary | Traffic-Routed Canary |
|---------|-------------|----------------------|
| maxSurge/maxUnavailable | ✅ Supported | ❌ Ignored by design |
| Traffic Weight Control | ❌ N/A | ✅ Supported |
| scaleDownDelaySeconds | ❌ Validation error | ✅ Supported |

## Current Workarounds

**MinPodsPerReplicaSet**: Sets a minimum number of pods per ReplicaSet but does not cap the maximum surge.

**Manual Canary Steps**: Users define many small steps so that each step adds fewer pods, which slows scaling.
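To illustrate why smaller steps slow scaling, here is a minimal Go sketch. It assumes the canary is sized as the ceiling of `replicas * weight / 100` at each weight step. The function name `podsAddedPerStep` and the sizing formula are illustrative assumptions, not code taken from Argo Rollouts.

```go
package main

import "fmt"

// podsAddedPerStep is a hypothetical helper. For each canary weight step it
// returns how many canary pods that step adds, assuming the canary is sized
// as ceil(specReplicas * weight / 100).
func podsAddedPerStep(specReplicas int32, weights []int32) []int32 {
	added := make([]int32, 0, len(weights))
	prev := int32(0)
	for _, w := range weights {
		// Integer ceiling of specReplicas*w/100.
		canary := (specReplicas*w + 99) / 100
		added = append(added, canary-prev)
		prev = canary
	}
	return added
}

func main() {
	// Two coarse steps: each step adds 50 pods at once.
	fmt.Println(podsAddedPerStep(100, []int32{50, 100})) // [50 50]
	// Five finer steps: each step adds only 20 pods.
	fmt.Println(podsAddedPerStep(100, []int32{20, 40, 60, 80, 100})) // [20 20 20 20 20]
}
```

Finer steps bound the per-step growth, but nothing in this approach caps the total number of pods, which is the gap `maxSurge` would otherwise fill.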

## Design Philosophy Conflict

**Maintainer View**: Traffic control takes precedence over scaling limits.

**User View**: Users need scaling controls to prevent "very rapid infrastructure scaling".

## Manual Exploration Required

**Investigate Further**:
- Review GitHub issue #2239 for maintainer design discussion
- Examine CNCF Slack discussions on traffic routing philosophy
- Test behavior differences between basic and traffic-routed canaries
- Analyze impact of dynamicStableScale on scaling limits

**Key Question**: Should traffic routing support both traffic control AND scaling limits?
@@ -0,0 +1,42 @@
# Current State Analysis: maxSurge/maxUnavailable Ignored with Traffic Shifting

## Issue Summary

Traffic-routed canary deployments ignore `maxSurge` and `maxUnavailable` settings, unlike basic canary deployments.

## Critical Code Locations

**Basic Canary**: `utils/replicaset/canary.go:CalculateReplicaCountsForBasicCanary()` - respects maxSurge/maxUnavailable.

**Traffic-Routed Canary**: `utils/replicaset/canary.go:CalculateReplicaCountsForTrafficRoutedCanary()` - ignores these settings.

**Code Flow**:
1. `rolloutCanary()` → `reconcileCanaryReplicaSets()`
2. `reconcileNewReplicaSet()` → `NewRSNewReplicas()`
3. Branches based on traffic routing presence
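The branch in step 3 can be sketched as follows. This is a simplified illustration that assumes the decision rests only on whether a traffic routing block is present. The types `trafficRouting` and `canaryStrategy` and the helper `replicaCalculatorFor` are hypothetical stand-ins, not the actual rollout spec types.

```go
package main

import "fmt"

// trafficRouting and canaryStrategy are hypothetical stand-ins for the
// rollout spec types; only the presence check matters in this sketch.
type trafficRouting struct{}

type canaryStrategy struct {
	TrafficRouting *trafficRouting // nil when no traffic router is configured
}

// replicaCalculatorFor returns the name of the calculation function that
// would be chosen for the given strategy under this sketch's assumption.
func replicaCalculatorFor(c canaryStrategy) string {
	if c.TrafficRouting == nil {
		return "CalculateReplicaCountsForBasicCanary"
	}
	return "CalculateReplicaCountsForTrafficRoutedCanary"
}

func main() {
	fmt.Println(replicaCalculatorFor(canaryStrategy{}))
	fmt.Println(replicaCalculatorFor(canaryStrategy{TrafficRouting: &trafficRouting{}}))
}
```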

## Current Behavior

**Basic Canary**: Uses `MaxSurge(rollout)` to limit total replica count.

**Traffic-Routed Canary**: Calculates replicas based solely on traffic weights, so the total replica count can exceed the configured limits.
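A minimal Go sketch of weight-driven sizing shows how the total can exceed the spec. It is a simplified model with two assumptions: the stable ReplicaSet stays at full size unless `dynamicStableScale` is enabled, and the canary is sized as the ceiling of `replicas * weight / 100`. The function `trafficRoutedCounts` is hypothetical and is not the body of `CalculateReplicaCountsForTrafficRoutedCanary()`.

```go
package main

import "fmt"

// ceilPercent returns ceil(n * pct / 100) using integer arithmetic.
func ceilPercent(n, pct int32) int32 {
	return (n*pct + 99) / 100
}

// trafficRoutedCounts is a hypothetical, simplified model of weight-driven
// scaling. No maxSurge or maxUnavailable value is consulted anywhere.
func trafficRoutedCounts(specReplicas, weight int32, dynamicStableScale bool) (canary, stable int32) {
	canary = ceilPercent(specReplicas, weight)
	stable = specReplicas // assumption: stable stays at full size by default
	if dynamicStableScale {
		// assumption: stable shrinks to the remaining share of traffic
		stable = ceilPercent(specReplicas, 100-weight)
	}
	return canary, stable
}

func main() {
	canary, stable := trafficRoutedCounts(10, 50, false)
	fmt.Println(canary, stable, canary+stable) // 5 10 15

	canary, stable = trafficRoutedCounts(10, 100, false)
	fmt.Println(canary, stable, canary+stable) // 10 10 20
}
```

Under these assumptions a 10-replica rollout reaches 20 pods at 100% weight, regardless of any configured `maxSurge`.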

## Expected Behavior

Traffic-routed canaries should respect maxSurge/maxUnavailable to prevent excessive scaling.

## Key Findings

- **Root Cause**: `CalculateReplicaCountsForTrafficRoutedCanary()` lacks maxSurge/maxUnavailable logic
- **Impact**: Unbounded scaling potential in traffic-routed deployments
- **Affected Scenarios**: All canary rollouts with traffic routing (e.g., Istio, ALB, SMI)

## Manual Exploration Required

**Investigate Further**:
- Compare code implementations between basic and traffic-routed canary functions
- Review test cases to confirm behavior differences
- Examine how traffic weights drive scaling decisions
- Test scaling behavior with different traffic routing providers

**Key Question**: Why do the two canary implementations have different scaling logic?
@@ -0,0 +1,29 @@
# Relevance Assessment: maxSurge/maxUnavailable Ignored with Traffic Shifting

## Issue Relevance: HIGH

**Status**: Still present in current codebase.

## Technical Evidence

**Basic Canary**: `CalculateReplicaCountsForBasicCanary()` uses `MaxSurge(rollout)` to respect limits.

**Traffic-Routed Canary**: `CalculateReplicaCountsForTrafficRoutedCanary()` ignores maxSurge/maxUnavailable.

**Test Evidence**: Code review confirms traffic-routed tests don't validate surge behavior.

## Impact Analysis

**Operational Impact**: Excessive scaling affects cluster autoscaling efficiency and cost.

**Affected Users**: All users of traffic-routed canaries (e.g., Istio, ALB, SMI) with configured scaling limits.

## Manual Exploration Required

**Investigate Further**:
- Review recent commits for any changes to scaling logic
- Test behavior with different traffic routing configurations
- Analyze impact on cluster autoscaling systems
- Examine user reports of scaling issues

**Key Question**: How does this issue affect real-world deployment costs and performance?
@@ -0,0 +1,36 @@
# Historical Analysis: maxSurge/maxUnavailable Evolution

## Key Commits

| Commit | Date | Message | Impact |
|--------|------|---------|--------|
| c1353e4a44 | 2021-09-21 | feat: support dynamic scaling of stable ReplicaSet (#1430) | Added dynamic stable scaling for traffic-routed canaries without maxSurge support |
| 0f93f9b82e | 2022-01-14 | fix!: improve basic canary and honor maxSurge (#1759) | Added maxSurge to basic canary only |

## Design Philosophy

**Traffic Routing**: Intentionally prioritizes traffic control over pod scaling limits (maintainer stance from issue #2239).

**Basic Canary**: Respects Kubernetes deployment scaling controls.

## Community Issues

**Multiple Requests**: Issues #3284, #3539, and #3397 all request maxSurge/maxUnavailable support despite the maintainers' stated design.

**User Impact**: Reports of "very rapid infrastructure scaling" causing cost and resource issues.

## Current Workarounds

**MinPodsPerReplicaSet**: Sets a minimum number of pods per ReplicaSet but does not cap the maximum surge.

**Manual Steps**: Users define many small canary steps so that each step adds fewer pods, which slows scaling.

## Manual Exploration Required

**Investigate Further**:
- Review commit c1353e4a44, which added dynamic scaling of the stable ReplicaSet without scaling limits
- Examine CNCF Slack discussions on design philosophy
- Analyze evolution of basic vs traffic-routed canary implementations
- Review user issue reports for scaling problems

**Key Question**: When did the design philosophy divergence between canary types occur?
@@ -0,0 +1,35 @@
# Upstream Analysis: maxSurge/maxUnavailable with Traffic Shifting

## Documented Issues

| Issue # | Title | Status | Key Insight |
|---------|-------|--------|-------------|
| #2239 | Traffic Routing and maxSurge/maxUnavailable | Open | Zach Aller explains intentional design decision |
| #1759 | improve basic canary and honor maxSurge | Closed | Added maxSurge to basic canary only |
| #3284 | Support maxSurge/maxUnavailable with traffic routing | Open | Community feature request |
| #3539 | maxSurge and maxUnavailable ignored | Open | User reports scaling issues |
| #3397 | maxSurge/maxUnavailable not respected | Open | Another scaling issue report |

## Community Context

**Maintainer Stance**: Traffic routing intentionally prioritizes traffic control over scaling limits.

**User Demand**: Multiple issues report "very rapid infrastructure scaling" problems.

**Design Conflict**: Traffic control vs pod scaling limits creates fundamental tension.

## Implementation Patterns

**Basic Canary**: Properly implements maxSurge via `maxReplicaCountAllowed = rolloutSpecReplica + maxSurge`.

**Traffic-Routed Canary**: Has no maxSurge/maxUnavailable logic and scales based on traffic weights only.
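The basic canary cap can be illustrated with a short Go sketch built around the `maxReplicaCountAllowed = rolloutSpecReplica + maxSurge` expression quoted above. The function `capCanaryBySurge` and its parameters are hypothetical; the real `CalculateReplicaCountsForBasicCanary()` handles more cases (for example `maxUnavailable` and older ReplicaSets) that this sketch leaves out.

```go
package main

import "fmt"

// capCanaryBySurge is a hypothetical helper. It clamps the desired canary
// size so that stable + canary never exceeds rolloutSpecReplica + maxSurge.
func capCanaryBySurge(rolloutSpecReplica, maxSurge, stableCount, desiredCanary int32) int32 {
	maxReplicaCountAllowed := rolloutSpecReplica + maxSurge
	headroom := maxReplicaCountAllowed - stableCount
	if headroom < 0 {
		headroom = 0 // already over the limit: no new canary pods
	}
	if desiredCanary > headroom {
		return headroom
	}
	return desiredCanary
}

func main() {
	// 10 replicas, maxSurge 2, stable still at 10: only 2 canary pods fit.
	fmt.Println(capCanaryBySurge(10, 2, 10, 5)) // 2
	// Once stable has scaled down to 8, the desired 3 canary pods fit.
	fmt.Println(capCanaryBySurge(10, 2, 8, 3)) // 3
}
```

A similar clamp applied to the weight-derived canary count is one possible shape for surge support in the traffic-routed path, though whether that fits the maintainers' design is exactly the open question.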

## Manual Exploration Required

**Investigate Further**:
- Review maintainer discussions in issue #2239
- Examine CNCF Slack threads on traffic routing design
- Compare with other progressive delivery tools
- Analyze user cost impact reports

**Key Question**: Is the design philosophy conflict resolvable, or is this a fundamental limitation?
@@ -0,0 +1,48 @@
# Contribution Epic: maxSurge/maxUnavailable with Traffic Shifting

## Epic Overview

**Problem**: maxSurge/maxUnavailable are ignored in traffic-routed canaries because of an intentional design that prioritizes traffic control over scaling limits.

**Challenge**: Requires either changing the maintainers' design philosophy or accepting a fundamental limitation.

## Exploration Directions

### Design Philosophy Evaluation
- Assess whether traffic routing should support both traffic control AND scaling limits
- Evaluate community consensus from multiple open issues
- Analyze impact of dynamicStableScale on implementation feasibility

### Technical Approaches
- Compare absolute vs relative interpretations of scaling limits
- Examine integration with existing traffic weight logic
- Evaluate validation warnings for unsupported configurations
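As one illustration of the validation-warning idea, the following Go sketch flags configurations that set surge limits alongside traffic routing. The `canarySpec` type, its fields, and the warning text are all hypothetical. They are not part of the Argo Rollouts validation package.

```go
package main

import "fmt"

// canarySpec is a hypothetical, trimmed-down view of a canary strategy.
type canarySpec struct {
	HasTrafficRouting bool
	MaxSurge          string // empty when unset, otherwise e.g. "25%" or "2"
	MaxUnavailable    string // empty when unset
}

// surgeWarnings returns one warning per scaling limit that is set but would
// be ignored because traffic routing is configured.
func surgeWarnings(c canarySpec) []string {
	var warnings []string
	if !c.HasTrafficRouting {
		return warnings
	}
	if c.MaxSurge != "" {
		warnings = append(warnings, "maxSurge is set but ignored when traffic routing is configured")
	}
	if c.MaxUnavailable != "" {
		warnings = append(warnings, "maxUnavailable is set but ignored when traffic routing is configured")
	}
	return warnings
}

func main() {
	spec := canarySpec{HasTrafficRouting: true, MaxSurge: "25%", MaxUnavailable: "0"}
	for _, w := range surgeWarnings(spec) {
		fmt.Println("warning:", w)
	}
}
```

A warning of this kind would not change scaling behavior, but it would make the ignored settings visible at apply time rather than during a rollout.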

### Workaround Enhancement
- Improve MinPodsPerReplicaSet documentation
- Provide infrastructure scaling best practices
- Enhance guidance for manual canary step approaches

## Open Questions

- Should the design philosophy be revisited to allow scaling limits?
- How does dynamicStableScale affect implementation?
- What are the trade-offs between absolute and relative approaches?
- Can MinPodsPerReplicaSet better address scaling concerns?

## Success Criteria

1. Clear understanding of design philosophy conflict
2. Improved documentation for current workarounds
3. Community consensus on implementation direction
4. Enhanced user experience for scaling control

## Manual Exploration Required

**Investigate Further**:
- Engage maintainers in issue #2239 discussion
- Survey community sentiment across multiple issues
- Prototype scaling limit implementations
- Test impact on traffic routing behavior

**Key Question**: Is there a technical solution that respects both traffic control and scaling limits?
@@ -0,0 +1,36 @@
# Argo Rollouts Ignores maxSurge/maxUnavailable in Traffic Shifting

## Source-Backed Facts

### Current Implementation
- **Traffic-routed canaries ignore maxSurge/maxUnavailable**: Code in `utils/replicaset/canary.go` shows `CalculateReplicaCountsForTrafficRoutedCanary` does not apply these limits
- **Basic canary respects limits**: `CalculateReplicaCountsForBasicCanary` applies maxSurge/maxUnavailable to total rollout replicas
- **Design decision**: Traffic routing prioritizes traffic control over pod scaling limits (maintainer position from GitHub issues)

### GitHub Issues
- **#3284**: Feature request to support maxSurge/maxUnavailable with traffic routing
- **#3539**: Report that maxSurge and maxUnavailable are ignored when using traffic routing
- **#3397**: Report that maxSurge/maxUnavailable are not respected with traffic routing
- **#2239**: Design discussion on traffic routing and maxSurge/maxUnavailable

### Code References
- `rollout/canary.go`: Traffic weight to replica calculations
- `utils/replicaset/canary.go`: Replica count calculation functions
- `utils/conditions/conditions.go`: RolloutHealthy function with availability requirements

## Manual Exploration Required

### Implementation Questions
- How does `dynamicStableScale` setting affect scaling behavior?
- What are the exact replica calculation differences between basic and traffic-routed canaries?
- How does `minPodsPerReplicaSet` interact with traffic routing?

### Design Philosophy Questions
- Should traffic routing continue to override scaling limits?
- What are the trade-offs between traffic control and scaling control?
- How do other progressive delivery tools handle this conflict?

### User Impact Questions
- What specific scaling problems do users encounter without these limits?
- How effective are current workarounds (manual steps, minPodsPerReplicaSet)?
- What validation warnings should be provided for unsupported configurations?