Commit a61a2d3

Author: Nebojsa Prodana
DODO-2452: scaledown analysis
1 parent 448888f commit a61a2d3

11 files changed: +1298 -0 lines changed
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
# Argo Rollouts Issues Analysis

| Issue Name | Description | Supporting Evidence / Useful Links | Notes |
|------------|-------------|-----------------------------------|-------|
| Support scaleDownDelaySeconds & fast rollbacks with canary strategy | Currently argo-rollouts only supports fast-track rollback when a canary deployment is in progress. The enhancement requests adding support for keeping the previous version around for scaleDownDelaySeconds (similar to blue-green strategy) to allow fast rollback for canary deployments in case metric checks don't catch regressions. | [GitHub Issue #557](https://github.com/argoproj/argo-rollouts/issues/557) | Blue-green strategy already supports this feature with scaleDownDelaySeconds. This would bring feature parity between deployment strategies and improve rollback capabilities for canary deployments. |
| Argo-rollouts ignores maxSurge and maxUnavailable when traffic shifting is used | When traffic shifting is used, argo-rollouts ignores the maxSurge and maxUnavailable settings, which can impact cluster autoscaling by putting additional pressure on Karpenter to binpack or provide new nodes. | [Support scaleDownDelaySeconds & fast rollbacks with canary strategy](https://github.com/argoproj/argo-rollouts/issues/557) | Can affect cluster autoscaling by putting additional pressure on Karpenter to binpack or provide new nodes. Combined with flaky health checks and aggressive autoscaling that larger services might be unwittingly using, this can lead to long deployment times per cluster. |
| Argo-rollouts waits for stable RS to be stable before scaling it down | When used on a large scale with a cluster autoscaler that can disrupt nodes and evict pods, the canary RS stays scaled-up for a while until the stable RS is fully scaled. This makes sense if the controller scaled down the stable RS during the rollout (using dynamicStableScale), but it doesn't make sense if it didn't. | [GitHub PR #3899](https://github.com/argoproj/argo-rollouts/pull/3899) | This behavior can cause resource inefficiency and increased costs when the stable RS wasn't scaled down during rollout but the canary RS remains scaled up unnecessarily. |

argo-rollouts

argo-rollouts waits for stable RS to be stable before scaling it down

https://github.com/argoproj/argo-rollouts/pull/3899

When used on a large scale with a cluster autoscaler that can disrupt nodes and evict pods, the canary RS stays scaled-up for a while until the stable RS is fully scaled. This makes sense if the controller scaled down the stable RS during the rollout (using dynamicStableScale), but it doesn't make sense if it didn't.
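
The autoscaling pressure described in the maxSurge row can be made concrete with a little arithmetic. The sketch below is a rough illustration, not Argo Rollouts code. It assumes that with traffic shifting the stable ReplicaSet stays at full size (no `dynamicStableScale`) while the canary ReplicaSet is sized to the current traffic weight, and it compares that with the bound a percentage `maxSurge` would give if it were honoured. The function names and replica counts are hypothetical.

```python
def ceil_percent(replicas: int, percent: int) -> int:
    """Round replicas * percent / 100 up, using integer arithmetic only."""
    return (replicas * percent + 99) // 100


def peak_pods_traffic_shifting(replicas: int, canary_weight_pct: int) -> int:
    """Pods requested at a given canary weight.

    Assumption: the stable ReplicaSet is left at full size and the canary
    ReplicaSet is scaled to match the traffic weight.
    """
    return replicas + ceil_percent(replicas, canary_weight_pct)


def peak_pods_with_max_surge(replicas: int, max_surge_pct: int) -> int:
    """Upper bound on pods if a percentage maxSurge were respected."""
    return replicas + ceil_percent(replicas, max_surge_pct)


if __name__ == "__main__":
    replicas = 100  # hypothetical service size
    for weight in (10, 50, 100):
        print(weight, peak_pods_traffic_shifting(replicas, weight))
    print("maxSurge 25%:", peak_pods_with_max_surge(replicas, 25))
```

Under these assumptions a rollout that reaches 100% canary weight briefly asks the cluster for twice the service's replica count, which is the extra capacity the node autoscaler has to find.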

issue_analysis/point_scale.md

Lines changed: 334 additions & 0 deletions
@@ -0,0 +1,334 @@
We have agreed to use the following for our story pointing guideline.

![point scale table](point_scale_table.png)

Additional pointers on Story Points

We should aim to break tickets up into the smallest pieces that still deliver value in “logical chunks”.

We should aim to have a majority of 1, 2, 3 point tickets, slightly fewer 5 point tickets, and fewer still 8 point tickets.

Where possible, tickets that could span multiple sprints should be avoided.

In a week-long Dodo sprint, 8 point tickets are high risk and should be avoided if possible.

It’s OK to give tickets 8 and 13 points, but treat them as placeholder tickets indicating work that needs more research and needs to be broken up into smaller tickets.

Under no circumstances will a 13 point ticket be allowed into a sprint.

Everything below this line is notes taken from courses on the topic of Sprint management. They are included here for information on good practice and may be of interest, but should not be taken as gospel for Dodo.

Creating tickets/stories/tasks (currently just notes, will refine as we progress)

Reduce scope as much as is reasonable. The Pareto principle.

DoD (Definition of Done, not acceptance criteria):

Focus on the valuable outcomes - What matters to our customers?

Evolve over time based on feedback and experience.

Keep it as concise as possible.

Don’t overthink edge-cases.

Make it visible to any and all stakeholders.

Examples:

How do we test?

“stuff is tested” - What does this mean specifically?

Response times?

The Sprint

Required elements of a sprint:

What do we want to achieve? - Goal

How will we achieve it? - Plan

How will we keep on track? - Scrum

How will we know if we achieved it? - Review

How will we do better next time? - Retro

We need all of the above together for each individual piece to make sense. (Think of a stone arch: take one block out and it will collapse.)

Sprint Planning

Inputs

Objective

Backlog

Product increment

Capacity & past performance

1 improvement from retro

Outputs

Sprint backlog (the board with lots of tickets initially in “Todo”)

Sprint Backlog

The Board.

The team’s plan to achieve the Sprint Goal.

It will change and adapt as more is learnt throughout the Sprint.

Add tickets and remove them as additional details are learnt.

Changes in scope are fine.

If you bring things in, do other tickets have to go out? (Probably)

Should you adjust your goal? (Hopefully not, but possible if necessary)

If change happens regularly, it’s a symptom of not enough planning.

Predict the predictable, embrace the surprises.

Daily Scrum

Assess the current state of the plan.

NOT just a status update.

“Are we on track? If not, what should we do about it?”

Sprint Review

We don’t do this very well in Skyscanner.

Have a separate Zoom link for Planning, Review & Retro.

Opportunity for stakeholders to be present

Gain perspective.

What has been accomplished this sprint?

What challenges have we experienced?

What might come next and are there any risks?

Discuss what competitors have done recently.

Discuss market changes and future opportunities.

Recent example: ChatGPT - What does this mean for us?

Should we explore ways to use it?

Should we put it on our PDTs?

New libraries, new frameworks?

Sprint Retro

How we worked together in the sprint

The retro is explicitly about seeking improvements

Consider:

Individuals

Interactions

Processes

Tools

DoD

Select >= 1 action from the discussion to improve the next sprint.

Some kind of retro should happen each sprint. It can be a small thing for one or two weeks, then a bigger thing the following week.

Occasionally the retro should be highly focused on a specific topic.

Don’t just do the same retro format each week. It can be the same most weeks, but mix it up to liven things up and focus on different aspects of the sprint.

Overall

Timeboxed (<1 month) at a consistent duration.

Sprints deliver value by solving a meaningful problem. Would a stakeholder be willing to spend time (or money) to upgrade to what you do in that sprint?

Sprints protect the team from distractions and changes in direction.

Scrum Team

Cross functional

Multi-skilled

Stable composition of a team

Constantly changing team limits psychological safety

Difficult to understand strengths and weaknesses.

<= 10 people

Self-organising

Leaders will emerge

Different people will naturally start to take different roles.

Squad leads should encourage self-organisation

Non-hierarchical

Different levels will have different ideas and different input.

Sometimes less experienced people will see simpler solutions, for example.

Fresh perspective can be valuable

Developer

Contributor to any aspect of a usable increment each sprint.

Able to plan the work to the goal and execute it.

Product Owner

Accountable for maximising the value of the product resulting from the work of the Scrum Team.

Accountable (not necessarily responsible) for:

Developing and explicitly communicating the product goal

Creating and communicating the Product Backlog

Ordering (prioritising) the product backlog

Ensuring the product backlog is transparent, visible and understood.

Note for Prod Plat context: This doesn’t exist, role mostly falls to SL, but should be shared amongst the team.

Scrum Master

Servant leader. Accountable for establishing Scrum as defined in the Scrum Guide. They do this by helping everyone understand Scrum Theory and practice, both within the Scrum Team and the org.

Coach team members in self-org

Help team focus on high value increments

Cause the removal of impediments

Ensure all Scrum Events are positive, productive & efficient.

Should be active and challenge the team during ceremonies.

Should not be rotated too quickly (if at all), no less than a month at a time. Less than this doesn’t give the chance to make a change.

Backlog Refinement

An ongoing activity. By a very rough estimate, it could take up to 10% of the Sprint capacity.

Backlog refinement should be a discovery exercise, working towards everybody understanding the work/roadmap. Constantly strive to understand what’s coming up for the product. Where are we today? What could we do to improve going forwards? Understand how other people in the company, or other people in the industry, are solving problems.

Creating Sprint-ready backlog items

Re-prioritise

New tickets created

Unnecessary tickets removed

Acceptance criteria added

Larger tickets (or epics) broken up into end-to-end slices.

Tickets are estimated (or thinly sliced to the same size)

A ticket template can be useful. Some ideas for it:

What is the change?

Why are we making it?

Any useful links?

Testing steps?

Acceptance criteria.

Estimation

Watch the Agile Estimation Skyscanner University course for more detail.

Should be quick & painless

Should be collaborative, involving the whole team

The closer the work, the less valuable the estimate is in its own right and the more valuable the conversation is.

Don’t get stuck between 2s and 3s - not a valuable conversation. Remove one from consideration?

Plandek - Look at average time to deliver story points; 2 and 3 points often have no difference.

They are estimates, not quotes, not commitments, not promises or guarantees.

Forecasting progress

Velocity: speed = distance / time

Monte Carlo estimation - probability based forecasting.

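
The two forecasting ideas above can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed method: it treats velocity as story points completed per sprint, and the Monte Carlo part resamples past sprint velocities at random to see how many sprints a given amount of remaining work might take. The function names, the sample history and the choice of the 50th and 85th percentiles are all assumptions made for the example.

```python
import random


def average_velocity(points_per_sprint):
    """Velocity as distance / time: mean points completed per sprint."""
    return sum(points_per_sprint) / len(points_per_sprint)


def simulate_sprints_needed(remaining_points, points_per_sprint, trials=10_000, seed=0):
    """Return a sorted list with the sprint count observed in each trial."""
    if not points_per_sprint or min(points_per_sprint) <= 0:
        raise ValueError("need at least one past sprint, all with positive velocity")
    rng = random.Random(seed)  # fixed seed keeps the sketch repeatable
    outcomes = []
    for _ in range(trials):
        left, sprints = remaining_points, 0
        while left > 0:
            # Assume a future sprint looks like a randomly chosen past sprint.
            left -= rng.choice(points_per_sprint)
            sprints += 1
        outcomes.append(sprints)
    return sorted(outcomes)


def percentile(sorted_outcomes, pct):
    """Value below which roughly pct percent of the trials fall."""
    index = min(len(sorted_outcomes) - 1, len(sorted_outcomes) * pct // 100)
    return sorted_outcomes[index]


if __name__ == "__main__":
    history = [8, 12, 10, 7, 13]  # hypothetical points completed in past sprints
    outcomes = simulate_sprints_needed(40, history)
    print("average velocity:", average_velocity(history))
    print("50% of trials finish within", percentile(outcomes, 50), "sprints")
    print("85% of trials finish within", percentile(outcomes, 85), "sprints")
```

Reporting a percentile instead of a single number keeps the forecast an estimate with a stated likelihood, not a commitment.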
Don’t plan the unplannable

If there are unknowns, use larger brush strokes, refining the brush as we go along.

Combined Explore Question Mark

Use data to understand typical progress and trends

Stakeholders will appreciate this.

Don’t be lured by optimistic/pessimistic tendencies.

Predict based on the data and trends you have.

Burndown charts

Plot expected progress vs actual progress

Not the be-all and end-all, but useful.

Can be used mid-sprint to check if you’re on track.
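
A burndown comparison needs only two series: the expected remaining points for each day and the actual remaining points so far. The sketch below assumes a straight-line ideal (the same amount burned every day) and a hypothetical list of points completed per day. The function names and the numbers are made up for illustration.

```python
def ideal_burndown(total_points, sprint_days):
    """Expected remaining points after each day, starting with day 0."""
    return [total_points - total_points * day / sprint_days
            for day in range(sprint_days + 1)]


def actual_burndown(total_points, completed_per_day):
    """Actual remaining points after each day worked so far."""
    remaining = [total_points]
    for done in completed_per_day:
        remaining.append(remaining[-1] - done)
    return remaining


def is_on_track(total_points, sprint_days, completed_per_day):
    """True if actual remaining work is at or below the ideal line today."""
    day = len(completed_per_day)
    if day > sprint_days:
        raise ValueError("more days recorded than the sprint contains")
    actual = actual_burndown(total_points, completed_per_day)[-1]
    return actual <= ideal_burndown(total_points, sprint_days)[day]


if __name__ == "__main__":
    # Hypothetical 5-day sprint with 20 points committed, checked after day 2.
    print(ideal_burndown(20, 5))
    print(actual_burndown(20, [3, 5]))
    print(is_on_track(20, 5, [3, 5]))
```

Calling `is_on_track` partway through the sprint gives the mid-sprint check mentioned above. A `False` result prompts a conversation about the plan; it does not mean the sprint has failed.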
When Scrum works and when it doesn’t

Scrum works pretty well most of the time.

Works well for complex problems with…

unpredictability

unknown-unknowns

established general direction

Other models such as Kanban can work better for complicated problems:

Predictable aspects

Some unknowns, but mostly understood

Established end state

Could start with Scrum and when problems become complicated rather than complex you could move to Kanban.