<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Incident Postmortems — AgentBox</title>
<meta name="description" content="Detailed postmortem reports for past AgentBox incidents. Transparency into what went wrong, how we fixed it, and what we're doing to prevent recurrence.">
<style>
*{margin:0;padding:0;box-sizing:border-box}
:root{--bg:#0a0a0f;--surface:#12121a;--card:#1a1a2e;--border:#2a2a4a;--text:#e0e0f0;--muted:#8888aa;--accent:#6c5ce7;--accent2:#a29bfe;--green:#00b894;--orange:#fdcb6e;--red:#e17055;--blue:#74b9ff;--radius:12px}
body{font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,sans-serif;background:var(--bg);color:var(--text);min-height:100vh;line-height:1.6}
a{color:var(--accent2);text-decoration:none}
a:hover{text-decoration:underline}
.top-nav{display:flex;align-items:center;justify-content:space-between;padding:16px 32px;border-bottom:1px solid var(--border);background:var(--surface)}
.top-nav .logo{font-size:1.2rem;font-weight:700;color:var(--text)}
.top-nav .logo span{margin-right:6px}
.top-nav .back{color:var(--muted);font-size:.9rem}
.hero{text-align:center;padding:48px 20px 24px}
.hero h1{font-size:2rem;font-weight:700;margin-bottom:8px}
.hero p{color:var(--muted);max-width:560px;margin:0 auto;font-size:1rem}
.container{max-width:820px;margin:0 auto;padding:0 20px 80px}
/* Filters */
.filters{display:flex;gap:8px;flex-wrap:wrap;margin:24px 0 32px;justify-content:center}
.filter-btn{padding:8px 18px;border-radius:24px;border:1px solid var(--border);background:var(--surface);color:var(--muted);cursor:pointer;font-size:.85rem;transition:all .2s}
.filter-btn:hover,.filter-btn.active{background:var(--accent);color:#fff;border-color:var(--accent)}
/* Severity badge */
.severity{display:inline-block;padding:3px 10px;border-radius:12px;font-size:.75rem;font-weight:600;text-transform:uppercase;letter-spacing:.5px}
.severity.critical{background:rgba(225,112,85,.15);color:var(--red)}
.severity.major{background:rgba(253,203,110,.15);color:var(--orange)}
.severity.minor{background:rgba(0,184,148,.15);color:var(--green)}
/* Incident cards */
.incident-card{background:var(--card);border:1px solid var(--border);border-radius:var(--radius);margin-bottom:16px;overflow:hidden;transition:border-color .2s}
.incident-card:hover{border-color:var(--accent)}
.incident-header{display:flex;align-items:center;justify-content:space-between;padding:20px 24px;cursor:pointer;gap:12px}
.incident-header h3{font-size:1rem;font-weight:600;flex:1}
.incident-meta{display:flex;align-items:center;gap:12px;flex-shrink:0}
.incident-date{color:var(--muted);font-size:.8rem;white-space:nowrap}
.incident-duration{color:var(--blue);font-size:.8rem;white-space:nowrap}
.expand-icon{color:var(--muted);transition:transform .2s;font-size:.8rem}
.incident-card.open .expand-icon{transform:rotate(180deg)}
.incident-body{display:none;padding:0 24px 24px;border-top:1px solid var(--border)}
.incident-card.open .incident-body{display:block}
.pm-section{margin-top:20px}
.pm-section h4{font-size:.85rem;font-weight:600;color:var(--accent2);text-transform:uppercase;letter-spacing:.6px;margin-bottom:8px}
.pm-section p,.pm-section ul{color:var(--text);font-size:.9rem;margin-bottom:4px}
.pm-section ul{padding-left:20px}
.pm-section li{margin-bottom:4px}
.timeline{position:relative;padding-left:24px;margin-top:12px}
.timeline::before{content:'';position:absolute;left:6px;top:4px;bottom:4px;width:2px;background:var(--border)}
.tl-entry{position:relative;margin-bottom:12px;padding-left:16px}
.tl-entry::before{content:'';position:absolute;left:-22px;top:6px;width:10px;height:10px;border-radius:50%;background:var(--accent);border:2px solid var(--bg)}
.tl-entry .tl-time{font-size:.75rem;color:var(--muted);font-weight:600}
.tl-entry .tl-text{font-size:.85rem;color:var(--text)}
/* Impact metrics */
.impact-grid{display:grid;grid-template-columns:repeat(auto-fit,minmax(140px,1fr));gap:10px;margin-top:12px}
.impact-item{background:var(--surface);border:1px solid var(--border);border-radius:8px;padding:12px;text-align:center}
.impact-item .value{font-size:1.2rem;font-weight:700;color:var(--accent2)}
.impact-item .label{font-size:.7rem;color:var(--muted);text-transform:uppercase;letter-spacing:.5px;margin-top:2px}
/* Stats banner */
.stats-bar{display:grid;grid-template-columns:repeat(auto-fit,minmax(160px,1fr));gap:12px;margin:32px 0}
.stat-card{background:var(--card);border:1px solid var(--border);border-radius:var(--radius);padding:20px;text-align:center}
.stat-card .num{font-size:1.6rem;font-weight:700;color:var(--accent2)}
.stat-card .desc{font-size:.8rem;color:var(--muted);margin-top:4px}
.footer{text-align:center;color:var(--muted);font-size:.8rem;padding:32px 20px;border-top:1px solid var(--border)}
@media(max-width:600px){
.top-nav{padding:12px 16px}
.hero h1{font-size:1.5rem}
.incident-header{flex-wrap:wrap}
.incident-meta{width:100%;margin-top:8px}
.impact-grid{grid-template-columns:1fr 1fr}
}
</style>
</head>
<body>
<nav class="top-nav">
<div class="logo"><span>🤖</span> AgentBox</div>
<a href="status-page.html" class="back">← System Status</a>
</nav>
<div class="hero">
<h1>Incident Postmortems</h1>
<p>We believe in transparency. Here's a detailed look at past incidents — what happened, why, and what we did to prevent recurrence.</p>
</div>
<div class="container">
<div class="stats-bar">
<div class="stat-card"><div class="num">99.95%</div><div class="desc">Uptime (last 90 days)</div></div>
<div class="stat-card"><div class="num">4</div><div class="desc">Incidents (last 90 days)</div></div>
<div class="stat-card"><div class="num">27 min</div><div class="desc">Avg. Resolution Time</div></div>
<div class="stat-card"><div class="num">12</div><div class="desc">Action Items Completed</div></div>
</div>
<div class="filters">
<button class="filter-btn active" data-filter="all">All</button>
<button class="filter-btn" data-filter="critical">Critical</button>
<button class="filter-btn" data-filter="major">Major</button>
<button class="filter-btn" data-filter="minor">Minor</button>
</div>
<!-- Incident 1 -->
<div class="incident-card" data-severity="critical">
<div class="incident-header" onclick="toggleIncident(this)">
<h3>API Gateway Outage — Complete Service Disruption</h3>
<div class="incident-meta">
<span class="severity critical">Critical</span>
<span class="incident-duration">⏱ 42 min</span>
<span class="incident-date">Mar 15, 2026</span>
<span class="expand-icon">▼</span>
</div>
</div>
<div class="incident-body">
<div class="pm-section">
<h4>Summary</h4>
<p>A misconfigured rate-limiting rule in the API gateway caused all incoming requests to be rejected with HTTP 429 (Too Many Requests) responses. The issue was triggered by a routine configuration deployment at 14:22 UTC.</p>
</div>
<div class="pm-section">
<h4>Impact</h4>
<div class="impact-grid">
<div class="impact-item"><div class="value">100%</div><div class="label">Users Affected</div></div>
<div class="impact-item"><div class="value">42 min</div><div class="label">Duration</div></div>
<div class="impact-item"><div class="value">~12,400</div><div class="label">Failed Requests</div></div>
<div class="impact-item"><div class="value">0</div><div class="label">Data Loss</div></div>
</div>
</div>
<div class="pm-section">
<h4>Timeline</h4>
<div class="timeline">
<div class="tl-entry"><div class="tl-time">14:22 UTC</div><div class="tl-text">Configuration deployment begins</div></div>
<div class="tl-entry"><div class="tl-time">14:24 UTC</div><div class="tl-text">Monitoring alerts fire — 429 error rate spikes to 100%</div></div>
<div class="tl-entry"><div class="tl-time">14:28 UTC</div><div class="tl-text">On-call engineer acknowledges alert, begins investigation</div></div>
<div class="tl-entry"><div class="tl-time">14:35 UTC</div><div class="tl-text">Root cause identified: rate limit resolved to 0 req/s through a misreferenced YAML anchor</div></div>
<div class="tl-entry"><div class="tl-time">14:41 UTC</div><div class="tl-text">Rollback initiated</div></div>
<div class="tl-entry"><div class="tl-time">15:04 UTC</div><div class="tl-text">Service fully restored, all systems operational</div></div>
</div>
</div>
<div class="pm-section">
<h4>Root Cause</h4>
<p>The rate-limiting configuration used a YAML anchor that referenced a staging value of <code>0</code> instead of the production value of <code>10000</code>. The config validation pipeline did not check for zero-value rate limits.</p>
</div>
<div class="pm-section">
<h4>Action Items</h4>
<ul>
<li>✅ Add config validation rule to reject zero-value rate limits</li>
<li>✅ Implement canary deployments for gateway config changes</li>
<li>✅ Add pre-deployment smoke test that sends a real request</li>
<li>✅ Remove YAML anchors from production config files</li>
</ul>
</div>
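<div class="pm-section">
<h4>Illustrative Sketch: Zero-Value Check</h4>
<p>The first action item above could look roughly like the check below. This is a minimal sketch, not our actual validation pipeline: the config shape, the <code>rateLimitPerSecond</code> field, the route names, and the function name are all assumptions made for illustration.</p>
<pre style="overflow-x:auto;font-size:.8rem"><code>// Hypothetical config shape: { routes: [{ name, rateLimitPerSecond }] }.
// These field names are assumptions, not the real gateway schema.
function findInvalidRateLimits(config) {
  const problems = [];
  for (const route of config.routes) {
    const limit = route.rateLimitPerSecond;
    // Reject anything that is not a positive integer, including 0.
    if (!Number.isInteger(limit) || !(limit > 0)) {
      problems.push(`${route.name}: invalid rate limit ${limit}`);
    }
  }
  return problems;
}

const problems = findInvalidRateLimits({
  routes: [
    { name: 'chat', rateLimitPerSecond: 10000 },
    { name: 'webhooks', rateLimitPerSecond: 0 },
  ],
});
console.log(problems); // one entry, for the 'webhooks' route</code></pre>
<p>A check of this kind would run before deployment and fail the pipeline whenever the returned list is non-empty.</p>
</div>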
</div>
</div>
<!-- Incident 2 -->
<div class="incident-card" data-severity="major">
<div class="incident-header" onclick="toggleIncident(this)">
<h3>Delayed Message Delivery — Telegram Webhook Queue Backup</h3>
<div class="incident-meta">
<span class="severity major">Major</span>
<span class="incident-duration">⏱ 28 min</span>
<span class="incident-date">Mar 8, 2026</span>
<span class="expand-icon">▼</span>
</div>
</div>
<div class="incident-body">
<div class="pm-section">
<h4>Summary</h4>
<p>Messages sent to AgentBox via Telegram experienced delays of 2–8 minutes due to a webhook processing queue backup. The queue worker ran out of memory after processing a series of large image attachments.</p>
</div>
<div class="pm-section">
<h4>Impact</h4>
<div class="impact-grid">
<div class="impact-item"><div class="value">35%</div><div class="label">Users Affected</div></div>
<div class="impact-item"><div class="value">28 min</div><div class="label">Duration</div></div>
<div class="impact-item"><div class="value">2–8 min</div><div class="label">Delay Range</div></div>
<div class="impact-item"><div class="value">0</div><div class="label">Messages Lost</div></div>
</div>
</div>
<div class="pm-section">
<h4>Root Cause</h4>
<p>The webhook worker loaded entire image payloads into memory for processing. A burst of high-resolution images (10+ MB each) pushed the worker's heap to its limit, triggering repeated garbage collection pauses that stalled queue processing.</p>
</div>
<div class="pm-section">
<h4>Action Items</h4>
<ul>
<li>✅ Stream image payloads to disk instead of buffering in memory</li>
<li>✅ Add per-message memory budget with graceful rejection</li>
<li>✅ Implement horizontal auto-scaling for webhook workers</li>
<li>⬜ Add queue depth monitoring with auto-scale trigger</li>
</ul>
</div>
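<div class="pm-section">
<h4>Illustrative Sketch: Per-Message Memory Budget</h4>
<p>The per-message memory budget listed above can be pictured as an admission check that runs before any attachment bytes are read. This is a minimal sketch under stated assumptions: the 5 MB budget, the function name, and the shape of the result object are hypothetical and are not taken from the production worker.</p>
<pre style="overflow-x:auto;font-size:.8rem"><code>// Hypothetical budget; the real limit is not stated in this report.
const MAX_ATTACHMENT_BYTES = 5 * 1024 * 1024;

// Decide from the declared size alone, before any bytes are buffered.
function admitAttachment(declaredBytes, budget = MAX_ATTACHMENT_BYTES) {
  if (typeof declaredBytes !== 'number' || !(declaredBytes >= 0)) {
    return { accepted: false, reason: 'missing or invalid size' };
  }
  if (declaredBytes > budget) {
    return { accepted: false, reason: 'attachment exceeds memory budget' };
  }
  return { accepted: true };
}

console.log(admitAttachment(200 * 1024).accepted);        // true
console.log(admitAttachment(12 * 1024 * 1024).accepted);  // false</code></pre>
<p>Returning a reason instead of throwing lets the caller answer the sender with a clear rejection while the worker keeps processing the rest of the queue.</p>
</div>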
</div>
</div>
<!-- Incident 3 -->
<div class="incident-card" data-severity="minor">
<div class="incident-header" onclick="toggleIncident(this)">
<h3>Dashboard Slowness — Elevated Latency on Analytics Queries</h3>
<div class="incident-meta">
<span class="severity minor">Minor</span>
<span class="incident-duration">⏱ 15 min</span>
<span class="incident-date">Feb 28, 2026</span>
<span class="expand-icon">▼</span>
</div>
</div>
<div class="incident-body">
<div class="pm-section">
<h4>Summary</h4>
<p>The analytics dashboard experienced 3–5x slower load times due to a missing database index on the events table. The issue became noticeable after the events table surpassed 10M rows.</p>
</div>
<div class="pm-section">
<h4>Impact</h4>
<div class="impact-grid">
<div class="impact-item"><div class="value">15%</div><div class="label">Users Affected</div></div>
<div class="impact-item"><div class="value">15 min</div><div class="label">Duration</div></div>
<div class="impact-item"><div class="value">3–5x</div><div class="label">Latency Increase</div></div>
<div class="impact-item"><div class="value">None</div><div class="label">Data Impact</div></div>
</div>
</div>
<div class="pm-section">
<h4>Root Cause</h4>
<p>A composite index on <code>(user_id, created_at)</code> was missing from the events table. Without it, the query planner had to use a sequential scan for dashboard queries, and the cost of that scan became noticeable once the table passed 10M rows.</p>
</div>
<div class="pm-section">
<h4>Action Items</h4>
<ul>
<li>✅ Add missing composite index</li>
<li>✅ Add slow-query alerting (&gt;500 ms) for all dashboard endpoints</li>
<li>⬜ Schedule quarterly index review for high-growth tables</li>
</ul>
</div>
</div>
</div>
<!-- Incident 4 -->
<div class="incident-card" data-severity="major">
<div class="incident-header" onclick="toggleIncident(this)">
<h3>Authentication Failures — OAuth Token Refresh Bug</h3>
<div class="incident-meta">
<span class="severity major">Major</span>
<span class="incident-duration">⏱ 22 min</span>
<span class="incident-date">Feb 14, 2026</span>
<span class="expand-icon">▼</span>
</div>
</div>
<div class="incident-body">
<div class="pm-section">
<h4>Summary</h4>
<p>Users who had been logged in for more than 24 hours were unable to perform actions requiring authentication. On one auth server, the OAuth token refresh endpoint was issuing tokens that the other auth servers rejected as invalid, because of clock skew between the servers.</p>
</div>
<div class="pm-section">
<h4>Impact</h4>
<div class="impact-grid">
<div class="impact-item"><div class="value">22%</div><div class="label">Users Affected</div></div>
<div class="impact-item"><div class="value">22 min</div><div class="label">Duration</div></div>
<div class="impact-item"><div class="value">Auth</div><div class="label">Service Impacted</div></div>
<div class="impact-item"><div class="value">0</div><div class="label">Data Loss</div></div>
</div>
</div>
<div class="pm-section">
<h4>Root Cause</h4>
<p>One of three auth servers had drifted 45 seconds ahead due to a failed NTP sync. Tokens issued by this server had an <code>iat</code> (issued-at) timestamp in the future, causing other servers to reject them as invalid.</p>
</div>
<div class="pm-section">
<h4>Action Items</h4>
<ul>
<li>✅ Add NTP sync monitoring with alerting on &gt;5 s drift</li>
<li>✅ Add 60-second clock skew tolerance to token validation</li>
<li>✅ Implement forced NTP sync on auth server startup</li>
</ul>
</div>
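<div class="pm-section">
<h4>Illustrative Sketch: Clock Skew Tolerance</h4>
<p>The 60-second tolerance listed above amounts to accepting an <code>iat</code> timestamp that is slightly ahead of the validating server's clock. The sketch below shows only that comparison. The function and constant names are assumptions for illustration, and a real validator would also check the signature and expiry.</p>
<pre style="overflow-x:auto;font-size:.8rem"><code>// 60 seconds matches the tolerance listed in the action items.
const CLOCK_SKEW_TOLERANCE_SECONDS = 60;

// iatSeconds: the token's issued-at claim, in seconds since the Unix epoch.
// nowSeconds: the validating server's clock, injectable so the check is testable.
function isIssuedAtAcceptable(iatSeconds, nowSeconds = Math.floor(Date.now() / 1000)) {
  // A token may appear to come from the future by at most the tolerance.
  return nowSeconds + CLOCK_SKEW_TOLERANCE_SECONDS >= iatSeconds;
}

const now = 1700000000;
console.log(isIssuedAtAcceptable(now + 45, now)); // true: 45 s of drift is tolerated
console.log(isIssuedAtAcceptable(now + 90, now)); // false: 90 s exceeds the tolerance</code></pre>
<p>With this tolerance, the 45-second drift described in the root cause would not have caused rejections on its own, and the NTP alerting would still report the drift.</p>
</div>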
</div>
</div>
</div>
<footer class="footer">
<p>AgentBox is committed to transparency. All incidents with user impact are published here.</p>
<p style="margin-top:8px"><a href="index.html">Home</a> · <a href="status-page.html">System Status</a> · <a href="uptime-history.html">Uptime History</a></p>
</footer>
<script>
// Expand or collapse the postmortem card that owns the clicked header.
function toggleIncident(header) {
  const card = header.closest('.incident-card');
  card.classList.toggle('open');
}

// Severity filter: highlight the clicked button and show only matching cards.
document.querySelectorAll('.filter-btn').forEach(btn => {
  btn.addEventListener('click', () => {
    document.querySelectorAll('.filter-btn').forEach(b => b.classList.remove('active'));
    btn.classList.add('active');
    const filter = btn.dataset.filter;
    document.querySelectorAll('.incident-card').forEach(card => {
      const visible = filter === 'all' || card.dataset.severity === filter;
      card.style.display = visible ? '' : 'none';
    });
  });
});
</script>
</body>
</html>