Commit 964948e

Merge pull request #10 from appwrite/incident-2024-07-02
incident-2024-07-02
2 parents 0d96053 + 0bc2f75 commit 964948e

1 file changed: +32 -0 lines changed

2024/07/02/readme.md

@@ -0,0 +1,32 @@
# Database Cluster Failure

### Date and Time:
**Incident Start**: 2024-07-02 06:00 UTC
**Incident End**: 2024-07-02 11:00 UTC
**Report Prepared By**: **Shimon Newman**

### Summary
One of the MySQL database clusters restarted due to high memory consumption.
During the MySQL restart on the cluster's replica instances, data corruption was detected on both replica nodes.
The data was restored from a backup.

### Incident Details:
**Initial Detection**: A database monitoring heartbeat notification was missed.
**Affected Components**: One of the MySQL database clusters.
**User Impact**: The service was down while the primary node was restarted (about 1.5 hours).

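A missed-heartbeat check of the kind that detected this incident can be sketched as follows. This is a minimal illustration, not the monitoring system actually in use: the interval, the number of tolerated missed beats, and the function name `heartbeat_missed` are all assumptions, since the report does not describe them.

```python
from datetime import datetime, timedelta

# Hypothetical values: the report does not state the heartbeat interval
# or how many beats may be missed before a notification is raised.
HEARTBEAT_INTERVAL = timedelta(seconds=60)
MISSED_BEATS_BEFORE_ALERT = 2


def heartbeat_missed(last_seen: datetime, now: datetime) -> bool:
    """Return True when no heartbeat arrived within the allowed window."""
    allowed = HEARTBEAT_INTERVAL * MISSED_BEATS_BEFORE_ALERT
    return now - last_seen > allowed
```

Tolerating more than one missed beat trades detection speed for fewer false alarms on a single delayed notification.
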
### Root Cause Analysis:
**Preliminary Findings**: High memory consumption on all nodes of one MySQL database cluster.
**Investigation**: In a rare situation, all nodes of one MySQL database cluster restarted simultaneously due to high memory consumption.
Service restarts were performed on all database nodes.
Once the cluster finished loading, the replicas reported data corruption errors.

### Resolution and Recovery:
- **Immediate Actions**: Both replicas were immediately restored from backups.

### Lessons Learned:
- **What Went Well**: The immediate response ensured that the replicas were restored to their previous state.
- **What Could Be Improved**: Monitoring of high memory usage.
- **Action Items**:
  - Add memory consumption alerts.
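
The memory consumption alert in the action items could look roughly like the sketch below. It is only an illustration under stated assumptions: the thresholds and function names are hypothetical, and it measures host-level memory from Linux `/proc/meminfo`-style text rather than anything MySQL-specific, which the report does not detail.

```python
from typing import Optional

# Hypothetical thresholds; the report does not specify any.
WARN_FRACTION = 0.80
CRIT_FRACTION = 0.90


def used_fraction(meminfo_text: str) -> float:
    """Compute the fraction of memory in use from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            fields[name.strip()] = int(parts[0])  # values are reported in kB
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return (total - available) / total


def alert_level(fraction: float) -> Optional[str]:
    """Return 'critical', 'warning', or None for a memory-use fraction."""
    if fraction >= CRIT_FRACTION:
        return "critical"
    if fraction >= WARN_FRACTION:
        return "warning"
    return None
```

On a Linux database node, the input text could be read from `/proc/meminfo` on a schedule and any non-`None` level forwarded to the existing notification channel. A warning level below the critical one leaves time to act before the node is restarted.
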