---
title: event-ttl
authors:
  - "@tjungblu"
  - "CursorAI"
reviewers:
  - benluddy
  - p0lyn0mial
approvers:
  - sjenning
api-approvers:
  - JoelSpeed
creation-date: 2025-10-08
last-updated: 2025-10-08
tracking-link:
  - https://issues.redhat.com/browse/OCPSTRAT-2095
  - https://issues.redhat.com/browse/CNTRLPLANE-1539
  - https://github.com/openshift/api/pull/2520
status: proposed
see-also:
replaces:
superseded-by:
---

# Event TTL Configuration

## Summary

This enhancement adds a configuration option to the operator API for the event-ttl setting of the kube-apiserver. The event-ttl setting controls how long events are retained in etcd before being automatically deleted.

Currently, OpenShift uses a default event-ttl of 3 hours (180 minutes), while upstream Kubernetes defaults to 1 hour. This enhancement allows customers to configure the value to match their specific requirements, within a range of 5 to 180 minutes; the default remains 180 minutes (3 hours).

## Motivation

The event-ttl setting in kube-apiserver controls the retention period for events in etcd. Events are automatically deleted after this duration to prevent etcd from growing indefinitely. Different customers have different requirements for event retention:

- Some customers need longer retention for compliance or debugging purposes
- Others may want shorter retention to reduce etcd storage usage
- The current fixed value of 3 hours may not suit all use cases

The maximum value of 3 hours (180 minutes) was chosen to align with the current OpenShift default value. While upstream Kubernetes uses 1 hour as the default, OpenShift's 3-hour default was established to support CI runs that may need to retain events for the entire duration of a test run. For customer use cases, the 3-hour maximum provides sufficient retention for compliance and debugging needs, while the 1-hour upstream default would be more appropriate for general customer workloads.

### Goals

1. Allow customers to configure the event-ttl setting for kube-apiserver through the OpenShift API
2. Provide a reasonable range of values (5 minutes to 3 hours) that covers most customer needs
3. Maintain backward compatibility with the current default of 3 hours (180 minutes)
4. Ensure the configuration is properly validated and applied

### Non-Goals

- Changing the default event-ttl value (it will remain 3 hours/180 minutes)
- Supporting event-ttl values outside the recommended range (5-180 minutes)
- Modifying the underlying etcd compaction behavior beyond what the event-ttl setting provides

## Proposal

We propose to add an `eventTTLMinutes` field to the operator API that allows customers to configure the event-ttl setting for kube-apiserver.

### User Stories

#### Story 1: Storage Optimization

As a cluster administrator with limited etcd storage, I want to configure a shorter event retention period so that I can reduce etcd storage usage while maintaining sufficient event history for troubleshooting. Event data can consume significant etcd storage over time, and reducing the retention period can help manage storage growth.

#### Story 2: Default Behavior

As a cluster administrator, I want the current default behavior to be preserved so that existing clusters continue to work without changes.

### API Extensions

This enhancement modifies the operator API by adding a new `eventTTLMinutes` field.

### Workflow Description

The workflow for configuring event-ttl is straightforward:

1. **Cluster Administrator** accesses the OpenShift cluster via CLI or web console
2. **Cluster Administrator** edits the operator configuration resource
3. **Cluster Administrator** sets the `eventTTLMinutes` field to the desired value in minutes (e.g., 60, 180)
4. **kube-apiserver-operator** detects the configuration change
5. **kube-apiserver-operator** updates the kube-apiserver deployment with the new configuration
6. **kube-apiserver** restarts with the new event-ttl setting
7. **etcd** begins using the new event retention policy for future events

The configuration change takes effect immediately for new events, while existing events continue to use their original TTL until they expire.
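
For illustration, the change amounts to setting a single field on the cluster-scoped operator resource; a minimal sketch (the full resource is shown under Implementation Details below):

```yaml
# Illustrative edit, e.g. via `oc edit kubeapiserver cluster`:
# retain events for one hour instead of the 3-hour default.
spec:
  eventTTLMinutes: 60
```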

### Topology Considerations

#### Hypershift / Hosted Control Planes

This enhancement does not apply to HyperShift. HyperShift uses the upstream Kubernetes default of 1 hour for event-ttl, and there have been no significant requests from HyperShift users to modify this configuration. The 3-hour default in OpenShift was established to support internal CI processes, which are not applicable to HyperShift deployments.

#### Standalone Clusters

This enhancement is fully applicable to standalone OpenShift clusters. The event-ttl configuration will be applied to the kube-apiserver running in the control plane, affecting event retention in the cluster's etcd.

#### Single-node Deployments or MicroShift

For single-node OpenShift (SNO) deployments, this enhancement works as expected: the event-ttl configuration is applied to the kube-apiserver running on the single node.

For MicroShift, this enhancement is not directly applicable, as MicroShift uses a different architecture and may not expose the same event-ttl configuration options. However, if MicroShift adopts similar event management, the same principles would apply.

### Implementation Details/Notes/Constraints

The proposed API looks like this:

```yaml
apiVersion: operator.openshift.io/v1
kind: KubeAPIServer
metadata:
  name: cluster
spec:
  eventTTLMinutes: 60 # Integer value in minutes, e.g., 60, 180
```

The `eventTTLMinutes` field is an integer value in minutes and is validated to ensure it falls within the required range of 5-180 minutes. In the upstream Kubernetes API server configuration, `event-ttl` is typically set as a standalone parameter, so placing `eventTTLMinutes` directly under the operator spec without additional nesting maintains consistency with upstream patterns.
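
While the exact wiring is an implementation detail of the kube-apiserver-operator, the expectation is that the minute value is rendered into the kube-apiserver's `--event-ttl` duration flag. A minimal sketch of what the rendered arguments could look like for `eventTTLMinutes: 60` (the `apiServerArguments` shape shown here is an assumption based on the operator's existing configuration pattern):

```yaml
# Hypothetical rendered configuration; kube-apiserver accepts
# --event-ttl as a duration string, so 60 minutes becomes "60m".
apiServerArguments:
  event-ttl:
    - "60m"
```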

The API design is based on the changes in [openshift/api PR #2520](https://github.com/openshift/api/pull/2520), which includes:

```go
type KubeAPIServerSpec struct {
	StaticPodOperatorSpec `json:",inline"`

	// eventTTLMinutes specifies the amount of time that the events are stored before being deleted.
	// The TTL is allowed between 5 minutes minimum up to a maximum of 180 minutes (3 hours).
	//
	// Lowering this value will reduce the storage required in etcd but will increase CPU usage due to
	// more frequent etcd compaction operations. Note that this setting will only apply to new events
	// being created and will not update existing events.
	//
	// When omitted this means no opinion, and the platform is left to choose a reasonable default, which is subject to change over time.
	// The current default value is 3h (180 minutes).
	//
	// +kubebuilder:validation:Minimum=5
	// +kubebuilder:validation:Maximum=180
	// +optional
	EventTTLMinutes int32 `json:"eventTTLMinutes,omitempty"`
}
```

### Impact of Lower TTL Values

Setting the event-ttl to values lower than the upstream default of 1 hour will primarily impact:

1. **etcd Compaction Bandwidth**: With events expiring faster, etcd needs more bandwidth to remove expired events.

2. **etcd CPU Usage**: More frequent compaction operations will increase CPU usage on etcd nodes, as the compaction process requires CPU cycles to identify and remove expired events.

3. **Event Availability**: Events will be deleted more quickly, reducing the time window available for debugging and troubleshooting.

The main driver of this impact is that with faster-expiring events, the system must delete events much more frequently, increasing the overhead of the cleanup process.
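
As a rough, back-of-the-envelope illustration (the event rate and object size here are assumptions, not measurements): a cluster emitting 10 events per second at roughly 1 KiB each retains about 10 × 1 KiB × 10,800 s ≈ 105 MiB of live event data at the 180-minute default, versus about 17.5 MiB at a 30-minute TTL, a roughly sixfold reduction in steady-state event storage traded against a shorter debugging window.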

### Risks and Mitigations

**Risk**: Customers might set extremely low values that could impact etcd performance.

**Mitigation**: The API validation ensures values are within a reasonable range (5-180 minutes).

### Drawbacks

- Adds complexity to the configuration API
- Additional validation and error handling required

## Alternatives (Not Implemented)

1. **Hardcoded Values**: Keep the current fixed value of 3 hours
   - **Rejected**: Does not meet customer requirements for configurability

2. **Environment Variable**: Use environment variables instead of API configuration
   - **Rejected**: Less user-friendly and harder to manage

3. **Separate CRD**: Create a separate CRD for event configuration
   - **Rejected**: Overkill for a single setting; better to include it in the existing KubeAPIServer operator resource

## Test Plan

**Note:** *Section not required until targeted at a release.*

The test plan will include:

1. **Unit Tests**: Test the API validation and parsing logic
2. **Integration Tests**: Test that the configuration is properly applied to kube-apiserver
3. **E2E Tests**: Test that events are properly deleted after the configured TTL
4. **Performance Tests**: Test the impact of different TTL values on etcd performance

## Graduation Criteria

### Dev Preview -> Tech Preview

- API is implemented and validated
- Basic functionality works end-to-end
- Documentation is available
- Sufficient test coverage

### Tech Preview -> GA

- More comprehensive testing (upgrade, downgrade, scale)
- Performance testing with various TTL values
- User feedback incorporated
- Documentation updated in openshift-docs

### Removing a deprecated feature

This enhancement does not remove any existing features. It only adds a new configuration option while maintaining backward compatibility with the existing default behavior.

## Upgrade / Downgrade Strategy

### Upgrade Strategy

- Existing clusters will continue to use the default 3-hour (180-minute) TTL
- No changes required for existing clusters
- New configuration option is available immediately

### Downgrade Strategy

- Configuration will be ignored by older versions
- No impact on cluster functionality
- Events will revert to the default TTL (180 minutes)

## Version Skew Strategy

- The event-ttl setting is a kube-apiserver configuration
- No coordination required with other components
- Version skew is not a concern for this enhancement

## Operational Aspects of API Extensions

This enhancement modifies the operator API but does not add new API extensions. The impact is limited to:

- Configuration validation in the kube-apiserver-operator
- Application of the setting to the kube-apiserver deployment
- No impact on API availability or performance

## Support Procedures

### Detection

- Configuration can be verified by checking the operator configuration resource
- kube-apiserver logs will show the configured event-ttl value
- etcd metrics can be monitored for compaction frequency

### Troubleshooting

- If events are not being deleted as expected, check the event-ttl configuration
- Monitor etcd compaction metrics for unusual patterns

## Implementation History

- 2025-10-08: Initial enhancement proposal