WIP: ingester and querier pods are getting OOMKilled on stone-prod-p02 #10283
base: main
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: olegbet. The full list of commands accepted by this bot can be found here; the pull request process is described here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
🤖 Hi @olegbet, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.
Force-pushed af9039e to 1e8d373 (Compare)
🤖 Pipeline Failure Analysis

Category: Infrastructure

Pipeline failed due to a DNS resolution issue preventing the Prow job from connecting to the OpenShift cluster API server.

📋 Technical Details

Impact: The inability to resolve the cluster's API server hostname prevented the Prow job from collecting essential audit logs and diagnostic data. This fundamental infrastructure failure also led to the premature termination of the e2e test execution, blocking the successful completion of the job.

🔍 Evidence: appstudio-e2e-tests/gather-audit-logs
Change it in all clusters: copy the config to every cluster's folder (see the sketch below), then remove the file from the base and empty-base folders.
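A hypothetical sketch of the end state for one cluster overlay, with the config copied in and no longer inherited from base; the directory layout and file name are assumptions, not taken from this repo:

```yaml
# Hypothetical per-cluster overlay kustomization; the paths and the
# loki-config.yaml file name are assumptions, not from this repo.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
  - loki-config.yaml   # per-cluster copy of the config
```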
/lgtm
```yaml
dir: /var/loki/wal
checkpoint_duration: 5m       # Create checkpoints every 5 minutes
flush_on_shutdown: true       # Ensure data is flushed before shutdown
replay_memory_ceiling: 4GB    # 75% of 6Gi limit - PREVENTS OOM during replay
```
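For context, these fields live under the ingester's wal block in Loki's configuration. A minimal sketch of the surrounding structure, assuming the standard upstream Loki layout; the enabled flag is an assumption, not part of this diff:

```yaml
# Sketch of where the diffed fields sit, assuming the standard upstream
# Loki layout; enabled: true is an assumption, not part of this diff.
ingester:
  wal:
    enabled: true                 # assumed; the WAL must be on for these to apply
    dir: /var/loki/wal
    checkpoint_duration: 5m       # create checkpoints every 5 minutes
    flush_on_shutdown: true       # flush data before shutdown
    replay_memory_ceiling: 4GB    # pause and flush during replay above this
```

When replay crosses the ceiling, the ingester flushes to storage instead of holding everything in memory, which is what keeps a restart from running past the container's memory limit.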
Is the memory limit in the development environment 6Gi?
I see two resources in the production file that have 6Gi memory limits, but those resources in the development environment are set to 512Mi.
Good point, @beerparty. The WAL should probably be tested with the real resource requests in the development environment as well.
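For reference, a hypothetical side-by-side of the two environments being discussed; only the 6Gi and 512Mi values come from this thread, everything else is assumed:

```yaml
# Hypothetical sketch; only the 6Gi and 512Mi values come from this thread.
# production (stone-prod-p02)
resources:
  limits:
    memory: 6Gi      # the diff sizes replay_memory_ceiling: 4GB against this
---
# development
resources:
  limits:
    memory: 512Mi    # far below the 4GB replay ceiling, so the ceiling can
                     # never trigger before the container is OOMKilled
```

Testing the WAL settings against production-sized limits in development would surface that mismatch before it reaches prod.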
🤖 Pipeline Failure Analysis

Category: Timeout

The Red Hat AppStudio end-to-end tests timed out, preventing the successful completion of the Prow job.

📋 Technical Details

Contributing Factors: Analysis of the cluster state reveals several Argo CD Applications and ApplicationSets in an "OutOfSync" or "Missing" state. This indicates potential configuration drift or instability within the deployed applications, which could lead to increased resource consumption or delays during the execution of end-to-end tests that rely on these services.

Impact: The timeout of the end-to-end tests directly blocked the progression of the Prow job, preventing any subsequent steps from executing and leading to an overall job failure. This hinders the verification of the Red Hat AppStudio infrastructure deployment.

🔍 Evidence: appstudio-e2e-tests/redhat-appstudio-e2e
Force-pushed 1e8d373 to 8f5b5fd (Compare)
New changes are detected. LGTM label has been removed.
🤖 Pipeline Failure Analysis

Category: Timeout

The Prow job timed out.

📋 Technical Details

Impact: The timeout prevented the successful execution and completion of the job.

🔍 Evidence: appstudio-e2e-tests/redhat-appstudio-e2e
🤖 Pipeline Failure Analysis

Category: Timeout

📋 Technical Details

Contributing Factors: Several factors within the cluster likely contributed to the extended execution time, among them Argo CD synchronization and Tekton issues.

Impact: The prolonged synchronization and deployment times, caused by the Argo CD and Tekton issues, prevented the E2E tests from executing and completing within the Prow job's timeout limit. This blocked the successful validation of the infrastructure deployment.

🔍 Evidence: appstudio-e2e-tests/redhat-appstudio-e2e
@olegbet: The following test failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Signed-off-by: obetsun <[email protected]>
Assisted-by: Claude
rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED

Signed-off-by: obetsun <[email protected]>
rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED

Signed-off-by: obetsun <[email protected]>
rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED

…tefulSet
Signed-off-by: obetsun <[email protected]>
rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
Force-pushed 4af92a4 to f597d37 (Compare)
Signed-off-by: obetsun <[email protected]>
rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED