
Conversation

@olegbet (Contributor) commented Jan 29, 2026

Signed-off-by: obetsun <[email protected]>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED

openshift-ci bot requested review from ggallen and mafh314 on January 29, 2026 12:44
openshift-ci bot commented Jan 29, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: olegbet

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions (Contributor)

🤖 Gemini AI Assistant Available

Hi @olegbet! I'm here to help with your pull request. You can interact with me using the following commands:

Available Commands

  • @gemini-cli /review - Request a comprehensive code review

    • Example: @gemini-cli /review Please focus on security and performance
  • @gemini-cli <your question> - Ask me anything about the codebase

    • Example: @gemini-cli How can I improve this function?
    • Example: @gemini-cli What are the best practices for error handling here?

How to Use

  1. Simply type one of the commands above in a comment on this PR
  2. I'll analyze your code and provide detailed feedback
  3. You can track my progress in the workflow logs

Permissions

Only OWNER, MEMBER, or COLLABORATOR users can trigger my responses. This ensures secure and appropriate usage.


This message was automatically added to help you get started with the Gemini AI assistant. Feel free to delete this comment if you don't need assistance.

@github-actions (Contributor)

🤖 Hi @olegbet, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

@olegbet force-pushed the stone-prod-p02_ingester_querier_oomkilled branch from af9039e to 1e8d373 on January 29, 2026 13:07
@konflux-ci-qe-bot

🤖 Pipeline Failure Analysis

Category: Infrastructure

Pipeline failed due to a DNS resolution issue preventing the Prow job from connecting to the OpenShift cluster API server.

📋 Technical Details

Immediate Cause

The must-gather, gather-extra, and redhat-appstudio-gather steps failed because they could not resolve the DNS for the Kubernetes API server hostname api.konflux-4-17-us-west-2-bvm48.konflux-qe.devcluster.openshift.com. This resulted in dial tcp: lookup ...: no such host errors.

Contributing Factors

The appstudio-e2e-tests/redhat-appstudio-e2e step was terminated prematurely by an external termination signal. This is likely a secondary effect of the underlying infrastructure instability, possibly timeouts caused by the inability to communicate with the cluster. The supplemental context from the must-gather artifact confirms the pervasive DNS resolution failures.

Impact

The inability to resolve the cluster's API server hostname prevented the Prow job from collecting essential audit logs and diagnostic data. This fundamental infrastructure failure also led to the premature termination of the e2e test execution, blocking the successful completion of the job.

🔍 Evidence

appstudio-e2e-tests/gather-audit-logs

Category: infrastructure
Root Cause: The must-gather tool failed to resolve the DNS for the Kubernetes API server hostname api.konflux-4-17-us-west-2-bvm48.konflux-qe.devcluster.openshift.com, preventing it from connecting and collecting audit logs.

Logs:

artifacts/appstudio-e2e-tests/gather-audit-logs/gather-audit-logs.log line 3
[must-gather      ] OUT 2026-01-29T13:20:57.689731705Z Get "https://api.konflux-4-17-us-west-2-bvm48.konflux-qe.devcluster.openshift.com:6443/apis/image.openshift.io/v1/namespaces/openshift/imagestreams/must-gather": dial tcp: lookup api.konflux-4-17-us-west-2-bvm48.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host
artifacts/appstudio-e2e-tests/gather-audit-logs/gather-audit-logs.log line 18
error getting cluster version: Get "https://api.konflux-4-17-us-west-2-bvm48.konflux-qe.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusterversions/version": dial tcp: lookup api.konflux-4-17-us-west-2-bvm48.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host
artifacts/appstudio-e2e-tests/gather-audit-logs/gather-audit-logs.log line 24
error getting cluster operators: Get "https://api.konflux-4-17-us-west-2-bvm48.konflux-qe.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp: lookup api.konflux-4-17-us-west-2-bvm48.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host
artifacts/appstudio-e2e-tests/gather-audit-logs/gather-audit-logs.log line 36
Error running must-gather collection:
    creating temp namespace: Post "https://api.konflux-4-17-us-west-2-bvm48.konflux-qe.devcluster.openshift.com:6443/api/v1/namespaces": dial tcp: lookup api.konflux-4-17-us-west-2-bvm48.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host
artifacts/appstudio-e2e-tests/gather-audit-logs/gather-audit-logs.log line 70
error getting cluster version: Get "https://api.konflux-4-17-us-west-2-bvm48.konflux-qe.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusterversions/version": dial tcp: lookup api.konflux-4-17-us-west-2-bvm48.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host
artifacts/appstudio-e2e-tests/gather-audit-logs/gather-audit-logs.log line 80
error running backup collection: Get "https://api.konflux-4-17-us-west-2-bvm48.konflux-qe.devcluster.openshift.com:6443/api?timeout=32s": dial tcp: lookup api.konflux-4-17-us-west-2-bvm48.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host
artifacts/appstudio-e2e-tests/gather-audit-logs/gather-audit-logs.log line 88
error: creating temp namespace: Post "https://api.konflux-4-17-us-west-2-bvm48.konflux-qe.devcluster.openshift.com:6443/api/v1/namespaces": dial tcp: lookup api.konflux-4-17-us-west-2-bvm48.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host

appstudio-e2e-tests/gather-extra

Category: infrastructure
Root Cause: The failure is due to a DNS resolution issue preventing the system from reaching the Kubernetes API server. This could be caused by network misconfiguration, DNS server problems, or issues with the cluster's internal networking.

Logs:

artifacts/appstudio-e2e-tests/gather-extra/gather-extra.log line 3
E0129 13:20:44.368907      31 memcache.go:265] couldn't get current server API group list: Get "https://api.konflux-4-17-us-west-2-bvm48.konflux-qe.devcluster.openshift.com:6443/api?timeout=5s": dial tcp: lookup api.konflux-4-17-us-west-2-bvm48.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host
artifacts/appstudio-e2e-tests/gather-extra/gather-extra.log line 9
Unable to connect to the server: dial tcp: lookup api.konflux-4-17-us-west-2-bvm48.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host

appstudio-e2e-tests/gather-must-gather

Category: infrastructure
Root Cause: The oc adm must-gather command failed due to network connectivity issues, specifically i/o timeout errors when attempting to reach the OpenShift API server at api.konflux-4-17-us-west-2-bvm48.konflux-qe.devcluster.openshift.com:6443. This prevented the collection of necessary diagnostic data.

Logs:

artifacts/appstudio-e2e-tests/gather-must-gather/log.txt line 20
Error running must-gather collection:
    creating temp namespace: Post "https://api.konflux-4-17-us-west-2-bvm48.konflux-qe.devcluster.openshift.com:6443/api/v1/namespaces": dial tcp [REDACTED: Public IP (ipv4)]: i/o timeout
artifacts/appstudio-e2e-tests/gather-must-gather/log.txt line 24
Falling back to 'oc adm inspect clusteroperators.v1.config.openshift.io' to collect basic cluster information.
artifacts/appstudio-e2e-tests/gather-must-gather/log.txt line 25
E0129 13:12:17.834318      54 memcache.go:265] couldn't get current server API group list: Get "https://api.konflux-4-17-us-west-2-bvm48.konflux-qe.devcluster.openshift.com:6443/api?timeout=32s": dial tcp [REDACTED: Public IP (ipv4)]: i/o timeout
artifacts/appstudio-e2e-tests/gather-must-gather/log.txt line 31
E0129 13:12:47.849914      54 memcache.go:265] couldn't get current server API group list: Get "https://api.konflux-4-17-us-west-2-bvm48.konflux-qe.devcluster.openshift.com:6443/api?timeout=32s": dial tcp [REDACTED: Public IP (ipv4)]: i/o timeout
artifacts/appstudio-e2e-tests/gather-must-gather/log.txt line 38
E0129 13:13:17.867377      54 memcache.go:265] couldn't get current server API group list: Get "https://api.konflux-4-17-us-west-2-bvm48.konflux-qe.devcluster.openshift.com:6443/api?timeout=32s": dial tcp [REDACTED: Public IP (ipv4)]: i/o timeout
artifacts/appstudio-e2e-tests/gather-must-gather/log.txt line 41
error running backup collection: Get "https://api.konflux-4-17-us-west-2-bvm48.konflux-qe.devcluster.openshift.com:6443/api?timeout=32s": dial tcp: lookup api.konflux-4-17-us-west-2-bvm48.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host
artifacts/appstudio-e2e-tests/gather-must-gather/log.txt line 42
error: creating temp namespace: Post "https://api.konflux-4-17-us-west-2-bvm48.konflux-qe.devcluster.openshift.com:6443/api/v1/namespaces": dial tcp [REDACTED: Public IP (ipv4)]: i/o timeout

appstudio-e2e-tests/redhat-appstudio-e2e

Category: infrastructure
Root Cause: The mage process was terminated prematurely by an external signal, likely due to a timeout or an issue with the underlying infrastructure executing the job.

Logs:

artifacts/appstudio-e2e-tests/redhat-appstudio-e2e/build-log.txt line 450
{"component":"entrypoint","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:173","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.ExecuteProcess","level":"error","msg":"Entrypoint received interrupt: terminated","severity":"error","time":"2026-01-29T13:07:13Z"}
artifacts/appstudio-e2e-tests/redhat-appstudio-e2e/build-log.txt line 452
make: *** [Makefile:25: ci/test/e2e] Terminated
artifacts/appstudio-e2e-tests/redhat-appstudio-e2e/build-log.txt line 454
{"component":"entrypoint","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:267","func":"sigs.k8s.io/prow/pkg/entrypoint.gracefullyTerminate","level":"error","msg":"Process did not exit before 15s grace period","severity":"error","time":"2026-01-29T13:07:28Z"}
artifacts/appstudio-e2e-tests/redhat-appstudio-e2e/build-log.txt line 456
{"component":"entrypoint","error":"os: process already finished","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:269","func":"sigs.k8s.io/prow/pkg/entrypoint.gracefullyTerminate","level":"error","msg":"Could not kill process after grace period","severity":"error","time":"2026-01-29T13:07:28Z"}

appstudio-e2e-tests/redhat-appstudio-gather

Category: infrastructure
Root Cause: The oc commands are failing because they cannot resolve the hostname of the Kubernetes API server. This indicates a network or DNS configuration problem preventing access to the cluster.

Logs:

artifacts/appstudio-e2e-tests/redhat-appstudio-gather/log.txt line 20
E0129 13:21:59.308972      30 memcache.go:265] couldn't get current server API group list: Get "https://api.konflux-4-17-us-west-2-bvm48.konflux-qe.devcluster.openshift.com:6443/api?timeout=5s": dial tcp: lookup api.konflux-4-17-us-west-2-bvm48.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host
artifacts/appstudio-e2e-tests/redhat-appstudio-gather/log.txt line 3786
Unable to connect to the server: dial tcp: lookup api.konflux-4-17-us-west-2-bvm48.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host
artifacts/appstudio-e2e-tests/redhat-appstudio-gather/log.txt line 7712
Error running must-gather collection:
    creating temp namespace: Post "https://api.konflux-4-17-us-west-2-bvm48.konflux-qe.devcluster.openshift.com:6443/api/v1/namespaces": dial tcp: lookup api.konflux-4-17-us-west-2-bvm48.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host

Analysis powered by prow-failure-analysis | Build: 2016855047408717824

@rh-hemartin (Contributor) commented Jan 29, 2026

Change it in all clusters. You can use something like:

parallel cp components/vector-kubearchive-log-collector/production/stone-prod-p02/loki-helm-prod-values.yaml  ::: components/vector-kubearchive-log-collector/production/*

To copy the config to all clusters, then remove the file from the base and empty-base folders.
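
Concretely, the full clean-up could look something like this (a sketch only; the file names under the base and empty-base folders are assumptions here, not verified paths):

parallel cp components/vector-kubearchive-log-collector/production/stone-prod-p02/loki-helm-prod-values.yaml ::: components/vector-kubearchive-log-collector/production/*
git rm components/vector-kubearchive-log-collector/base/loki-helm-prod-values.yaml
git rm components/vector-kubearchive-log-collector/empty-base/loki-helm-prod-values.yaml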

@rh-hemartin (Contributor)

/lgtm

dir: /var/loki/wal
checkpoint_duration: 5m # Create checkpoints every 5 minutes
flush_on_shutdown: true # Ensure data is flushed before shutdown
replay_memory_ceiling: 4GB # below Loki's suggested 75% of the 6Gi limit - PREVENTS OOM during replay


Is the memory limit in the development environment 6Gi?


I see two resources in the production file that have 6Gi memory limits, but those resources in the development environment are set to 512Mi.

@olegbet (Contributor, Author)

Good point, @beerparty. The WAL should probably be tested with the real resource requests in the development environment as well.
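
A minimal sketch of what such a development override could look like (the key names and values-file layout below are assumptions for illustration, not the repo's actual structure):

# hypothetical loki-helm-dev-values.yaml override - key names are assumed
ingester:
  resources:
    requests:
      memory: 6Gi
    limits:
      memory: 6Gi # mirror production so replay_memory_ceiling: 4GB is exercised realistically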

@konflux-ci-qe-bot

🤖 Pipeline Failure Analysis

Category: Timeout

The Red Hat AppStudio end-to-end tests timed out, preventing the successful completion of the Prow job.

📋 Technical Details

Immediate Cause

The appstudio-e2e-tests/redhat-appstudio-e2e step failed due to a timeout. The process exceeded the allocated runtime of 2 hours and did not exit gracefully within the subsequent 15-second grace period.

Contributing Factors

Analysis of the cluster state reveals several Argo CD Applications and ApplicationSets in an "OutOfSync" or "Missing" state. This indicates potential configuration drift or instability within the deployed applications, which could lead to increased resource consumption or delays during the execution of end-to-end tests that rely on these services. Specific examples include tekton-kueue-webhook-rolebinding and squid applications being out of sync, and some ApplicationSets having applications in a "Missing" state.

Impact

The timeout of the end-to-end tests directly blocked the progression of the Prow job, preventing any subsequent steps from executing and leading to an overall job failure. This hinders the verification of the Red Hat AppStudio infrastructure deployment.

🔍 Evidence

appstudio-e2e-tests/redhat-appstudio-e2e

Category: timeout
Root Cause: The end-to-end tests for Red Hat AppStudio exceeded the allocated runtime, leading to a timeout failure. The specific reason for the extended execution time is not evident from the provided logs.

Logs:

artifacts/appstudio-e2e-tests/redhat-appstudio-e2e/step.log line 671
{"component":"entrypoint","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:169","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 2h0m0s timeout","severity":"error","time":"2026-01-29T17:19:45Z"}
artifacts/appstudio-e2e-tests/redhat-appstudio-e2e/step.log line 672
{"component":"entrypoint","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:267","func":"sigs.k8s.io/prow/pkg/entrypoint.gracefullyTerminate","level":"error","msg":"Process did not exit before 15s grace period","severity":"error","time":"2026-01-29T17:20:00Z"}

Analysis powered by prow-failure-analysis | Build: 2016893289873018880

@olegbet changed the title from "ingester and querier pods are getting OOMKilled on stone-prod-p02" to "WIP: ingester and querier pods are getting OOMKilled on stone-prod-p02" on Feb 2, 2026
@olegbet force-pushed the stone-prod-p02_ingester_querier_oomkilled branch from 1e8d373 to 8f5b5fd on February 2, 2026 15:30
openshift-ci bot removed the lgtm label on Feb 2, 2026
openshift-ci bot commented Feb 2, 2026

New changes are detected. LGTM label has been removed.

@konflux-ci-qe-bot

🤖 Pipeline Failure Analysis

Category: Timeout

The Prow job appstudio-e2e-tests timed out during test execution, preventing the completion of the end-to-end test suite.

📋 Technical Details

Immediate Cause

The appstudio-e2e-tests/redhat-appstudio-e2e step in the Prow job exceeded its allocated 2-hour timeout. The process did not complete within the expected timeframe, leading to its termination.

Contributing Factors

While the direct cause is a timeout, the additional_context reveals potential contributing factors. Several Argo CD applications are in a degraded state (build-service-in-cluster-local), and numerous ApplicationSet resources are reporting OutOfSync and Missing health statuses. These conditions suggest that the cluster's state might not be optimal, potentially leading to extended execution times for the end-to-end tests as they interact with these resources. The tektonaddons.json artifact also indicates that the Tekton Addon is not fully ready, with the tkn-cli-serve deployment not ready, which might indirectly affect test execution.

Impact

The timeout prevented the successful execution and completion of the redhat-appstudio-e2e test suite. This means that the end-to-end validation of the AppStudio infrastructure could not be performed, blocking the successful completion of the Prow job and potentially delaying the integration of changes.

🔍 Evidence

appstudio-e2e-tests/redhat-appstudio-e2e

Category: timeout
Root Cause: The test execution timed out. This could be due to the tests taking longer than expected to complete, or an issue with the test environment preventing them from finishing within the allocated time.

Logs:

artifacts/appstudio-e2e-tests/redhat-appstudio-e2e/step.log:1233
{"component":"entrypoint","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:169","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 2h0m0s timeout","severity":"error","time":"2026-02-02T17:34:19Z"}
artifacts/appstudio-e2e-tests/redhat-appstudio-e2e/step.log:1235
{"component":"entrypoint","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:267","func":"sigs.k8s.io/prow/pkg/entrypoint.gracefullyTerminate","level":"error","msg":"Process did not exit before 15s grace period","severity":"error","time":"2026-02-02T17:34:34Z"}

Analysis powered by prow-failure-analysis | Build: 2018346197860749312

@konflux-ci-qe-bot

🤖 Pipeline Failure Analysis

Category: Timeout

The appstudio-e2e-tests job timed out due to underlying issues with Argo CD application synchronization and Tekton component readiness, preventing the E2E tests from completing within the allocated time.

📋 Technical Details

Immediate Cause

The appstudio-e2e-tests/redhat-appstudio-e2e step timed out after exceeding the 2-hour execution limit. This timeout indicates that the processes being run by the step, which are expected to complete within this timeframe, did not finish.

Contributing Factors

Several factors within the cluster likely contributed to the extended execution time:

  • Argo CD Synchronization Issues: Multiple Argo CD applications and ApplicationSets were found in an OutOfSync or Degraded state. Specifically, the build-service-in-cluster-local application is Degraded because its deployment exceeded its progress deadline, and other key ApplicationSets like application-api and build-service are OutOfSync. This suggests that the desired state defined in Git is not being correctly applied or synchronized within the cluster.
  • Tekton Component Readiness: The tektonconfigs.json artifact shows Error conditions for ComponentsReady and Ready types, indicating that TektonAddon components are not in a ready state and require reconciliation. Similarly, tektonaddons.json notes a false InstallerSetReady condition due to the tkn-cli-serve deployment not being ready. These issues with Tekton's core components could impede the execution of pipelines and related processes.

Impact

The prolonged synchronization and deployment times, caused by the Argo CD and Tekton issues, prevented the E2E tests from executing and completing within the Prow job's timeout limit. This blocked the successful validation of the infrastructure deployment.

🔍 Evidence

appstudio-e2e-tests/redhat-appstudio-e2e

Category: timeout
Root Cause: The end-to-end tests timed out because the underlying processes, likely related to Argo CD application synchronization and deployment, took too long to complete. This indicates potential instability or slowness in the test environment's resource provisioning or configuration.

Logs:

artifacts/appstudio-e2e-tests/redhat-appstudio-e2e/latest/build.log:1686
{"component":"entrypoint","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:169","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 2h0m0s timeout","severity":"error","time":"2026-02-02T20:17:57Z"}
artifacts/appstudio-e2e-tests/redhat-appstudio-e2e/latest/build.log:1699
{"component":"entrypoint","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:267","func":"sigs.k8s.io/prow/pkg/entrypoint.gracefullyTerminate","level":"error","msg":"Process did not exit before 15s grace period","severity":"error","time":"2026-02-02T20:18:12Z"}

Analysis powered by prow-failure-analysis | Build: 2018384093615493120

openshift-ci bot commented Feb 2, 2026

@olegbet: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/appstudio-e2e-tests
Commit: 4af92a4
Required: true
Rerun command: /test appstudio-e2e-tests


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Signed-off-by: obetsun <[email protected]>
Assisted-by: Claude

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
Signed-off-by: obetsun <[email protected]>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
Signed-off-by: obetsun <[email protected]>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
…tefulSet

Signed-off-by: obetsun <[email protected]>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
@olegbet force-pushed the stone-prod-p02_ingester_querier_oomkilled branch from 4af92a4 to f597d37 on February 3, 2026 09:06