
Conversation

Contributor

@oswcab oswcab commented Feb 2, 2026

After migrating from OLM to a Helm deployment, custom-resources.yaml was retained but never updated with the correct selectors. As a result, the Service and ServiceMonitor selected no pods because of a selector mismatch.

The root cause was that the OLM deployment used the instance label 'app.kubernetes.io/instance: cluster', while Helm uses 'app.kubernetes.io/instance: external-secrets-operator'.

To maximize maintainability, this commit takes a hybrid approach: the metrics services are enabled in values.yaml so that Helm creates them with the correct selectors automatically, while the ServiceMonitors are created with a kustomize patch, because 'kustomize --enable-helm' doesn't support the Capabilities.APIVersions check that the Helm chart uses as its condition for enabling them.
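
For illustration, here is a minimal sketch of the two pieces described above. The values keys, resource name, and port name are illustrative and should be checked against the chart and the repo's kustomize layout; the instance label is the one Helm actually applies, as noted above.

```yaml
# values.yaml (sketch): let Helm render the metrics Service itself, so its
# selector always matches the labels Helm applies to the pods.
metrics:
  service:
    enabled: true
---
# ServiceMonitor added via a kustomize resource/patch (sketch), since
# 'kustomize --enable-helm' cannot satisfy the chart's
# Capabilities.APIVersions condition. The selector uses the Helm instance
# label, not the old OLM value 'cluster'.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: external-secrets-metrics        # illustrative name
spec:
  selector:
    matchLabels:
      app.kubernetes.io/instance: external-secrets-operator
  endpoints:
    - port: metrics                     # illustrative port name
```

With this split, Helm owns everything whose selectors must track its own labels, and kustomize only carries the one resource the chart cannot emit under '--enable-helm'.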

Contributes to: KFLUXINFRA-2513

@openshift-ci openshift-ci bot requested review from eisraeli and sadlerap February 2, 2026 18:56
@openshift-ci openshift-ci bot added the approved label Feb 2, 2026
Contributor

github-actions bot commented Feb 2, 2026

🤖 Gemini AI Assistant Available

Hi @oswcab! I'm here to help with your pull request. You can interact with me using the following commands:

Available Commands

  • @gemini-cli /review - Request a comprehensive code review

    • Example: @gemini-cli /review Please focus on security and performance
  • @gemini-cli <your question> - Ask me anything about the codebase

    • Example: @gemini-cli How can I improve this function?
    • Example: @gemini-cli What are the best practices for error handling here?

How to Use

  1. Simply type one of the commands above in a comment on this PR
  2. I'll analyze your code and provide detailed feedback
  3. You can track my progress in the workflow logs

Permissions

Only OWNER, MEMBER, or COLLABORATOR users can trigger my responses. This ensures secure and appropriate usage.


This message was automatically added to help you get started with the Gemini AI assistant. Feel free to delete this comment if you don't need assistance.

Contributor

github-actions bot commented Feb 2, 2026

🤖 Hi @oswcab, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

@oswcab oswcab force-pushed the fix/eso-servicemonitor-selectors branch from 303e63f to 3f354df, February 2, 2026 18:57
Contributor

@hugares hugares left a comment


/lgtm

Contributor Author

oswcab commented Feb 2, 2026

/hold

After migrating from OLM to a Helm deployment, custom-resources.yaml
was retained but never updated with the correct selectors. As a result,
the Service and ServiceMonitor selected no pods because of a selector
mismatch.

The root cause was that the OLM deployment used the instance label
'app.kubernetes.io/instance: cluster', while Helm uses
'app.kubernetes.io/instance: external-secrets-operator'.

To maximize maintainability, this commit takes a hybrid approach: the
metrics services are enabled in values.yaml so that Helm creates them
with the correct selectors automatically, while the ServiceMonitors are
created with a kustomize patch, because 'kustomize --enable-helm'
doesn't support the Capabilities.APIVersions check that the Helm chart
uses as its condition for enabling them.

Contributes to: KFLUXINFRA-2513
@oswcab oswcab force-pushed the fix/eso-servicemonitor-selectors branch from 3f354df to d3c98e3, February 2, 2026 20:22
@openshift-ci openshift-ci bot removed the lgtm label Feb 2, 2026
Contributor

@hugares hugares left a comment


/lgtm

@openshift-ci openshift-ci bot added the lgtm label Feb 2, 2026

openshift-ci bot commented Feb 2, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hugares, oswcab

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@konflux-ci-qe-bot

🤖 Pipeline Failure Analysis

Category: Infrastructure

The pipeline failed due to a DNS resolution error preventing the redhat-appstudio-gather step from connecting to the Kubernetes API server, indicating a failure in the test environment's network infrastructure.

📋 Technical Details

Immediate Cause

The redhat-appstudio-gather step failed because the oc client and other tools were unable to resolve the DNS name for the Kubernetes API server (api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com) using the cluster's internal DNS resolver (172.30.0.10:53). This resulted in "no such host" errors and subsequent connection timeouts.

Contributing Factors

Several "gather" steps, including gather-audit-logs, gather-extra, and gather-must-gather, experienced similar DNS resolution failures and network timeouts when attempting to interact with the cluster's API server. This suggests a systemic issue with network connectivity or DNS services within the test cluster environment, rather than an isolated incident. The redhat-appstudio-e2e step's failure with a process termination error is likely a secondary effect of this underlying infrastructure instability.

Impact

The inability to connect to the Kubernetes API server prevented the necessary diagnostic data from being collected by the redhat-appstudio-gather step. This failure, along with similar failures in other gather steps, indicates that the test environment was not operational, which consequently blocked the successful execution of the end-to-end tests and the completion of the pipeline.
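
One way to confirm this kind of failure from inside the affected cluster is a one-shot lookup against the same resolver, sketched below. The pod name and image are illustrative; the hostname and resolver IP are taken verbatim from the logs in the Evidence section.

```yaml
# dns-debug.yaml (sketch): ask the cluster resolver (172.30.0.10, per the
# logs) to resolve the API server hostname. Under the failure described
# above, nslookup reports the same "no such host" result.
apiVersion: v1
kind: Pod
metadata:
  name: dns-debug                      # illustrative name
spec:
  restartPolicy: Never
  containers:
    - name: nslookup
      image: registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3  # illustrative image
      command:
        - nslookup
        - api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com
        - 172.30.0.10
```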

🔍 Evidence

appstudio-e2e-tests/gather-audit-logs

Category: infrastructure
Root Cause: The must-gather tool failed to collect audit logs due to a DNS resolution error ("no such host") and a network timeout when attempting to connect to the OpenShift API server.

Logs:

artifacts/appstudio-e2e-tests/gather-audit-logs/build-log.txt line 5
[must-gather      ] OUT 2026-02-02T20:27:24.422704923Z Get "https://api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com:6443/apis/image.openshift.io/v1/namespaces/openshift/imagestreams/must-gather": dial tcp: lookup api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host
artifacts/appstudio-e2e-tests/gather-audit-logs/build-log.txt line 20
error getting cluster version: Get "https://api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusterversions/version": dial tcp: lookup api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host
artifacts/appstudio-e2e-tests/gather-audit-logs/build-log.txt line 26
error getting cluster operators: Get "https://api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp: lookup api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host
artifacts/appstudio-e2e-tests/gather-audit-logs/build-log.txt line 38
Error running must-gather collection:
    creating temp namespace: Post "https://api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com:6443/api/v1/namespaces": dial tcp [REDACTED: Public IP (ipv4)]: i/o timeout
artifacts/appstudio-e2e-tests/gather-audit-logs/build-log.txt line 87
error running backup collection: Get "https://api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com:6443/api?timeout=32s": dial tcp: lookup api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host
artifacts/appstudio-e2e-tests/gather-audit-logs/build-log.txt line 110
error getting cluster operators: Get "https://api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp: lookup api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host
artifacts/appstudio-e2e-tests/gather-audit-logs/build-log.txt line 115
error: creating temp namespace: Post "https://api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com:6443/api/v1/namespaces": dial tcp [REDACTED: Public IP (ipv4)]: i/o timeout

appstudio-e2e-tests/gather-extra

Category: infrastructure
Root Cause: The gather-extra step failed due to a DNS resolution error when trying to connect to the Kubernetes API server. This suggests a problem with the network configuration or the DNS service within the cluster environment.

Logs:

appstudio-e2e-tests/gather-extra/artifacts/appstudio-e2e-tests/gather-extra/build-log.txt line 4
E0202 20:27:17.621364      29 memcache.go:265] couldn't get current server API group list: Get "https://api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com:6443/api?timeout=5s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
appstudio-e2e-tests/gather-extra/artifacts/appstudio-e2e-tests/gather-extra/build-log.txt line 5
E0202 20:27:17.632734      29 memcache.go:265] couldn't get current server API group list: Get "https://api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com:6443/api?timeout=5s": dial tcp: lookup api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host
appstudio-e2e-tests/gather-extra/artifacts/appstudio-e2e-tests/gather-extra/build-log.txt line 11
Unable to connect to the server: dial tcp: lookup api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host

appstudio-e2e-tests/gather-must-gather

Category: infrastructure
Root Cause: The failure was caused by network connectivity issues, specifically an I/O timeout when connecting to the Kubernetes API server and subsequent DNS resolution failures, preventing the must-gather tool from collecting data.

Logs:

artifacts/appstudio-e2e-tests/gather-must-gather/must-gather.log line 12
Error running must-gather collection:
    creating temp namespace: Post "https://api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com:6443/api/v1/namespaces": dial tcp [REDACTED: Public IP (ipv4)]: i/o timeout
artifacts/appstudio-e2e-tests/gather-must-gather/must-gather.log line 22
E0202 20:27:05.495764      57 memcache.go:265] couldn't get current server API group list: Get "https://api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com:6443/api?timeout=32s": dial tcp: lookup api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host
artifacts/appstudio-e2e-tests/gather-must-gather/must-gather.log line 38
error running backup collection: Get "https://api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com:6443/api?timeout=32s": dial tcp: lookup api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host
artifacts/appstudio-e2e-tests/gather-must-gather/must-gather.log line 39
error: creating temp namespace: Post "https://api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com:6443/api/v1/namespaces": dial tcp [REDACTED: Public IP (ipv4)]: i/o timeout

appstudio-e2e-tests/redhat-appstudio-e2e

Category: infrastructure
Root Cause: The e2e tests failed because the main process (make ci/test/e2e) was terminated unexpectedly, likely due to an external signal or an unhandled error, leading to a cascade of entrypoint errors related to process termination.

appstudio-e2e-tests/redhat-appstudio-gather

Category: infrastructure
Root Cause: The primary cause of failure is a DNS resolution issue preventing the oc client from connecting to the Kubernetes API server. The hostname api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com could not be resolved.

Logs:

artifacts/appstudio-e2e-tests__redhat-appstudio-gather/step.log line 77
E0202 20:28:01.096506      55 memcache.go:265] couldn't get current server API group list: Get "https://api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com:6443/api?timeout=5s": dial tcp: lookup api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host
artifacts/appstudio-e2e-tests__redhat-appstudio-gather/step.log line 1241
Unable to connect to the server: dial tcp: lookup api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host
artifacts/appstudio-e2e-tests__redhat-appstudio-gather/step.log line 1979
Error running must-gather collection:
    creating temp namespace: Post "https://api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com:6443/api/v1/namespaces": dial tcp: lookup api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host
artifacts/appstudio-e2e-tests__redhat-appstudio-gather/step.log line 2320
E0202 20:28:02.320784    1138 memcache.go:265] couldn't get current server API group list: Get "https://api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com:6443/api?timeout=32s": dial tcp: lookup api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host

Analysis powered by prow-failure-analysis | Build: 2018398374465638400

Contributor Author

oswcab commented Feb 2, 2026

/unhold

Contributor Author

oswcab commented Feb 2, 2026

/retest

@konflux-ci-qe-bot

🤖 Pipeline Failure Analysis

Category: Timeout

The appstudio-e2e-tests job failed due to a timeout, likely caused by underlying cluster instability and synchronization issues within Argo CD and Tekton components, which prevented the tests from completing within the allocated time.

📋 Technical Details

Immediate Cause

The appstudio-e2e-tests step in the Prow job pull-ci-redhat-appstudio-infra-deployments-main-appstudio-e2e-tests timed out after 2 hours, indicating that the end-to-end test suite could not complete its execution within the allowed duration.

Contributing Factors

Analysis of the provided context reveals several potential contributing factors to the test execution failure:

  • Argo CD Synchronization Issues: Several Argo CD ApplicationSets are in an 'OutOfSync' state, and one critical component (build-service-in-cluster-local) is 'Degraded'. This indicates that the cluster's desired state is not being consistently applied, potentially affecting the services the e2e tests interact with.
  • Tekton Component Errors: The TektonConfig and TektonAddon are in error states, specifically related to the readiness of the tkn-cli-serve deployment. Issues with Tekton, the underlying CI/CD engine, can disrupt the execution of the pipelines that the e2e tests are designed to validate.
  • Cluster Instability: The presence of degraded Argo CD applications and synchronization issues across multiple ApplicationSets points towards a general instability or unresponsiveness in the cluster environment where the tests are being executed.

Impact

The timeout of the appstudio-e2e-tests step prevented the completion of the end-to-end validation for the AppStudio infrastructure. This directly impacts the confidence in the deployed infrastructure's stability and functionality, as the tests designed to verify its operational status could not be successfully executed.

🔍 Evidence

appstudio-e2e-tests/redhat-appstudio-e2e

Category: timeout
Root Cause: The appstudio-e2e-tests job timed out, indicating that the end-to-end tests did not complete within the allocated time. This could be due to test instability, resource contention, or an actual failure in the application under test preventing tests from completing.

Logs:

artifacts/appstudio-e2e-tests/redhat-appstudio-e2e/build-log.txt line 1214
{"component":"entrypoint","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:169","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 2h0m0s timeout","severity":"error","time":"2026-02-02T22:25:23Z"}
artifacts/appstudio-e2e-tests/redhat-appstudio-e2e/build-log.txt line 1216
{"component":"entrypoint","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:267","func":"sigs.k8s.io/prow/pkg/entrypoint.gracefullyTerminate","level":"error","msg":"Process did not exit before 15s grace period","severity":"error","time":"2026-02-02T22:25:38Z"}

Analysis powered by prow-failure-analysis | Build: 2018419758877118464

Contributor Author

oswcab commented Feb 2, 2026

/test appstudio-e2e-tests

@konflux-ci-qe-bot

🤖 Pipeline Failure Analysis

Category: Timeout

The appstudio-e2e-tests step in the appstudio-e2e-tests job timed out due to exceeding the maximum allowed execution time.

📋 Technical Details

Immediate Cause

The appstudio-e2e-tests/redhat-appstudio-e2e step failed because it exceeded the 2-hour timeout limit set for its execution. The process did not finish within the allocated time and was terminated.

Contributing Factors

The additional_context reveals several potential contributing factors that may have led to the extended test execution time. These include:

  • Multiple Argo CD ApplicationSet resources being in an OutOfSync state, indicating potential deployment or synchronization issues.
  • A build-service-controller-manager deployment showing a Degraded health status.
  • The TektonAddon resource being in an Error state due to the tkn-cli-serve deployment not being ready, and TektonConfig reporting Error status for ComponentsReady and Ready conditions. These point to underlying issues with Tekton component availability or configuration, which could impact test execution.

Impact

The timeout failure prevented the completion of the end-to-end tests for the AppStudio infrastructure deployments. This means that the pipeline could not validate the functionality and stability of the deployed components, potentially delaying the integration of changes from PR #10345.

🔍 Evidence

appstudio-e2e-tests/redhat-appstudio-e2e

Category: timeout
Root Cause: The end-to-end tests exceeded the maximum execution time limit of 2 hours, causing the step to time out and fail.

Logs:

artifacts/appstudio-e2e-tests/redhat-appstudio-e2e/steps-logs/step-logs.txt line 605
{"component":"entrypoint","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:169","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 2h0m0s timeout","severity":"error","time":"2026-02-03T01:05:01Z"}
artifacts/appstudio-e2e-tests/redhat-appstudio-e2e/steps-logs/step-logs.txt line 609
{"component":"entrypoint","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:267","func":"sigs.k8s.io/prow/pkg/entrypoint.gracefullyTerminate","level":"error","msg":"Process did not exit before 15s grace period","severity":"error","time":"2026-02-03T01:05:16Z"}

Analysis powered by prow-failure-analysis | Build: 2018459904871763968


openshift-ci bot commented Feb 3, 2026

@oswcab: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
--- | --- | --- | --- | ---
ci/prow/appstudio-e2e-tests | d3c98e3 | link | true | /test appstudio-e2e-tests

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
