
Conversation

Contributor

@oswcab oswcab commented Feb 2, 2026

After migrating from OLM to a Helm deployment, custom-resources.yaml was retained but never updated with the correct selectors. As a result, the Service and ServiceMonitor selected no pods because of a selector mismatch.

The root cause was that the OLM deployment used the instance label 'app.kubernetes.io/instance: cluster', while Helm uses 'app.kubernetes.io/instance: external-secrets-operator'.

To maximize maintainability, this commit takes a hybrid approach: the metrics services are enabled in values.yaml so that Helm creates them with the correct selectors automatically, while the ServiceMonitors are created with a kustomize patch, because 'kustomize --enable-helm' doesn't support the Capabilities.APIVersions check that the Helm chart uses as its condition for enabling them.
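
For illustration, here is a minimal sketch of the two pieces described above. The values keys, resource name, and port name are illustrative and should be checked against the chart and the repo's kustomize layout; the instance label is the one Helm actually applies, as noted above.

```yaml
# values.yaml (sketch): let Helm render the metrics Service itself, so its
# selector always matches the labels Helm applies to the pods.
metrics:
  service:
    enabled: true
---
# ServiceMonitor added via a kustomize resource/patch (sketch), since
# 'kustomize --enable-helm' cannot satisfy the chart's
# Capabilities.APIVersions condition. The selector uses the Helm instance
# label, not the old OLM value 'cluster'.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: external-secrets-metrics        # illustrative name
spec:
  selector:
    matchLabels:
      app.kubernetes.io/instance: external-secrets-operator
  endpoints:
    - port: metrics                     # illustrative port name
```

With this split, Helm owns everything whose selectors must track its own labels, and kustomize only carries the one resource the chart cannot emit under '--enable-helm'.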

Contributes to: KFLUXINFRA-2513

@openshift-ci openshift-ci bot requested review from eisraeli and sadlerap February 2, 2026 18:56
@openshift-ci openshift-ci bot added the approved label Feb 2, 2026
Contributor

github-actions bot commented Feb 2, 2026

🤖 Gemini AI Assistant Available

Hi @oswcab! I'm here to help with your pull request. You can interact with me using the following commands:

Available Commands

  • @gemini-cli /review - Request a comprehensive code review

    • Example: @gemini-cli /review Please focus on security and performance
  • @gemini-cli <your question> - Ask me anything about the codebase

    • Example: @gemini-cli How can I improve this function?
    • Example: @gemini-cli What are the best practices for error handling here?

How to Use

  1. Simply type one of the commands above in a comment on this PR
  2. I'll analyze your code and provide detailed feedback
  3. You can track my progress in the workflow logs

Permissions

Only OWNER, MEMBER, or COLLABORATOR users can trigger my responses. This ensures secure and appropriate usage.


This message was automatically added to help you get started with the Gemini AI assistant. Feel free to delete this comment if you don't need assistance.

Contributor

github-actions bot commented Feb 2, 2026

🤖 Hi @oswcab, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

@oswcab oswcab force-pushed the fix/eso-servicemonitor-selectors branch from 303e63f to 3f354df, February 2, 2026 18:57
Contributor

@hugares hugares left a comment


/lgtm

Contributor Author

oswcab commented Feb 2, 2026

/hold

After migrating from OLM to a Helm deployment, custom-resources.yaml
was retained but never updated with the correct selectors. As a result,
the Service and ServiceMonitor selected no pods because of a selector
mismatch.

The root cause was that the OLM deployment used the instance label
'app.kubernetes.io/instance: cluster', while Helm uses
'app.kubernetes.io/instance: external-secrets-operator'.

To maximize maintainability, this commit takes a hybrid approach: the
metrics services are enabled in values.yaml so that Helm creates them
with the correct selectors automatically, while the ServiceMonitors are
created with a kustomize patch, because 'kustomize --enable-helm'
doesn't support the Capabilities.APIVersions check that the Helm chart
uses as its condition for enabling them.

Contributes to: KFLUXINFRA-2513
@oswcab oswcab force-pushed the fix/eso-servicemonitor-selectors branch from 3f354df to d3c98e3, February 2, 2026 20:22
@openshift-ci openshift-ci bot removed the lgtm label Feb 2, 2026
Contributor

@hugares hugares left a comment


/lgtm

@openshift-ci openshift-ci bot added the lgtm label Feb 2, 2026

openshift-ci bot commented Feb 2, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hugares, oswcab

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@konflux-ci-qe-bot

🤖 Pipeline Failure Analysis

Category: Infrastructure

The pipeline failed due to a DNS resolution error preventing the redhat-appstudio-gather step from connecting to the Kubernetes API server, indicating a failure in the test environment's network infrastructure.

📋 Technical Details

Immediate Cause

The redhat-appstudio-gather step failed because the oc client and other tools were unable to resolve the DNS name for the Kubernetes API server (api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com) using the cluster's internal DNS resolver (172.30.0.10:53). This resulted in "no such host" errors and subsequent connection timeouts.

Contributing Factors

Several "gather" steps, including gather-audit-logs, gather-extra, and gather-must-gather, experienced similar DNS resolution failures and network timeouts when attempting to interact with the cluster's API server. This suggests a systemic issue with network connectivity or DNS services within the test cluster environment, rather than an isolated incident. The redhat-appstudio-e2e step's failure with a process termination error is likely a secondary effect of this underlying infrastructure instability.

Impact

The inability to connect to the Kubernetes API server prevented the necessary diagnostic data from being collected by the redhat-appstudio-gather step. This failure, along with similar failures in other gather steps, indicates that the test environment was not operational, which consequently blocked the successful execution of the end-to-end tests and the completion of the pipeline.
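
One way to confirm this kind of failure from inside the affected cluster is a one-shot lookup against the same resolver, sketched below. The pod name and image are illustrative; the hostname and resolver IP are taken verbatim from the logs in the Evidence section.

```yaml
# dns-debug.yaml (sketch): ask the cluster resolver (172.30.0.10, per the
# logs) to resolve the API server hostname. Under the failure described
# above, nslookup reports the same "no such host" result.
apiVersion: v1
kind: Pod
metadata:
  name: dns-debug                      # illustrative name
spec:
  restartPolicy: Never
  containers:
    - name: nslookup
      image: registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3  # illustrative image
      command:
        - nslookup
        - api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com
        - 172.30.0.10
```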

🔍 Evidence

appstudio-e2e-tests/gather-audit-logs

Category: infrastructure
Root Cause: The must-gather tool failed to collect audit logs due to a DNS resolution error ("no such host") and a network timeout when attempting to connect to the OpenShift API server.

Logs:

artifacts/appstudio-e2e-tests/gather-audit-logs/build-log.txt line 5
[must-gather      ] OUT 2026-02-02T20:27:24.422704923Z Get "https://api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com:6443/apis/image.openshift.io/v1/namespaces/openshift/imagestreams/must-gather": dial tcp: lookup api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host
artifacts/appstudio-e2e-tests/gather-audit-logs/build-log.txt line 20
error getting cluster version: Get "https://api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusterversions/version": dial tcp: lookup api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host
artifacts/appstudio-e2e-tests/gather-audit-logs/build-log.txt line 26
error getting cluster operators: Get "https://api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp: lookup api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host
artifacts/appstudio-e2e-tests/gather-audit-logs/build-log.txt line 38
Error running must-gather collection:
    creating temp namespace: Post "https://api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com:6443/api/v1/namespaces": dial tcp [REDACTED: Public IP (ipv4)]: i/o timeout
artifacts/appstudio-e2e-tests/gather-audit-logs/build-log.txt line 87
error running backup collection: Get "https://api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com:6443/api?timeout=32s": dial tcp: lookup api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host
artifacts/appstudio-e2e-tests/gather-audit-logs/build-log.txt line 110
error getting cluster operators: Get "https://api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp: lookup api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host
artifacts/appstudio-e2e-tests/gather-audit-logs/build-log.txt line 115
error: creating temp namespace: Post "https://api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com:6443/api/v1/namespaces": dial tcp [REDACTED: Public IP (ipv4)]: i/o timeout

appstudio-e2e-tests/gather-extra

Category: infrastructure
Root Cause: The gather-extra step failed due to a DNS resolution error when trying to connect to the Kubernetes API server. This suggests a problem with the network configuration or the DNS service within the cluster environment.

Logs:

appstudio-e2e-tests/gather-extra/artifacts/appstudio-e2e-tests/gather-extra/build-log.txt line 4
E0202 20:27:17.621364      29 memcache.go:265] couldn't get current server API group list: Get "https://api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com:6443/api?timeout=5s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
appstudio-e2e-tests/gather-extra/artifacts/appstudio-e2e-tests/gather-extra/build-log.txt line 5
E0202 20:27:17.632734      29 memcache.go:265] couldn't get current server API group list: Get "https://api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com:6443/api?timeout=5s": dial tcp: lookup api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host
appstudio-e2e-tests/gather-extra/artifacts/appstudio-e2e-tests/gather-extra/build-log.txt line 11
Unable to connect to the server: dial tcp: lookup api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host

appstudio-e2e-tests/gather-must-gather

Category: infrastructure
Root Cause: The failure was caused by network connectivity issues, specifically an I/O timeout when connecting to the Kubernetes API server and subsequent DNS resolution failures, preventing the must-gather tool from collecting data.

Logs:

artifacts/appstudio-e2e-tests/gather-must-gather/must-gather.log line 12
Error running must-gather collection:
    creating temp namespace: Post "https://api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com:6443/api/v1/namespaces": dial tcp [REDACTED: Public IP (ipv4)]: i/o timeout
artifacts/appstudio-e2e-tests/gather-must-gather/must-gather.log line 22
E0202 20:27:05.495764      57 memcache.go:265] couldn't get current server API group list: Get "https://api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com:6443/api?timeout=32s": dial tcp: lookup api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host
artifacts/appstudio-e2e-tests/gather-must-gather/must-gather.log line 38
error running backup collection: Get "https://api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com:6443/api?timeout=32s": dial tcp: lookup api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host
artifacts/appstudio-e2e-tests/gather-must-gather/must-gather.log line 39
error: creating temp namespace: Post "https://api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com:6443/api/v1/namespaces": dial tcp [REDACTED: Public IP (ipv4)]: i/o timeout

appstudio-e2e-tests/redhat-appstudio-e2e

Category: infrastructure
Root Cause: The e2e tests failed because the main process (make ci/test/e2e) was terminated unexpectedly, likely due to an external signal or an unhandled error, leading to a cascade of entrypoint errors related to process termination.

appstudio-e2e-tests/redhat-appstudio-gather

Category: infrastructure
Root Cause: The primary cause of failure is a DNS resolution issue preventing the oc client from connecting to the Kubernetes API server. The hostname api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com could not be resolved.

Logs:

artifacts/appstudio-e2e-tests__redhat-appstudio-gather/step.log line 77
E0202 20:28:01.096506      55 memcache.go:265] couldn't get current server API group list: Get "https://api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com:6443/api?timeout=5s": dial tcp: lookup api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host
artifacts/appstudio-e2e-tests__redhat-appstudio-gather/step.log line 1241
Unable to connect to the server: dial tcp: lookup api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host
artifacts/appstudio-e2e-tests__redhat-appstudio-gather/step.log line 1979
Error running must-gather collection:
    creating temp namespace: Post "https://api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com:6443/api/v1/namespaces": dial tcp: lookup api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host
artifacts/appstudio-e2e-tests__redhat-appstudio-gather/step.log line 2320
E0202 20:28:02.320784    1138 memcache.go:265] couldn't get current server API group list: Get "https://api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com:6443/api?timeout=32s": dial tcp: lookup api.konflux-4-17-us-west-2-8vxm4.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host

Analysis powered by prow-failure-analysis | Build: 2018398374465638400

Contributor Author

oswcab commented Feb 2, 2026

/unhold

Contributor Author

oswcab commented Feb 2, 2026

/retest

@konflux-ci-qe-bot

🤖 Pipeline Failure Analysis

Category: Timeout

The appstudio-e2e-tests job failed due to a timeout, likely caused by underlying cluster instability and synchronization issues within Argo CD and Tekton components, which prevented the tests from completing within the allocated time.

📋 Technical Details

Immediate Cause

The appstudio-e2e-tests step in the Prow job pull-ci-redhat-appstudio-infra-deployments-main-appstudio-e2e-tests timed out after 2 hours, indicating that the end-to-end test suite could not complete its execution within the allowed duration.

Contributing Factors

Analysis of the provided context reveals several potential contributing factors to the test execution failure:

  • Argo CD Synchronization Issues: Several Argo CD ApplicationSets are in an 'OutOfSync' state, and one critical component (build-service-in-cluster-local) is 'Degraded'. This indicates that the cluster's desired state is not being consistently applied, potentially affecting the services the e2e tests interact with.
  • Tekton Component Errors: The TektonConfig and TektonAddon are in error states, specifically related to the readiness of the tkn-cli-serve deployment. Issues with Tekton, the underlying CI/CD engine, can disrupt the execution of the pipelines that the e2e tests are designed to validate.
  • Cluster Instability: The presence of degraded Argo CD applications and synchronization issues across multiple ApplicationSets points towards a general instability or unresponsiveness in the cluster environment where the tests are being executed.

Impact

The timeout of the appstudio-e2e-tests step prevented the completion of the end-to-end validation for the AppStudio infrastructure. This directly impacts the confidence in the deployed infrastructure's stability and functionality, as the tests designed to verify its operational status could not be successfully executed.

🔍 Evidence

appstudio-e2e-tests/redhat-appstudio-e2e

Category: timeout
Root Cause: The appstudio-e2e-tests job timed out, indicating that the end-to-end tests did not complete within the allocated time. This could be due to test instability, resource contention, or an actual failure in the application under test preventing tests from completing.

Logs:

artifacts/appstudio-e2e-tests/redhat-appstudio-e2e/build-log.txt line 1214
{"component":"entrypoint","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:169","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 2h0m0s timeout","severity":"error","time":"2026-02-02T22:25:23Z"}
artifacts/appstudio-e2e-tests/redhat-appstudio-e2e/build-log.txt line 1216
{"component":"entrypoint","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:267","func":"sigs.k8s.io/prow/pkg/entrypoint.gracefullyTerminate","level":"error","msg":"Process did not exit before 15s grace period","severity":"error","time":"2026-02-02T22:25:38Z"}

Analysis powered by prow-failure-analysis | Build: 2018419758877118464

Contributor Author

oswcab commented Feb 2, 2026

/test appstudio-e2e-tests

@konflux-ci-qe-bot

🤖 Pipeline Failure Analysis

Category: Timeout

The appstudio-e2e-tests step in the appstudio-e2e-tests job timed out due to exceeding the maximum allowed execution time.

📋 Technical Details

Immediate Cause

The appstudio-e2e-tests/redhat-appstudio-e2e step failed because it exceeded the 2-hour timeout limit set for its execution. The process did not finish within the allocated time and was terminated.

Contributing Factors

The additional_context reveals several potential contributing factors that may have led to the extended test execution time. These include:

  • Multiple Argo CD ApplicationSet resources being in an OutOfSync state, indicating potential deployment or synchronization issues.
  • A build-service-controller-manager deployment showing a Degraded health status.
  • The TektonAddon resource being in an Error state due to the tkn-cli-serve deployment not being ready, and TektonConfig reporting Error status for ComponentsReady and Ready conditions. These point to underlying issues with Tekton component availability or configuration, which could impact test execution.

Impact

The timeout failure prevented the completion of the end-to-end tests for the AppStudio infrastructure deployments. This means that the pipeline could not validate the functionality and stability of the deployed components, potentially delaying the integration of changes from PR #10345.

🔍 Evidence

appstudio-e2e-tests/redhat-appstudio-e2e

Category: timeout
Root Cause: The end-to-end tests exceeded the maximum execution time limit of 2 hours, causing the step to time out and fail.

Logs:

artifacts/appstudio-e2e-tests/redhat-appstudio-e2e/steps-logs/step-logs.txt line 605
{"component":"entrypoint","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:169","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 2h0m0s timeout","severity":"error","time":"2026-02-03T01:05:01Z"}
artifacts/appstudio-e2e-tests/redhat-appstudio-e2e/steps-logs/step-logs.txt line 609
{"component":"entrypoint","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:267","func":"sigs.k8s.io/prow/pkg/entrypoint.gracefullyTerminate","level":"error","msg":"Process did not exit before 15s grace period","severity":"error","time":"2026-02-03T01:05:16Z"}

Analysis powered by prow-failure-analysis | Build: 2018459904871763968


openshift-ci bot commented Feb 3, 2026

@oswcab: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
--- | --- | --- | --- | ---
ci/prow/appstudio-e2e-tests | d3c98e3 | link | true | /test appstudio-e2e-tests

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
