
[KFLUXINFRA-2805] Enable opentelemetry log collection on MPC Vm's on prod #10388

Open
mshaposhnik wants to merge 3 commits into redhat-appstudio:main from mshaposhnik:logcollector_prod


@mshaposhnik
Contributor

Turn on MPC log collection on production.
Fixes: https://issues.redhat.com/browse/KFLUXINFRA-2805

Signed-off-by: Max Shaposhnyk <mshaposh@redhat.com>
Signed-off-by: Max Shaposhnyk <mshaposh@redhat.com>
@openshift-ci openshift-ci bot requested review from hugares and meyrevived February 5, 2026 08:37
@openshift-ci

openshift-ci bot commented Feb 5, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mshaposhnik

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label Feb 5, 2026
@github-actions
Contributor

github-actions bot commented Feb 5, 2026

🤖 Gemini AI Assistant Available

Hi @mshaposhnik! I'm here to help with your pull request. You can interact with me using the following commands:

Available Commands

  • @gemini-cli /review - Request a comprehensive code review

    • Example: @gemini-cli /review Please focus on security and performance
  • @gemini-cli <your question> - Ask me anything about the codebase

    • Example: @gemini-cli How can I improve this function?
    • Example: @gemini-cli What are the best practices for error handling here?

How to Use

  1. Simply type one of the commands above in a comment on this PR
  2. I'll analyze your code and provide detailed feedback
  3. You can track my progress in the workflow logs

Permissions

Only OWNER, MEMBER, or COLLABORATOR users can trigger my responses. This ensures secure and appropriate usage.


This message was automatically added to help you get started with the Gemini AI assistant. Feel free to delete this comment if you don't need assistance.

@github-actions
Contributor

github-actions bot commented Feb 5, 2026

🤖 Hi @mshaposhnik, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

Signed-off-by: Max Shaposhnyk <mshaposh@redhat.com>
@konflux-ci-qe-bot

🤖 Pipeline Failure Analysis

Category: Infrastructure

The E2E tests failed because the cluster's Kubernetes API server was unreachable due to DNS resolution errors and network timeouts, preventing essential diagnostic and testing steps from executing.

📋 Technical Details

Immediate Cause

Multiple steps, including gather-audit-logs, gather-extra, gather-must-gather, and redhat-appstudio-gather, failed because they could not resolve the hostname of the Kubernetes API server (api.konflux-4-17-us-west-2-jrr82.konflux-qe.devcluster.openshift.com). This is evidenced by repeated "dial tcp: lookup ... on 172.30.0.10:53: no such host" errors in the logs. Additionally, the gather-must-gather step experienced "i/o timeout" errors when attempting to connect to cluster APIs.
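
The two failure signatures described above (a failed DNS lookup versus a connection timeout) can be told apart mechanically. The following is only a rough sketch, not part of the analysis tooling: the regular expressions are derived from the error text quoted in this comment, and the host name `api.example.test` and address `192.0.2.1` in the sample lines are hypothetical placeholders.

```python
import re

# Patterns derived from the error strings quoted in this analysis (assumption:
# other Go network errors may be worded differently and would land in "other").
DNS_FAILURE = re.compile(
    r"dial tcp: lookup (?P<host>\S+) on (?P<resolver>\S+): no such host"
)
IO_TIMEOUT = re.compile(r"dial tcp .*: i/o timeout")


def classify(line: str) -> str:
    """Return 'dns', 'timeout', or 'other' for a single log line."""
    if DNS_FAILURE.search(line):
        return "dns"
    if IO_TIMEOUT.search(line):
        return "timeout"
    return "other"


# Hypothetical sample lines shaped like the ones in the evidence section.
sample = [
    "Unable to connect to the server: dial tcp: lookup api.example.test "
    "on 172.30.0.10:53: no such host",
    'creating temp namespace: Post "https://api.example.test:6443/api/v1/namespaces": '
    "dial tcp 192.0.2.1:6443: i/o timeout",
    "make: *** [Makefile:25: ci/test/e2e] Terminated",
]

counts = {}
for line in sample:
    kind = classify(line)
    counts[kind] = counts.get(kind, 0) + 1

print(counts)
```

Counting lines per category in each step's log gives a quick view of whether a step never resolved the API host at all or resolved it and then failed to connect.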

Contributing Factors

The appstudio-e2e-tests/redhat-appstudio-e2e step itself terminated unexpectedly with a "Terminated" signal, followed by the messages "Process did not exit before 15s grace period" and "Could not kill process after grace period". This termination is likely a downstream effect of the underlying network and DNS issues that prevented communication with the cluster, rather than an independent failure of the test execution logic. The supplemental context also points to network-related errors and potential unreachability of the cluster API endpoint.

Impact

The inability to resolve the cluster API hostname and subsequent network timeouts prevented essential diagnostic tools (like must-gather and oc) from collecting necessary information and establishing a connection to the cluster. This fundamental infrastructure problem directly blocked the execution of the main E2E test suite (redhat-appstudio-e2e) and the collection of further diagnostic data, rendering the entire test run ineffective.

🔍 Evidence

appstudio-e2e-tests/gather-audit-logs

Category: infrastructure
Root Cause: The must-gather tool failed to resolve the hostname of the Kubernetes API server, indicating a DNS resolution problem within the cluster's network.

Logs:

artifacts/appstudio-e2e-tests/gather-audit-logs/run.log line 4
[must-gather      ] OUT 2026-02-05T09:08:09.285583307Z Get "https://api.konflux-4-17-us-west-2-jrr82.konflux-qe.devcluster.openshift.com:6443/apis/image.openshift.io/v1/namespaces/openshift/imagestreams/must-gather": dial tcp: lookup api.konflux-4-17-us-west-2-jrr82.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host
artifacts/appstudio-e2e-tests/gather-audit-logs/run.log line 12
error getting cluster version: Get "https://api.konflux-4-17-us-west-2-jrr82.konflux-qe.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusterversions/version": dial tcp: lookup api.konflux-4-17-us-west-2-jrr82.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host
artifacts/appstudio-e2e-tests/gather-audit-logs/run.log line 17
error getting cluster operators: Get "https://api.konflux-4-17-us-west-2-jrr82.konflux-qe.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp: lookup api.konflux-4-17-us-west-2-jrr82.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host
artifacts/appstudio-e2e-tests/gather-audit-logs/run.log line 25
Error running must-gather collection:
    creating temp namespace: Post "https://api.konflux-4-17-us-west-2-jrr82.konflux-qe.devcluster.openshift.com:6443/api/v1/namespaces": dial tcp: lookup api.konflux-4-17-us-west-2-jrr82.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host
artifacts/appstudio-e2e-tests/gather-audit-logs/run.log line 60
error running backup collection: Get "https://api.konflux-4-17-us-west-2-jrr82.konflux-qe.devcluster.openshift.com:6443/api?timeout=32s": dial tcp: lookup api.konflux-4-17-us-west-2-jrr82.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host

appstudio-e2e-tests/gather-extra

Category: infrastructure
Root Cause: The failure is caused by a DNS resolution issue, where the cluster's API server hostname cannot be resolved by the configured DNS server. This prevents the step from connecting to the cluster to gather artifacts.

Logs:

artifacts/appstudio-e2e-tests/gather-extra/gather-extra.log line 3
E0205 09:08:03.210645      29 memcache.go:265] couldn't get current server API group list: Get "https://api.konflux-4-17-us-west-2-jrr82.konflux-qe.devcluster.openshift.com:6443/api?timeout=5s": dial tcp: lookup api.konflux-4-17-us-west-2-jrr82.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host
artifacts/appstudio-e2e-tests/gather-extra/gather-extra.log line 10
Unable to connect to the server: dial tcp: lookup api.konflux-4-17-us-west-2-jrr82.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host

appstudio-e2e-tests/gather-must-gather

Category: infrastructure
Root Cause: The must-gather tool failed to connect to the OpenShift API server due to network timeouts and DNS resolution errors, indicating an issue with cluster accessibility or network configuration.

Logs:

artifacts/appstudio-e2e-tests/gather-must-gather/step.log line 18
Error running must-gather collection:
    creating temp namespace: Post "https://api.konflux-4-17-us-west-2-jrr82.konflux-qe.devcluster.openshift.com:6443/api/v1/namespaces": dial tcp [REDACTED: Public IP (ipv4)]: i/o timeout
artifacts/appstudio-e2e-tests/gather-must-gather/step.log line 36
E0205 09:07:50.448054      53 memcache.go:265] couldn't get current server API group list: Get "https://api.konflux-4-17-us-west-2-jrr82.konflux-qe.devcluster.openshift.com:6443/api?timeout=32s": dial tcp: lookup api.konflux-4-17-us-west-2-jrr82.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host
artifacts/appstudio-e2e-tests/gather-must-gather/step.log line 45
error running backup collection: Get "https://api.konflux-4-17-us-west-2-jrr82.konflux-qe.devcluster.openshift.com:6443/api?timeout=32s": dial tcp: lookup api.konflux-4-17-us-west-2-jrr82.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host
artifacts/appstudio-e2e-tests/gather-must-gather/step.log line 46
error: creating temp namespace: Post "https://api.konflux-4-17-us-west-2-jrr82.konflux-qe.devcluster.openshift.com:6443/api/v1/namespaces": dial tcp [REDACTED: Public IP (ipv4)]: i/o timeout

appstudio-e2e-tests/redhat-appstudio-e2e

Category: infrastructure
Root Cause: The e2e tests failed due to an unexpected termination of the main process, likely caused by an external signal or a problem within the Kubernetes cluster environment that led to the process being killed after a grace period.

Logs:

artifacts/appstudio-e2e-tests/redhat-appstudio-e2e/build.log line 650
{"component":"entrypoint","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:173","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.ExecuteProcess","level":"error","msg":"Entrypoint received interrupt: terminated","severity":"error","time":"2026-02-05T09:02:37Z"}
artifacts/appstudio-e2e-tests/redhat-appstudio-e2e/build.log line 651
make: *** [Makefile:25: ci/test/e2e] Terminated
artifacts/appstudio-e2e-tests/redhat-appstudio-e2e/build.log line 653
{"component":"entrypoint","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:267","func":"sigs.k8s.io/prow/pkg/entrypoint.gracefullyTerminate","level":"error","msg":"Process did not exit before 15s grace period","severity":"error","time":"2026-02-05T09:02:52Z"}
artifacts/appstudio-e2e-tests/redhat-appstudio-e2e/build.log line 654
{"component":"entrypoint","error":"os: process already finished","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:269","func":"sigs.k8s.io/prow/pkg/entrypoint.gracefullyTerminate","level":"error","msg":"Could not kill process after grace period","severity":"error","time":"2026-02-05T09:02:52Z"}
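
The timestamps in these entrypoint records can be checked against the 15s grace period named in the message. This is a minimal sketch under the assumption that only the `msg` and `time` fields matter; the two records below are trimmed copies of the log lines above, and `parse_time` is a helper name introduced here for illustration.

```python
import json
from datetime import datetime

# Trimmed copies of two entrypoint records from the build log above.
records = [
    '{"msg":"Entrypoint received interrupt: terminated","time":"2026-02-05T09:02:37Z"}',
    '{"msg":"Process did not exit before 15s grace period","time":"2026-02-05T09:02:52Z"}',
]


def parse_time(record: str) -> datetime:
    """Read the 'time' field of one JSON log record as an aware datetime."""
    stamp = json.loads(record)["time"]
    # Swap the trailing 'Z' for an explicit offset so that
    # datetime.fromisoformat also accepts it on Python versions before 3.11.
    return datetime.fromisoformat(stamp.replace("Z", "+00:00"))


interrupt_at, grace_expired_at = (parse_time(r) for r in records)
elapsed = (grace_expired_at - interrupt_at).total_seconds()
print(elapsed)  # 15.0
```

The 15-second gap matches the grace period: the interrupt arrived at 09:02:37Z and the grace-period message was logged at 09:02:52Z.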

appstudio-e2e-tests/redhat-appstudio-gather

Category: infrastructure
Root Cause: The oc client is unable to resolve the hostname of the Kubernetes API server, indicating a DNS resolution issue within the cluster network or environment. This prevents the necessary oc commands from executing successfully.

Logs:

artifacts/appstudio-e2e-tests__redhat-appstudio-gather/oc-logs.txt
E0205 09:08:17.394477      45 memcache.go:265] couldn't get current server API group list: Get "https://api.konflux-4-17-us-west-2-jrr82.konflux-qe.devcluster.openshift.com:6443/api?timeout=5s": dial tcp: lookup api.konflux-4-17-us-west-2-jrr82.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host
artifacts/appstudio-e2e-tests__redhat-appstudio-gather/oc-logs.txt
Unable to connect to the server: dial tcp: lookup api.konflux-4-17-us-west-2-jrr82.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host
artifacts/appstudio-e2e-tests__redhat-appstudio-gather/oc-logs.txt
Error running must-gather collection:
    creating temp namespace: Post "https://api.konflux-4-17-us-west-2-jrr82.konflux-qe.devcluster.openshift.com:6443/api/v1/namespaces": dial tcp: lookup api.konflux-4-17-us-west-2-jrr82.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host
artifacts/appstudio-e2e-tests__redhat-appstudio-gather/oc-logs.txt
error running backup collection: Get "https://api.konflux-4-17-us-west-2-jrr82.konflux-qe.devcluster.openshift.com:6443/api?timeout=32s": dial tcp: lookup api.konflux-4-17-us-west-2-jrr82.konflux-qe.devcluster.openshift.com on 172.30.0.10:53: no such host

Analysis powered by prow-failure-analysis | Build: 2019329481784692736

@konflux-ci-qe-bot

🤖 Pipeline Failure Analysis

Category: Timeout

The end-to-end tests for AppStudio infrastructure failed due to a timeout caused by prolonged setup and synchronization of Kubernetes resources, including Argo CD.

📋 Technical Details

Immediate Cause

The appstudio-e2e-tests/redhat-appstudio-e2e step in the Prow job exceeded its 2-hour time limit. This timeout prevented the completion of the end-to-end test execution.

Contributing Factors

Several factors indicated in the additional_context may have contributed to the extended setup and synchronization times:

  • Multiple Argo CD ApplicationSets, such as 'application-api', 'build-service', and 'crossplane-control-plane', were in an OutOfSync state.
  • The 'build-service-controller-manager' deployment in the 'build-service' namespace was Degraded due to exceeding its progress deadline.
  • The tektonaddons reported an InstallerSetReady status of False due to issues with the 'tkn-cli-serve' deployment.

Impact

The timeout of the e2e test execution step prevented the successful validation of the AppStudio infrastructure deployment in the test environment, thereby blocking the CI pipeline.

🔍 Evidence

appstudio-e2e-tests/redhat-appstudio-e2e

Category: timeout
Root Cause: The e2e tests timed out because the setup and synchronization processes for various Kubernetes resources and operators, such as Argo CD, took longer than the allocated time limit for the test execution.

Logs:

artifacts/appstudio-e2e-tests/redhat-appstudio-e2e/step.log line 1279
{"component":"entrypoint","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:169","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 2h0m0s timeout","severity":"error","time":"2026-02-05T11:05:47Z"}
artifacts/appstudio-e2e-tests/redhat-appstudio-e2e/step.log line 1281
{"component":"entrypoint","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:267","func":"sigs.k8s.io/prow/pkg/entrypoint.gracefullyTerminate","level":"error","msg":"Process did not exit before 15s grace period","severity":"error","time":"2026-02-05T11:06:02Z"}
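
As a rough cross-check of the timeline, the approximate start of the step can be estimated by subtracting the timeout from the timestamp of the timeout message. This is a sketch only: `parse_go_duration` is a hypothetical helper that handles just the `h`/`m`/`s` subset of Go's duration format, and the estimate assumes the timeout clock started when the test process started and that the message was logged promptly.

```python
import re
from datetime import datetime, timedelta


def parse_go_duration(text: str) -> timedelta:
    """Parse an h/m/s duration such as '2h0m0s' (a subset of Go's format)."""
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?", text)
    if not text or not match:
        raise ValueError(f"unsupported duration: {text!r}")
    hours, minutes, seconds = (int(g or 0) for g in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


# Values copied from the two log records above.
timeout = parse_go_duration("2h0m0s")
timed_out_at = datetime.fromisoformat("2026-02-05T11:05:47+00:00")
grace_logged_at = datetime.fromisoformat("2026-02-05T11:06:02+00:00")

estimated_start = timed_out_at - timeout
grace_seconds = (grace_logged_at - timed_out_at).total_seconds()

print(estimated_start.isoformat())  # 2026-02-05T09:05:47+00:00
print(grace_seconds)                # 15.0
```

Under those assumptions the step started at about 09:05:47Z, and the second record follows the timeout by exactly the 15s grace period.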

Analysis powered by prow-failure-analysis | Build: 2019335848314540032

@mshaposhnik
Contributor Author

/test appstudio-e2e-tests

@konflux-ci-qe-bot

🤖 Pipeline Failure Analysis

Category: Timeout

The Prow job timed out during the execution of end-to-end tests for AppStudio.

📋 Technical Details

Immediate Cause

The appstudio-e2e-tests/redhat-appstudio-e2e step of the Prow job pull-ci-redhat-appstudio-infra-deployments-main-appstudio-e2e-tests failed due to a timeout. The job exceeded its 2-hour time limit, leading to termination.

Contributing Factors

Analysis of the additional_context indicates potential underlying infrastructure issues that may have contributed to the prolonged test execution. Specifically, the build-service-controller-manager deployment was in a 'Degraded' state, and several Argo CD ApplicationSets reported errors and out-of-sync resources. These issues could have slowed down the deployment or setup phases of the e2e tests, preventing them from completing within the allotted time. Additionally, the tektonaddons.json artifact shows that TektonAddon components were not ready due to issues with openshift console resources and the 'tkn-cli-serve' deployment.

Impact

The timeout prevented the completion of the end-to-end test suite for the AppStudio infrastructure. This means that the tests designed to validate the functionality and stability of the AppStudio deployment were not executed successfully, leaving the overall health of the deployment unverified for this build.

🔍 Evidence

appstudio-e2e-tests/redhat-appstudio-e2e

Category: timeout
Root Cause: The e2e tests timed out, likely due to a prolonged execution of setup or deployment steps, or an actual test scenario that took too long to complete. The "Process did not finish before 2h0m0s timeout" and "Process did not exit before 15s grace period" messages indicate the job was terminated due to exceeding its time limit.

Logs:

artifacts/appstudio-e2e-tests/redhat-appstudio-e2e/build-log.txt line 1384
{"component":"entrypoint","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:169","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 2h0m0s timeout","severity":"error","time":"2026-02-05T14:00:38Z"}
artifacts/appstudio-e2e-tests/redhat-appstudio-e2e/build-log.txt line 1386
{"component":"entrypoint","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:267","func":"sigs.k8s.io/prow/pkg/entrypoint.gracefullyTerminate","level":"error","msg":"Process did not exit before 15s grace period","severity":"error","time":"2026-02-05T14:00:53Z"}

Analysis powered by prow-failure-analysis | Build: 2019379860396314624

@openshift-ci

openshift-ci bot commented Feb 5, 2026

@mshaposhnik: The following test failed. Say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/appstudio-e2e-tests | f57d469 | link | true | /test appstudio-e2e-tests |

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


2 participants