
@dis016 commented Aug 1, 2025

Testing out test case execution with the OTE framework; not to be merged.

openshift-ci bot commented Aug 1, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: dis016
Once this PR has been reviewed and has the lgtm label, please assign davidhurta for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot commented Aug 1, 2025

@dis016: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-agnostic-ovn-upgrade-out-of-change ddf71ca link true /test e2e-agnostic-ovn-upgrade-out-of-change
ci/prow/e2e-agnostic-ovn ddf71ca link true /test e2e-agnostic-ovn
ci/prow/lint ddf71ca link true /test lint
ci/prow/e2e-agnostic-ovn-upgrade-into-change ddf71ca link true /test e2e-agnostic-ovn-upgrade-into-change
ci/prow/e2e-aws-ovn-techpreview ddf71ca link true /test e2e-aws-ovn-techpreview
ci/prow/e2e-hypershift ddf71ca link true /test e2e-hypershift
ci/prow/e2e-agnostic-operator ddf71ca link true /test e2e-agnostic-operator
ci/prow/unit ddf71ca link true /test unit
ci/prow/okd-scos-e2e-aws-ovn ddf71ca link false /test okd-scos-e2e-aws-ovn
ci/prow/verify-update ddf71ca link true /test verify-update
ci/prow/e2e-hypershift-conformance ddf71ca link true /test e2e-hypershift-conformance
ci/prow/okd-scos-images ddf71ca link true /test okd-scos-images
ci/prow/verify-deps ddf71ca link true /test verify-deps
ci/prow/images ddf71ca link true /test images
ci/prow/e2e-agnostic-operator-devpreview ddf71ca link false /test e2e-agnostic-operator-devpreview

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@JianLi-RH commented

Hi @dis016, can we start with a very simple CVO case? I mean, let's only copy the necessary files from openshift-tests-private to cluster-version-operator. If you don't mind, I want to create another PR for the same work. I will also avoid using the OC client (please refer to https://redhat-internal.slack.com/archives/CJ1J9C3V4/p1755771964418519?thread_ts=1755602807.015039&cid=CJ1J9C3V4).

@dis016 (Author) commented Sep 2, 2025

> Hi @dis016, can we start with a very simple CVO case? I mean, let's only copy the necessary files from openshift-tests-private to cluster-version-operator. If you don't mind, I want to create another PR for the same work. I will also avoid using the OC client (please refer to https://redhat-internal.slack.com/archives/CJ1J9C3V4/p1755771964418519?thread_ts=1755602807.015039&cid=CJ1J9C3V4).

Sure @JianLi-RH, please go ahead.

@openshift-bot commented

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-merge-robot added the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD) on Dec 1, 2025
@openshift-merge-robot commented

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-ci bot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale) on Dec 1, 2025
coderabbitai bot commented Dec 1, 2025

Walkthrough

Introduces extensive test utility expansion spanning cloud provider clients (AWS, Azure, GCP, IBM Cloud, Nutanix, OpenStack), database drivers (MongoDB, MySQL, PostgreSQL, SQLite), Kubernetes/OpenShift cluster management, machine/node operations, container interactions (Docker, Podman, Quay), bootstrap provisioning, authentication/OAuth orchestration, and general test infrastructure across multiple cloud platforms.

Changes

Cohort / File(s) Summary
CVO Test Refactoring
test/cvo/cvo.go, test/cvo/utils.go
Replaces dot imports with named aliases (ginkgo, gomega, exutil); adds project-scoped setup with CLI helper; reworks test bodies to use explicit alias-based assertions; adds version command test.
Architecture Utilities
test/util/architecture/*
Adds OWNERS file and comprehensive architecture detection package with enum (AMD64, ARM64, PPC64LE, S390X, MULTI, UNKNOWN), skip logic, multi-arch detection, control-plane queries, and string conversions.
Test Assertion Helpers
test/util/assert.go
Introduces polling assertion utilities: AssertWaitPollNoErr, AssertWaitPollWithErr, and generic OrFail helper with error handling and proper assertion offset.
RBAC & Auth Utils
test/util/auth_rule_resolver.go, test/util/ldap.go, test/util/oauthserver/*
Adds RBAC rule resolver, LDAP server provisioning, OAuth server deployment, token command flow with challenge handlers (Basic, Negotiate, Multi), and SSPI/GSSAPI negotiation stubs.
AWS Utilities
test/util/aws_client.go, test/util/bootstrap/aws.go
Comprehensive AWS EC2 client (instances, subnets, VPCs, security groups, DHCP, tags); S3, IAM, Route53, STS, ECR, KMS sub-clients; bootstrap provider for SSH access.
Azure Utilities
test/util/azure_client.go, test/util/azure_client_v2.go, test/util/azure_creds.go, test/util/azure/config_file.go, test/util/bootstrap/azure.go
Azure VM, storage, identities, and role management; credential handling with base64 decoding; config file loading; SDK v2 client set with capacity reservations and Graph integration; bootstrap provider.
GCP & IBM Cloud Utilities
test/util/gcloud_client.go, test/util/ibmcloud_client.go
GCP instance, firewall, storage, deployment manager, VPN operations; IBM Cloud VPC and PowerVS session management, instance lifecycle, credential extraction.
Nutanix & OpenStack Utilities
test/util/nutanix_client.go, test/util/nutanix_restapi.go, test/util/openstack_util.go
Nutanix VM lifecycle via REST API; OpenStack credentials, container/object storage, compute instance management.
Database Utilities
test/util/db/common.go, test/util/db/mongodb.go, test/util/db/mysql.go, test/util/db/postgresql.go, test/util/db/sqlit.go, test/util/db_image_helpers.go
Common pod config retrieval, MongoDB/MySQL/PostgreSQL query interfaces, SQLite operator bundle inspection, polling-based database readiness checks.
Container Utilities
test/util/container/docker_client.go, test/util/container/podman_client.go, test/util/container/quay_client.go
Docker and Podman CLI wrappers with image/container lifecycle management; Quay.io API client for tag operations.
Machine & MachineSet Helpers
test/util/clusterinfra/cluster_helpers.go, test/util/clusterinfra/machine_helpers.go, test/util/clusterinfra/machine_arch.go, test/util/clusterinfra/machineset_helper.go, test/util/clusterinfra/machineset_nonspot.go
Platform detection and credential extraction; machine set operations, listing, scaling; architecture-aware machine creation; labeled/non-spot machine set management; polling for machine/node readiness.
Cluster & Infrastructure Utilities
test/util/cloud/cloud.go, test/util/clusters.go, test/util/clusterinfra_bootstrap.go, test/util/nodes.go
Cloud config derivation from live clusters; cluster version/release/mirror/capability queries; node queries by OS/role; resource allocation maps; debug helpers; platform-specific skips.
Networking & Endpoint Utilities
test/util/networking.go, test/util/endpoints.go
Network type, operator status, IP version stack detection; endpoint availability polling.
CLI & Client Foundation
test/util/client.go, test/util/oc_copy.go
Comprehensive OpenShift CLI wrapper (CLI type) with namespace/project provisioning, client getters, RBAC, token generation, and kubeconfig assembly from REST configs.
Framework & Build Utilities
test/util/framework.go, test/util/ccoctl.go
Extensive test framework: image stream, deployment config, build, pod, service readiness; logging/diagnostics (DumpBuildLogs, DumpPodLogs); build result tracking; Git repo helpers.
Deployment & Pod Utilities
test/util/deployment.go, test/util/deploymentconfigs.go, test/util/pods.go, test/util/nfs.go
Deployment readiness polling, deployment config removal, pod lifecycle, remote shell access, log retrieval, NFS server/volume setup.
Image & Docker Utilities
test/util/docker.go, test/util/image_helpers.go, test/util/release.go
Docker image listing/tagging, image ID lookup with missing-tag error, CoreOS boot image extraction, RHCOS Azure disk URL, latest release image queries.
Monitoring & Metrics
test/util/metrics.go, test/util/prometheus_monitoring.go
Per-pod metrics collection with QPS calculation, Prometheus/Thanos query interfaces, instant/range/rule queries, alert retrieval, metric data extraction.
Jenkins Utilities
test/util/jenkins/ref.go, test/util/jenkins/monitor.go
Jenkins HTTP API wrapper (URI construction, GET/POST, XML templating), job monitoring, GC/memory tracking, console log retrieval, image overrides, DSL job building.
Bootstrap Provisioning
test/util/bootstrap/bootstrap.go
Bootstrap machine abstraction with platform-specific providers, SSH client integration, IP retrieval, error handling for missing instances.
ROSA & HyperShift Utilities
test/util/rosa.go, test/util/hypeshift_cluster.go, test/util/rosacli/*
ROSA detection and login, hosted cluster kubeconfig retrieval, platform-type queries, node pool operations; cluster command/config parsing, cluster service abstraction with YAML/JSON marshalling.
General Utilities
test/util/assert.go, test/util/file.go, test/util/gwrapper.go, test/util/gomega_helpers.go, test/util/logext/log.go, test/util/manifest.go, test/util/must-gather.go, test/util/randstr.go, test/util/resource_op.go, test/util/machine_helpers.go, test/util/disruption/disruption.go
File operations (copy, move, temp); Ginkgo By wrapper with logging; Gomega privacy matcher; zerolog integration; manifest reading/generation; must-gather archival; random strings; resource labeling/annotation; machine API constants; disruption test orchestration.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120+ minutes

Key areas requiring extra attention:

  • Cloud provider client implementations (test/util/*_client.go, test/util/*/bootstrap/aws.go, test/util/*/bootstrap/azure.go) — Each cloud provider has distinct API patterns, error handling, and credential management flows; credential handling and authentication must be secure and consistent across providers.
  • Database abstraction layer (test/util/db/) — Multiple database implementations (MongoDB, MySQL, PostgreSQL, SQLite) with pod-based execution; query execution, connection handling, and remote command patterns require verification for correctness and consistency.
  • Machine/MachineSet orchestration (test/util/clusterinfra/machine_*.go) — Complex polling patterns, architecture filtering, and instance-type mapping with multiple cloud provider variations; high risk for race conditions or incomplete state handling.
  • CLI wrapper foundation (test/util/client.go) — Large, multi-method type with extensive state management, namespace provisioning, client initialization, RBAC, and token generation; central to all test execution paths.
  • Authentication/OAuth flow (test/util/oauthserver/tokencmd/) — Challenge handler composition, PKCE/code flow coordination, and TLS/redirect handling; security-critical code requiring careful validation.
  • Heterogeneity of patterns — Similar operations (credential retrieval, instance lifecycle, querying) are implemented across 5+ cloud providers with subtle differences; consistency, error messages, and logging patterns should be validated across all implementations.
  • New public API surface — Large number of exported types, functions, and interfaces introduced; backward compatibility and API stability implications should be reviewed.

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai bot left a comment

Actionable comments posted: 28

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

🟠 Major comments (66)
test/util/rosacli/cluster_command_config.go-40-48 (1)

40-48: Potential index out of bounds panic when flag is the last token.

If a flag key is the last element in the command, accessing elements[i+1] will panic. Add a bounds check before accessing the next element.

 func (c *command) ReplaceFlagValue(flags map[string]string) {
 	elements := strings.Split(c.cmd, " ")
 	for i, e := range elements {
 		if value, ok := flags[e]; ok {
+			if i+1 >= len(elements) {
+				continue
+			}
 			elements[i+1] = value
 		}
 	}
 	c.cmd = strings.Join(elements, " ")
 }
test/util/rosacli/cluster_command_config.go-68-80 (1)

68-80: Index out of bounds in GetFlagValue when flag is the last token.

Accessing elements[i+1] without bounds checking will panic if the flag is the last element.

 func (c *command) GetFlagValue(flag string, flagWithVaue bool) string {
 	elements := strings.Split(c.cmd, " ")
 	for i, e := range elements {
 		if e == flag {
 			if flagWithVaue {
+				if i+1 >= len(elements) {
+					return ""
+				}
 				return elements[i+1]
 			} else {
 				return ""
 			}
 		}
 	}
 	return ""
 }
test/util/rosacli/cluster_config.go-120-127 (1)

120-127: Ignored error from os.ReadFile may mask issues.

On line 124, the error from ReadFile is discarded. If the file exists but can't be read (permissions, I/O error), the function silently returns an empty cluster ID.

 func GetClusterID() (clusterID string) {
 	clusterID = getClusterIDENVExisted()
 	if clusterID != "" {
 		return
 	}

 	if _, err := os.Stat(getClusterIDFile()); err != nil {
 		logger.Errorf("Cluster id file not existing")
 		return ""
 	}
-	fileCont, _ := os.ReadFile(getClusterIDFile())
+	fileCont, err := os.ReadFile(getClusterIDFile())
+	if err != nil {
+		logger.Errorf("Error reading cluster id file: %v", err)
+		return ""
+	}
 	clusterID = string(fileCont)
 	return
 }
test/util/rosacli/cluster_service.go-225-234 (1)

225-234: Shared runner state modification may cause issues in concurrent test execution.

The method modifies the runner's format state (JsonFormat() / UnsetFormat()) which affects the shared client. If tests run concurrently, this could cause race conditions or unexpected output formats.

Additionally, the error message on line 229 is misleading—it always references IsUsingReusableOIDCConfig regardless of which caller invoked this method.

 func (c *clusterService) getJSONClusterDescription(clusterID string) (*jsonData, error) {
 	c.client.Runner.JsonFormat()
+	defer c.client.Runner.UnsetFormat()
 	output, err := c.DescribeCluster(clusterID)
 	if err != nil {
-		logger.Errorf("it met error when describeCluster in IsUsingReusableOIDCConfig is %v", err)
+		logger.Errorf("error describing cluster %s: %v", clusterID, err)
 		return nil, err
 	}
-	c.client.Runner.UnsetFormat()
 	return c.client.Parser.JsonData.Input(output).Parse(), nil
 }
test/util/rosacli/cluster_command_config.go-51-65 (1)

51-65: Index out of bounds panic in DeleteFlag when flag is near the end.

When flagWithVaue is true and the flag is the last or second-to-last token, elements[i+2:] will panic. Similarly, the i+1 slice could fail if the flag lacks a value.

 func (c *command) DeleteFlag(flag string, flagWithVaue bool) error {
 	elements := strings.Split(c.cmd, " ")
 	for i, e := range elements {
 		if e == flag {
 			if flagWithVaue {
+				if i+2 > len(elements) {
+					return fmt.Errorf("flag %s found but missing value in command %s", flag, c.cmd)
+				}
 				elements = append(elements[:i], elements[i+2:]...)
 			} else {
 				elements = append(elements[:i], elements[i+1:]...)
 			}
 			c.cmd = strings.Join(elements, " ")
 			return nil
 		}
 	}
 	return fmt.Errorf("cannot find flag %s in command %s", flag, c.cmd)
 }
test/util/jenkins/monitor.go-78-92 (1)

78-92: Goroutine resource leak when ticker is stopped.

When ticker.Stop() is called, the channel is not closed—it just stops sending values. The goroutine will block indefinitely on range ticker.C, creating a resource leak. Consider using a done channel or context for graceful shutdown.

-func StartJenkinsGCTracking(oc *exutil.CLI, jenkinsNamespace string) *time.Ticker {
+func StartJenkinsGCTracking(oc *exutil.CLI, jenkinsNamespace string) (ticker *time.Ticker, stop func()) {
 	jenkinsPod := FindJenkinsPod(oc)
-	ticker := time.NewTicker(10 * time.Second)
+	ticker = time.NewTicker(10 * time.Second)
+	done := make(chan struct{})
 	go func() {
-		for t := range ticker.C {
-			stats, err := oc.Run("rsh").Args("--namespace", jenkinsNamespace, jenkinsPod.Name, "jstat", "-gcutil", "1").Output()
-			if err == nil {
-				fmt.Fprintf(g.GinkgoWriter, "\n\nJenkins gc stats %v\n%s\n\n", t, stats)
-			} else {
-				fmt.Fprintf(g.GinkgoWriter, "Unable to acquire Jenkins gc stats: %v", err)
+		for {
+			select {
+			case t := <-ticker.C:
+				stats, err := oc.Run("rsh").Args("--namespace", jenkinsNamespace, jenkinsPod.Name, "jstat", "-gcutil", "1").Output()
+				if err == nil {
+					fmt.Fprintf(g.GinkgoWriter, "\n\nJenkins gc stats %v\n%s\n\n", t, stats)
+				} else {
+					fmt.Fprintf(g.GinkgoWriter, "Unable to acquire Jenkins gc stats: %v", err)
+				}
+			case <-done:
+				return
 			}
 		}
 	}()
-	return ticker
+	return ticker, func() { ticker.Stop(); close(done) }
 }
test/util/ldap.go-42-163 (1)

42-163: Add resource cleanup on error to prevent test namespace pollution.

The function creates multiple Kubernetes resources (ConfigMaps, Service, ServiceAccount, Secret, Roles, Deployment) but does not clean them up if a later step fails. This can leave the test namespace in an inconsistent state and cause conflicts on retry.

Consider one of these approaches:

  1. Defer-based cleanup that tracks created resources and deletes them on error
  2. Early validation to fail fast before creating any resources
  3. Idempotent creates using CreateOrUpdate patterns to handle retries

Example using defer-based cleanup:

func CreateLDAPTestServer(oc *CLI) (string, []byte, error) {
	deploy, ldapService, ldif, scripts := ReadLDAPServerTestData()
	
	var createdResources []func() error
	cleanup := func() {
		for i := len(createdResources) - 1; i >= 0; i-- {
			_ = createdResources[i]()
		}
	}
	
	certDir, err := os.MkdirTemp("", "testca")
	if err != nil {
		return "", nil, err
	}
	defer os.RemoveAll(certDir)

	if _, err := oc.AdminKubeClient().CoreV1().ConfigMaps(oc.Namespace()).Create(...); err != nil {
		return "", nil, err
	}
	createdResources = append(createdResources, func() error {
		return oc.AdminKubeClient().CoreV1().ConfigMaps(oc.Namespace()).Delete(context.Background(), ldif.Name, metav1.DeleteOptions{})
	})
	
	// ... continue for other resources
	
	// On success, clear cleanup
	defer func() {
		if err == nil {
			createdResources = nil
		} else {
			cleanup()
		}
	}()
	
	return serviceHost, caPEM, nil
}
test/util/disruption/disruption.go-77-85 (1)

77-85: Silent error handling may hide test report failures.

Errors from file creation (line 81) and XML encoding (line 84) are silently ignored. In CI environments, this could result in missing JUnit reports without any indication of failure.

Consider logging errors to aid debugging:

 		if framework.TestContext.ReportDir != "" {
 			fname := filepath.Join(framework.TestContext.ReportDir, fmt.Sprintf("junit_%s_%d.xml", testSuite.Package, time.Now().Unix()))
 			f, err := os.Create(fname)
 			if err != nil {
+				framework.Logf("Failed to create JUnit report file %s: %v", fname, err)
 				return
 			}
 			defer f.Close()
-			xml.NewEncoder(f).Encode(testSuite)
+			if err := xml.NewEncoder(f).Encode(testSuite); err != nil {
+				framework.Logf("Failed to encode JUnit report: %v", err)
+			}
 		}
test/util/container/quay_client.go-7-7 (1)

7-7: Replace deprecated ioutil package.

The ioutil package has been deprecated since Go 1.16. Use os.ReadFile (line 62) and io.ReadAll (lines 151, 210) instead.

Apply this diff:

-	"io/ioutil"
+	"io"

Then update the usages:

  • Line 62: ioutil.ReadFile → os.ReadFile
  • Lines 151, 210: ioutil.ReadAll → io.ReadAll

Committable suggestion skipped: line range outside the PR's diff.
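
For reference, a sketch of the call-site replacements (path, resp, and the surrounding error handling are placeholders, not the file's actual identifiers):

data, err := os.ReadFile(path) // line 62, was ioutil.ReadFile(path)
if err != nil {
	return err
}
body, err := io.ReadAll(resp.Body) // lines 151 and 210, was ioutil.ReadAll(resp.Body)
if err != nil {
	return err
}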

test/util/ibmcloud_client.go-327-344 (1)

327-344: Return error when instance is not found.

Line 343 returns empty strings with nil error when the instance is not found. This is ambiguous—callers cannot distinguish between "instance not found" and "instance found with empty ID/status."

Apply this diff:

 		}
 	}
 
-	return "", "", nil
+	return "", "", fmt.Errorf("instance not found: %s", instanceName)
 }

Committable suggestion skipped: line range outside the PR's diff.

test/util/db/common.go-54-61 (1)

54-61: Silently swallowing ExitError may hide legitimate failures.

Returning ("", nil) for ExitError makes it impossible for callers to distinguish between a command that produced no output and a command that failed. This can mask database connection failures or query errors.

Consider returning the error or at least the exit code:

 	if err != nil {
 		switch err.(type) {
 		case *util.ExitError, *exec.ExitError:
-			return "", nil
+			return "", err // Let caller decide how to handle
 		default:
 			return "", err
 		}
 	}
test/util/machine_helpers.go-51-53 (1)

51-53: Ignored time.Parse errors can cause incorrect comparison results.

If parsing fails, t1 or t2 will be zero-valued, leading to incorrect or misleading comparison results. This is particularly problematic for test utilities where silent failures can mask real issues.

-	t1, _ := time.Parse(time.RFC3339, p10CreationTime)
-	t2, _ := time.Parse(time.RFC3339, p20CreationTime)
-	return !(t1.Before(t2) || t1.Equal(t2))
+	t1, err := time.Parse(time.RFC3339, p10CreationTime)
+	if err != nil {
+		// Log or handle parse error - caller should be aware
+		return false
+	}
+	t2, err := time.Parse(time.RFC3339, p20CreationTime)
+	if err != nil {
+		return false
+	}
+	return t1.After(t2)

Note: !(t1.Before(t2) || t1.Equal(t2)) simplifies to t1.After(t2).

test/util/deploymentconfigs.go-23-35 (1)

23-35: Polling stops on transient errors and waits even after failed deletion.

Two issues:

  1. Line 27 returns err immediately on GetApplicationPods failure, causing the poll to fail on transient API errors instead of retrying.
  2. The poll runs even when DC deletion failed (lines 18-21), which wastes time waiting for pods that may never terminate.

Consider this approach:

-	err := wait.PollImmediate(5*time.Second, 5*time.Minute, func() (bool, error) {
+	if err := oc.AdminAppsClient().AppsV1().DeploymentConfigs(oc.Namespace()).Delete(context.Background(), dc, metav1.DeleteOptions{}); err != nil {
+		e2e.Logf("Error occurred removing deployment config: %v", err)
+		errs = append(errs, err)
+		continue // Skip waiting if deletion failed
+	}
+
+	err := wait.PollImmediate(5*time.Second, 5*time.Minute, func() (bool, error) {
 		pods, err := GetApplicationPods(oc, dc)
 		if err != nil {
 			e2e.Logf("Unable to get pods for dc/%s: %v", dc, err)
-			return false, err
+			return false, nil // Retry on transient errors
 		}

Committable suggestion skipped: line range outside the PR's diff.

test/util/nfs.go-39-50 (1)

39-50: Poll error is not checked before creating PVs.

The wait.PollImmediate on line 39 may return an error (e.g., timeout), but the code proceeds to create PVs regardless. If the NFS server pod never becomes Running, PV creation will fail with a non-functional server.

 	err = wait.PollImmediate(5*time.Second, 1*time.Minute, func() (bool, error) {
 		// ...
 	})
+	if err != nil {
+		return nil, nil, fmt.Errorf("NFS server pod failed to reach Running state: %w", err)
+	}
 
 	pvs := []*kapiv1.PersistentVolume{}
test/util/nfs.go-66-73 (1)

66-73: PV creation errors are silently ignored; only the last error is returned.

When PV creation fails (line 67), the error is logged but the loop continues, overwriting err. The returned err on line 73 only reflects the last iteration's error, not earlier failures.

Accumulate errors properly:

+	var pvErrors []error
 	for i := 0; i < count; i++ {
 		e2e.Logf(fmt.Sprintf("Creating persistent volume %d", i))
 		pvConfig := e2epv.PersistentVolumeConfig{
 			// ...
 		}
 		pvTemplate := e2epv.MakePersistentVolume(pvConfig)
 		pv, err := oc.AdminKubeClient().CoreV1().PersistentVolumes().Create(context.Background(), pvTemplate, metav1.CreateOptions{})
 		if err != nil {
 			e2e.Logf("error creating persistent volume %#v", err)
+			pvErrors = append(pvErrors, err)
+			continue
 		}
 		e2e.Logf("Created persistent volume %#v", pv)
 		pvs = append(pvs, pv)
 	}
-	return pod, pvs, err
+	if len(pvErrors) > 0 {
+		return pod, pvs, fmt.Errorf("failed to create %d PVs: %v", len(pvErrors), pvErrors)
+	}
+	return pod, pvs, nil

Committable suggestion skipped: line range outside the PR's diff.

test/util/db/sqlit.go-80-85 (1)

80-85: Unchecked rows.Scan errors may hide data corruption.

Multiple rows.Scan calls ignore the returned error. Scan failures (e.g., type mismatches, NULL values) will go undetected.

-		rows.Scan(&name, &bundlepath, &version)
+		if err := rows.Scan(&name, &bundlepath, &version); err != nil {
+			return nil, fmt.Errorf("failed to scan row: %w", err)
+		}

Also applies to: 133-134, 158-159, 179-180

test/util/release.go-88-102 (1)

88-102: Resource leak: defer resp.Body.Close() should be placed immediately after the nil check.

The current code reads the body before deferring the close. If io.ReadAll fails, the body may not be properly closed in all code paths.

 	resp, err = http.Get(url)
 	if err != nil {
 		err = fmt.Errorf("fail to get url %s, error: %v", url, err)
 		return "", err
 	}
+	defer resp.Body.Close()
 	body, err = io.ReadAll(resp.Body)
-	defer resp.Body.Close()
 	if err != nil {
test/util/db/sqlit.go-117-143 (1)

117-143: Potential nil pointer dereference: defer rows.Close() before error check.

If QueryDB returns an error, rows could be nil, causing a panic when defer rows.Close() executes. The same issue exists in QueryPackge (line 153), and QueryRelatedImage (line 174).

 func (c *Sqlit) QueryOperatorChannel(dbFilePath string) ([]Channel, error) {
 	rows, err := c.QueryDB(dbFilePath, "select * from channel_entry;")
+	if err != nil {
+		return nil, err
+	}
+	defer rows.Close()
 	var (
 		Channels            []Channel
 		// ...
 	)
-	defer rows.Close()
-	if err != nil {
-		return Channels, err
-	}
test/util/cloud/cloud.go-71-75 (1)

71-75: Inverted logic for NumNodes assignment.

When len(nonMasters.Items) == 0, you assign NumNodes = len(nonMasters.Items) which is 0. In the else branch (when non-masters exist), you use the masters count. This appears inverted—typically you'd use non-master (worker) count when available.

-	if len(nonMasters.Items) == 0 {
-		cloudConfig.NumNodes = len(nonMasters.Items)
-	} else {
+	if len(nonMasters.Items) > 0 {
+		cloudConfig.NumNodes = len(nonMasters.Items)
+	} else {
 		cloudConfig.NumNodes = len(masters.Items)
 	}
test/cvo/cvo.go-20-25 (1)

20-25: Run() returns a builder, not an error—this assertion will always pass.

Based on the CLI struct pattern in test/util/client.go, Run() typically returns a builder object for method chaining (e.g., .Args().Execute()). Assigning it directly to err means the assertion checks a non-error builder value rather than the command's result, so o.Expect(err).NotTo(o.HaveOccurred()) will not catch command failures.

You likely need to call .Execute() to get the actual error:

 g.It("Ingress to CVO is not breaking for monitoring scrape", func() {
 	exutil.By("Testing my commands")
-	err := oc.AsAdmin().WithoutNamespace().Run("version")
+	err := oc.AsAdmin().WithoutNamespace().Run("version").Execute()
 	o.Expect(err).NotTo(o.HaveOccurred())
-	o.Expect(true).To(o.BeTrue())
 })

Also, the o.Expect(true).To(o.BeTrue()) on line 24 is redundant after a real assertion.

test/util/ccoctl.go-42-42 (1)

42-42: os.Remove won't delete a non-empty directory.

os.Remove only removes empty directories or files. After GetPullSec writes .dockerconfigjson to this directory, the removal will fail silently. Use os.RemoveAll instead.

-	defer os.Remove(pullSecretDirName)
+	defer os.RemoveAll(pullSecretDirName)
test/util/rosa.go-27-31 (1)

27-31: Misleading skip message and potential shell injection.

  1. The skip message mentions TEST_ROSA_LOGIN_ENV but the check is for TEST_ROSA_TOKEN.
  2. Passing the token directly in the shell command string risks command injection if the token contains shell metacharacters. Use environment variables or direct exec instead.
 	if len(os.Getenv("TEST_ROSA_TOKEN")) == 0 {
-		g.Skip("env TEST_ROSA_LOGIN_ENV not set")
+		g.Skip("env TEST_ROSA_TOKEN not set")
 	}
-	cmd := fmt.Sprintf(`rosa login --env "%s" --token "%s"`, os.Getenv("TEST_ROSA_LOGIN_ENV"), os.Getenv("TEST_ROSA_TOKEN"))
-	_, err := exec.Command("bash", "-c", cmd).Output()
+	cmd := exec.Command("rosa", "login", "--env", os.Getenv("TEST_ROSA_LOGIN_ENV"), "--token", os.Getenv("TEST_ROSA_TOKEN"))
+	_, err := cmd.Output()
test/util/assert.go-55-65 (1)

55-65: OrFail will panic on empty input or type mismatch.

Two potential panics:

  1. If vals is empty, vals[0] will panic with index out of range.
  2. If vals[0] is not of type T, the type assertion will panic.
 func OrFail[T any](vals ...any) T {
+	if len(vals) == 0 {
+		g.Fail("OrFail called with no arguments")
+		var zero T
+		return zero
+	}
 
 	for _, val := range vals {
 		err, ok := val.(error)
 		if ok {
 			o.ExpectWithOffset(1, err).NotTo(o.HaveOccurred())
 		}
 	}
 
-	return vals[0].(T)
+	result, ok := vals[0].(T)
+	if !ok {
+		g.Fail(fmt.Sprintf("OrFail: first value is not of expected type, got %T", vals[0]))
+	}
+	return result
 }

Committable suggestion skipped: line range outside the PR's diff.

test/util/manifest.go-43-54 (1)

43-54: File handle is never closed - resource leak.

os.Open creates a file handle that is never closed, causing a resource leak.

 func GetFileContent(baseDir string, name string) (fileContent string) {
 	filePath := filepath.Join(FixturePath("testdata", baseDir), name)
 	fileOpen, err := os.Open(filePath)
 	if err != nil {
 		e2e.Failf("Failed to open file: %s", filePath)
 	}
+	defer fileOpen.Close()
 	fileRead, err := io.ReadAll(fileOpen)
 	if err != nil {
 		e2e.Failf("Failed to read file: %s", filePath)
 	}
 	return string(fileRead)
 }

Alternatively, use os.ReadFile which handles this internally:

func GetFileContent(baseDir string, name string) string {
	filePath := filepath.Join(FixturePath("testdata", baseDir), name)
	data, err := os.ReadFile(filePath)
	if err != nil {
		e2e.Failf("Failed to read file %s: %v", filePath, err)
	}
	return string(data)
}
test/util/oc_copy.go-56-68 (1)

56-68: Potential token exposure when used as username.

When SelfSubjectReview fails, the bearer token is used as the username (line 60). If this username appears in logs, config files, or error messages, it could expose sensitive credentials.

Consider using a placeholder or masked value instead of the raw token:

 	case len(clientCfg.BearerToken) > 0:
-		username = clientCfg.BearerToken
+		username = "token-user" // avoid exposing token as username
test/util/manifest.go-69-73 (1)

69-73: Potential panic if filename has no extension.

If manifestFile contains no ., strings.Split returns a single-element slice, and accessing splitFileName[1] will panic.

-	splitFileName := strings.Split(manifestFile, ".")
-	manifestFileName := splitFileName[0] + strings.Replace(ts, ":", "", -1) + "." + splitFileName[1]
+	ext := filepath.Ext(manifestFile)
+	base := strings.TrimSuffix(manifestFile, ext)
+	manifestFileName := base + strings.ReplaceAll(ts, ":", "") + ext
test/util/networking.go-54-63 (1)

54-63: Potential panic if MCP output parsing fails.

If the oc get mcp command fails or returns unexpected output, strings.Fields(machineCount) could return fewer than 4 elements, causing an index out of bounds panic on line 59.

 	err := wait.PollUntilContextTimeout(context.TODO(), interval, timeout, false, func(ctx context.Context) (bool, error) {
-		machineCount, _ = oc.AsAdmin().WithoutNamespace().Run("get").Args("mcp", mcp, "-o=jsonpath={.status.machineCount}{\" \"}{.status.readyMachineCount}{\" \"}{.status.unavailableMachineCount}{\" \"}{.status.degradedMachineCount}").Output()
+		var err error
+		machineCount, err = oc.AsAdmin().WithoutNamespace().Run("get").Args("mcp", mcp, "-o=jsonpath={.status.machineCount}{\" \"}{.status.readyMachineCount}{\" \"}{.status.unavailableMachineCount}{\" \"}{.status.degradedMachineCount}").Output()
+		if err != nil {
+			e2e.Logf("Failed to get MCP %s: %v", mcp, err)
+			return false, nil
+		}
 		indexCount := strings.Fields(machineCount)
+		if len(indexCount) < 4 {
+			e2e.Logf("Unexpected MCP output format: %s", machineCount)
+			return false, nil
+		}
 		if strings.Compare(indexCount[0], indexCount[1]) == 0 && strings.Compare(indexCount[2], "0") == 0 && strings.Compare(indexCount[3], "0") == 0 {
test/util/clusterinfra/machineset_helper.go-33-42 (1)

33-42: Ignored errors from sjson.Set calls.

All sjson.Set calls ignore errors. If any fail, subsequent operations use incomplete JSON, potentially creating invalid MachineSet resources.

-	bytes, _ := ioutil.ReadFile(machineSetJSON)
-	machinesetjsonWithName, _ := sjson.Set(string(bytes), "metadata.name", ms.Name)
+	bytes, err := os.ReadFile(machineSetJSON)
+	o.Expect(err).NotTo(o.HaveOccurred())
+	machinesetjsonWithName, err := sjson.Set(string(bytes), "metadata.name", ms.Name)
+	o.Expect(err).NotTo(o.HaveOccurred())
 	// ... similar for other sjson.Set calls

Committable suggestion skipped: line range outside the PR's diff.

test/util/oauthserver/tokencmd/request_token_test.go-679-728 (1)

679-728: Same resource leak pattern: defer s.Close() inside loop.

Line 696 has the same issue as TestRequestToken. Move the close to the end of the loop iteration or use a subtest with t.Run() to properly scope the defer.

-		defer s.Close()
-
 		opts := &RequestTokenOptions{
 			ClientConfig: &restclient.Config{Host: tc.hostWrapper(s.URL)},
 			TokenFlow:    tc.tokenFlow,
@@ -724,5 +722,6 @@
 		if !reflect.DeepEqual(*tc.expectedConfig, *opts.OsinConfig) {
 			t.Errorf("%s: expected osin config does not match, %s", tc.name, diff.ObjectDiff(*tc.expectedConfig, *opts.OsinConfig))
 		}
+		s.Close()
 	}
 }
test/util/clusterinfra/machineset_helper.go-48-61 (1)

48-61: Inverted logic in failure status handling.

The condition logic is confusing and appears inverted:

  • WaitForMachineFailedToSkip returns nil when machine IS in "Failed" phase (line 103)
  • Line 50: if FailedStatus == nil → skip, which is correct (machine failed)
  • Line 55: if FailedStatus.Error() != "timed out..." → this path means poll returned early with a different error, but you skip saying "Failed due to invalid configuration"

The logic flow is hard to follow. Consider renaming the function or restructuring:

-		FailedStatus := WaitForMachineFailedToSkip(oc, ms.Name)
-		e2e.Logf("FailedStatus: %v\n", FailedStatus)
-		if FailedStatus == nil {
+		// WaitForMachineFailedToSkip returns nil if machine enters Failed phase
+		machineFailedErr := WaitForMachineFailedToSkip(oc, ms.Name)
+		if machineFailedErr == nil {
+			// Machine is in Failed state - cleanup and skip
 			ms.DeleteMachineSet(oc)
-			g.Skip("Something wrong invalid configuration for machines , not worth to continue")
-
+			g.Skip("Machine entered Failed phase - invalid configuration")
 		}
-		if FailedStatus.Error() != "timed out waiting for the condition" {
-
-			e2e.Logf("Check machineset yaml , machine is in failed status ...")
+		if machineFailedErr.Error() != "timed out waiting for the condition" {
+			// Unexpected error during polling
 			ms.DeleteMachineSet(oc)
-			g.Skip(" Failed due to invalid configuration for machines, not worth to continue")
+			g.Skip(fmt.Sprintf("Unexpected error waiting for machine: %v", machineFailedErr))
 		}
+		// Timeout means machine didn't fail - proceed to check if running
 		ms.AssertLabelledMachinesRunningDeleteIfNot(oc, ms.Replicas, ms.Name)
test/util/oauthserver/tokencmd/request_token_test.go-433-509 (1)

433-509: Resource leak: defer s.Close() inside loop causes server accumulation.

The defer s.Close() on line 480 is inside the test case loop, meaning all HTTP test servers will remain open until TestRequestToken returns. With 30+ test cases, this accumulates open servers and their goroutines.

Apply this diff to close servers immediately after each test case:

-		defer s.Close()
-
 		opts := &RequestTokenOptions{
 			ClientConfig: &restclient.Config{Host: s.URL},
 			Handler:      tc.Handler,
@@ -505,6 +503,7 @@
 		if i != len(tc.Requests) {
 			t.Errorf("%s: expected %d requests, saw %d", k, len(tc.Requests), i)
 		}
 		verifyReleased(k, tc.Handler)
+		s.Close()
 	}
 }
test/util/db/mysql.go-88-103 (1)

88-103: Same command injection risk in TestRemoteLogin.

The hostAddress parameter is also interpolated into the shell command without sanitization.
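
A minimal sketch of the same remediation (receiver, parameters, and env keys assumed from the sibling Query helpers, not the file's actual code):

func (m MySQL) testRemoteLoginSafe(oc *CLI, container, hostAddress string, env map[string]string) (string, error) {
	// hostAddress is passed as its own exec argument instead of being
	// spliced into a `bash -c` string, so shell metacharacters stay inert.
	return oc.Run("exec").Args(m.podName, "-c", container, "--",
		"mysql", "-h", hostAddress,
		fmt.Sprintf("-u%s", env["MYSQL_USER"]),
		fmt.Sprintf("-p%s", env["MYSQL_PASSWORD"]),
		"-e", "SELECT 1;").Output()
}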

test/util/db/postgresql.go-68-82 (1)

68-82: Same command injection risk in QueryPrivileged.

The query parameter has the same injection vulnerability as in Query.

test/util/db/mysql.go-73-86 (1)

73-86: Same command injection risk in QueryPrivileged.

test/util/db/mysql.go-57-71 (1)

57-71: Potential command injection vulnerability via unescaped query parameter.

Same issue as PostgreSQL helper - the query is directly interpolated into the shell command. MySQL's -e flag should receive the query as a separate argument rather than through shell interpolation.

-	return oc.Run("exec").Args(m.podName, "-c", container, "--", "bash", "-c",
-		fmt.Sprintf("mysql -h 127.0.0.1 -u%s -p%s -e \"%s\" %s",
-			masterConf.Env["MYSQL_USER"], masterConf.Env["MYSQL_PASSWORD"], query,
-			masterConf.Env["MYSQL_DATABASE"])).Output()
+	return oc.Run("exec").Args(m.podName, "-c", container, "--",
+		"mysql", "-h", "127.0.0.1",
+		fmt.Sprintf("-u%s", masterConf.Env["MYSQL_USER"]),
+		fmt.Sprintf("-p%s", masterConf.Env["MYSQL_PASSWORD"]),
+		"-e", query,
+		masterConf.Env["MYSQL_DATABASE"]).Output()
test/util/oauthserver/tokencmd/basicauth.go-54-56 (1)

54-56: Avoid using panic for unsupported code paths.

Using panic("unsupported") for the interactive prompting case is problematic. If this path is ever reached unexpectedly, it will crash the test runner. Return an error instead.

 	if (missingUsername || missingPassword) && c.Reader != nil {
-		panic("unsupported")
+		return nil, false, errors.New("interactive prompting is not supported")
 	}
test/util/db/postgresql.go-52-66 (1)

52-66: Potential command injection vulnerability via unescaped query parameter.

The query parameter is directly interpolated into a shell command via bash -c. If query contains shell metacharacters (e.g., "; rm -rf /), it could lead to command injection.

Consider using -- to separate psql options from the query, or pass the query via stdin instead of command interpolation:

-	return oc.Run("exec").Args(m.podName, "-c", container, "--", "bash", "-c",
-		fmt.Sprintf("psql postgres://%s:%s@127.0.0.1/%s -x -c \"%s\"",
-			masterConf.Env["POSTGRESQL_USER"], masterConf.Env["POSTGRESQL_PASSWORD"],
-			masterConf.Env["POSTGRESQL_DATABASE"], query)).Output()
+	return oc.Run("exec").Args(m.podName, "-c", container, "--",
+		"psql", fmt.Sprintf("postgres://%s:%s@127.0.0.1/%s",
+			masterConf.Env["POSTGRESQL_USER"], masterConf.Env["POSTGRESQL_PASSWORD"],
+			masterConf.Env["POSTGRESQL_DATABASE"]), "-x", "-c", query).Output()
test/util/nutanix_client.go-154-155 (1)

154-155: HTTP client lacks timeout; requests may hang indefinitely.

Creating an http.Client without a timeout can cause goroutine leaks if the Nutanix API is unresponsive.

-	client := &http.Client{}
+	client := &http.Client{
+		Timeout: 60 * time.Second,
+	}
test/util/architecture/architecture.go-172-176 (1)

172-176: Error from GetFirstMasterNode is ignored, then shadowed.

The first err from GetFirstMasterNode is never checked. If it fails, masterNode could be empty, causing the subsequent command to fail with a confusing error.

 func GetControlPlaneArch(oc *exutil.CLI) Architecture {
 	masterNode, err := exutil.GetFirstMasterNode(oc)
+	o.Expect(err).NotTo(o.HaveOccurred())
-	architecture, err := oc.AsAdmin().WithoutNamespace().Run("get").Args("node", masterNode, "-o=jsonpath={.status.nodeInfo.architecture}").Output()
+	archOutput, err := oc.AsAdmin().WithoutNamespace().Run("get").Args("node", masterNode, "-o=jsonpath={.status.nodeInfo.architecture}").Output()
 	o.Expect(err).NotTo(o.HaveOccurred())
-	return FromString(architecture)
+	return FromString(archOutput)
 }
test/util/nutanix_client.go-210-214 (1)

210-214: defer respPut.Body.Close() inside loop causes resource leak.

The defer won't execute until the function returns, causing response bodies to accumulate if multiple retries occur. Close the body immediately after reading or before the next iteration.

 		respPut, err := client.Do(reqPut)
 		if err != nil {
 			return fmt.Errorf("error sending request: %v", err)
 		}
-		defer respPut.Body.Close()

 		if respPut.StatusCode == http.StatusOK || respPut.StatusCode == http.StatusAccepted {
+			respPut.Body.Close()
 			return nil
 		} else if respPut.StatusCode == http.StatusConflict && attempt < maxRetries-1 {
 			fmt.Printf("Conflict detected, retrying in %v seconds... (attempt %d/%d)\n", retryDelay.Seconds(), attempt+1, maxRetries)
+			respPut.Body.Close()
 			time.Sleep(retryDelay)
 			continue
 		}

 		// Read the response body for debugging purposes
 		respBody, err := ioutil.ReadAll(respPut.Body)
+		respPut.Body.Close()
 		if err != nil {
 			return fmt.Errorf("error reading response body: %v", err)
 		}
test/util/nutanix_client.go-193-194 (1)

193-194: Unchecked type assertions can panic.

If the JSON structure doesn't match expectations, these nested type assertions will panic at runtime.

-	delete(vmData, "status")
-	vmData["spec"].(map[string]interface{})["resources"].(map[string]interface{})["power_state"] = targetState
+	delete(vmData, "status")
+	spec, ok := vmData["spec"].(map[string]interface{})
+	if !ok {
+		return fmt.Errorf("unexpected JSON structure: missing or invalid 'spec'")
+	}
+	resources, ok := spec["resources"].(map[string]interface{})
+	if !ok {
+		return fmt.Errorf("unexpected JSON structure: missing or invalid 'resources'")
+	}
+	resources["power_state"] = targetState
test/util/clusterinfra/machineset_nonspot.go-34-38 (1)

34-38: Ignored errors from file operations and JSON manipulation.

Same issue as in machine_arch.go - errors from ioutil.ReadFile and sjson.Set are silently ignored. This can lead to confusing test failures.

Handle errors properly using o.Expect(err).NotTo(o.HaveOccurred()) pattern as done elsewhere in this file (see lines 82-91 for the correct pattern in CreateMachineSetBasedOnExisting).
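
A sketch of that pattern applied to the first two calls here (variable names assumed from the machine_arch.go diff below):

bytes, err := os.ReadFile(machineSetJSON)
o.Expect(err).NotTo(o.HaveOccurred())
machinesetjsonWithName, err := sjson.Set(string(bytes), "metadata.name", ms.Name)
o.Expect(err).NotTo(o.HaveOccurred())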

test/util/oauthserver/oauthserver.go-207-220 (1)

207-220: Unbounded polling can cause test hangs.

wait.PollImmediateInfinite will block indefinitely if the pod never becomes ready. This can cause tests to hang without clear failure. Consider using wait.PollImmediate with a reasonable timeout (e.g., 10-15 minutes) to allow for graceful failure.

 func waitForOAuthServerPodReady(oc *exutil.CLI) error {
 	e2e.Logf("Waiting for the OAuth server pod to be ready")
-	return wait.PollImmediateInfinite(1*time.Second, func() (bool, error) {
+	return wait.PollImmediate(1*time.Second, 15*time.Minute, func() (bool, error) {
 		pod, err := oc.AdminKubeClient().CoreV1().Pods(oc.Namespace()).Get(context.Background(), "test-oauth-server", metav1.GetOptions{})
test/util/clusterinfra/machineset_nonspot.go-40-42 (1)

40-42: Incorrect JSON pattern for GCP spot removal.

The GCP preemptible pattern uses incorrect JSON syntax. The colon and value should be separated from the key.

 	machinesetjsonNonSpot := strings.ReplaceAll(machinesetjsonWithReplicas, "\"spotVMOptions\": {},", "") //azure
 	machinesetjsonNonSpot = strings.ReplaceAll(machinesetjsonNonSpot, "\"spotMarketOptions\": {},", "")   //aws
-	machinesetjsonNonSpot = strings.ReplaceAll(machinesetjsonNonSpot, "\"preemptible: true\",", "")       //gcp
+	machinesetjsonNonSpot = strings.ReplaceAll(machinesetjsonNonSpot, "\"preemptible\": true,", "")       //gcp

The same fix is needed on line 97.

test/util/oauthserver/oauthserver.go-475-482 (1)

475-482: Potential panic on empty pods list.

Accessing pods.Items[0] without checking if the list is empty will cause a panic if no pods match the selector.

 func getImage(oc *exutil.CLI) (string, error) {
 	selector, _ := labels.Parse("app=oauth-openshift")
 	pods, err := oc.AdminKubeClient().CoreV1().Pods("openshift-authentication").List(context.Background(), metav1.ListOptions{LabelSelector: selector.String()})
 	if err != nil {
 		return "", err
 	}
+	if len(pods.Items) == 0 {
+		return "", fmt.Errorf("no oauth-openshift pods found in openshift-authentication namespace")
+	}
 	return pods.Items[0].Spec.Containers[0].Image, nil
 }
test/util/oauthserver/oauthserver.go-493-495 (1)

493-495: Avoid panic in library code.

Using panic for error handling prevents callers from recovering gracefully. Return the error instead.

-	if err := osincli.PopulatePKCE(oauthClientConfig); err != nil {
-		panic(err)
-	}
+	if err := osincli.PopulatePKCE(oauthClientConfig); err != nil {
+		e2e.Logf("Error populating PKCE config: %v", err)
+		return nil
+	}

Committable suggestion skipped: line range outside the PR's diff.

test/util/clusterinfra/machine_arch.go-24-32 (1)

24-32: Ignored errors could mask failures.

Multiple errors from file read and JSON operations are silently ignored. If any of these operations fail, the test will proceed with corrupted data leading to confusing failures.

-	bytes, _ := ioutil.ReadFile(machineSetJSON)
-	machinesetjsonWithName, _ := sjson.Set(string(bytes), "metadata.name", ms.Name)
-	machinesetjsonWithSelector, _ := sjson.Set(machinesetjsonWithName, "spec.selector.matchLabels.machine\\.openshift\\.io/cluster-api-machineset", ms.Name)
+	bytes, err := os.ReadFile(machineSetJSON)
+	o.Expect(err).NotTo(o.HaveOccurred())
+	machinesetjsonWithName, err := sjson.Set(string(bytes), "metadata.name", ms.Name)
+	o.Expect(err).NotTo(o.HaveOccurred())
+	machinesetjsonWithSelector, err := sjson.Set(machinesetjsonWithName, "spec.selector.matchLabels.machine\\.openshift\\.io/cluster-api-machineset", ms.Name)
+	o.Expect(err).NotTo(o.HaveOccurred())

Apply similar pattern to all subsequent sjson.Set calls.

Committable suggestion skipped: line range outside the PR's diff.

test/util/azure_creds.go-125-138 (1)

125-138: Brittle JSON key transformation could corrupt data.

Using bytes.ReplaceAll to rename JSON keys could inadvertently modify values that contain these substrings. Consider using proper JSON unmarshaling with custom field mappings or a struct with both tag variants.

 func (ac *AzureCredentialsFromFile) LoadFromFile(filePath string) error {
 	fileData, err := os.ReadFile(filePath)
 	if err != nil {
 		return fmt.Errorf("error reading credentials file: %v", err)
 	}
-	fileData = bytes.ReplaceAll(fileData, []byte("azure_subscription_id"), []byte("subscriptionId"))
-	fileData = bytes.ReplaceAll(fileData, []byte("azure_client_id"), []byte("clientId"))
-	fileData = bytes.ReplaceAll(fileData, []byte("azure_client_secret"), []byte("clientSecret"))
-	fileData = bytes.ReplaceAll(fileData, []byte("azure_tenant_id"), []byte("tenantId"))
-	if err = json.Unmarshal(fileData, ac); err != nil {
-		return fmt.Errorf("error unmarshaling credentials file: %v", err)
+	// Try unmarshaling with both key formats
+	var rawData map[string]interface{}
+	if err = json.Unmarshal(fileData, &rawData); err != nil {
+		return fmt.Errorf("error unmarshaling credentials file: %v", err)
+	}
+	// Map both snake_case and camelCase keys
+	if v, ok := rawData["azure_subscription_id"]; ok {
+		ac.AzureSubscriptionID, _ = v.(string)
+	} else if v, ok := rawData["subscriptionId"]; ok {
+		ac.AzureSubscriptionID, _ = v.(string)
 	}
+	// ... similar for other fields
 	return nil
 }

Committable suggestion skipped: line range outside the PR's diff.

test/util/oauthserver/tokencmd/request_token.go-340-348 (1)

340-348: The error returned by req.ParseForm() is ignored.

req.ParseForm() does parse the URL query into req.Form, but its error result is silently dropped here. Since only the URL query parameters are needed, reading them directly via req.URL.Query() avoids the ignored-error path.

-	req.ParseForm()
-	if oauthErr := oauthErrFromValues(req.Form); oauthErr != nil {
+	if oauthErr := oauthErrFromValues(req.URL.Query()); oauthErr != nil {
 		return "", oauthErr
 	}
-	if len(req.Form.Get("code")) == 0 {
+	if len(req.URL.Query().Get("code")) == 0 {
 		return "", nil // no code parameter so this is not part of the OAuth flow
 	}
test/util/azure_client.go-315-323 (1)

315-323: Insecure temp directory and error handling.

  1. Predictable /tmp path vulnerable to symlink attacks.
  2. MkdirAll error is ignored (assigned to _).
-	dirname := "/tmp/" + oc.Namespace() + "-creds"
+	dirname, err := os.MkdirTemp("", "azure-creds-*")
+	if err != nil {
+		return accountName, "", fmt.Errorf("failed to create temp dir: %v", err)
+	}
 	defer os.RemoveAll(dirname)
-	_ = os.MkdirAll(dirname, 0777)
test/util/oauthserver/tokencmd/request_token.go-219-225 (1)

219-225: Deferred resp.Body.Close() inside loop causes resource accumulation.

Each iteration defers closing the response body, but these defers only execute when the function returns. In a loop with many iterations, this accumulates open connections until the function exits.

 	for {
 		// Make the request
 		resp, err := request(rt, requestURL, requestHeaders)
 		if err != nil {
 			return "", err
 		}
-		defer resp.Body.Close()
+		// Process response and ensure body is closed before next iteration

Consider restructuring to close the body explicitly at the end of each iteration or extracting the loop body into a separate function.

Committable suggestion skipped: line range outside the PR's diff.
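
One possible shape for that restructuring, sketched; handleResponse stands in for the existing per-iteration logic and is not the file's actual helper:

for {
	done, token, err := func() (bool, string, error) {
		resp, err := request(rt, requestURL, requestHeaders)
		if err != nil {
			return false, "", err
		}
		// Fires when this anonymous function returns, i.e. at the end of
		// each iteration, rather than when the enclosing function exits.
		defer resp.Body.Close()
		return handleResponse(resp)
	}()
	if err != nil || done {
		return token, err
	}
}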

test/util/azure_client.go-237-246 (1)

237-246: Silent error swallowing when credentials not found.

When both credential sources fail, the function returns nil error (line 243), silently hiding the failure. This could lead to confusing downstream errors.

 		credential, getSecErr = oc.AsAdmin().WithoutNamespace().Run("get").Args("secret/azure-cloud-credentials", "-n", "openshift-cloud-controller-manager", "-o=jsonpath={.data}").Output()
 		if getSecErr != nil {
-			return "", nil
+			return "", fmt.Errorf("failed to get Azure credentials from both kube-system and openshift-cloud-controller-manager: %v", getSecErr)
 		}
test/util/clusterinfra/cluster_helpers.go-39-42 (1)

39-42: Hardcoded vSphere server hostname is problematic.

This hardcodes a specific vSphere server URL which limits portability and may expose infrastructure details. Consider making this configurable via environment variable or cluster configuration.

-const (
-	//VsphereServer vSphere server hostname
-	VsphereServer = "vcenter.sddc-44-236-21-251.vmwarevmc.com"
-)
+// GetVsphereServer returns the vSphere server from environment or cluster config
+func GetVsphereServer(oc *exutil.CLI) (string, error) {
+	// Try environment variable first
+	if server := os.Getenv("VSPHERE_SERVER"); server != "" {
+		return server, nil
+	}
+	// Fall back to cluster infrastructure config
+	return oc.AsAdmin().WithoutNamespace().Run("get").Args(
+		"infrastructure", "cluster",
+		"-o=jsonpath={.spec.platformSpec.vsphere.vcenters[0].server}").Output()
+}

Committable suggestion skipped: line range outside the PR's diff.

test/util/openstack_util.go-52-67 (1)

52-67: Insecure temporary directory handling for credentials.

  1. Using predictable path under /tmp with namespace name is vulnerable to symlink attacks.
  2. Directory permissions 0777 are too permissive for credential storage.
  3. ioutil.ReadFile is deprecated.
 func GetOpenStackCredentials(oc *CLI) (*OpenstackCredentials, error) {
 	cred := &OpenstackCredentials{}
-	dirname := "/tmp/" + oc.Namespace() + "-creds"
+	dirname, err := os.MkdirTemp("", "openstack-creds-*")
+	if err != nil {
+		return cred, fmt.Errorf("failed to create temp dir: %v", err)
+	}
 	defer os.RemoveAll(dirname)
-	err := os.MkdirAll(dirname, 0777)
-	o.Expect(err).NotTo(o.HaveOccurred())
 
 	_, err = oc.AsAdmin().WithoutNamespace().Run("extract").Args("secret/openstack-credentials", "-n", "kube-system", "--confirm", "--to="+dirname).Output()
 	if err != nil {
 		return cred, err
 	}
 
-	confFile, err := ioutil.ReadFile(dirname + "/clouds.yaml")
+	confFile, err := os.ReadFile(dirname + "/clouds.yaml")
test/util/nodes.go-50-60 (1)

50-60: Potential panic on empty node list - missing bounds check.

GetFirstWorkerNode and GetFirstMasterNode access [0] without checking if the slice is empty. This will panic if no nodes with the specified role exist.

 func GetFirstWorkerNode(oc *CLI) (string, error) {
 	workerNodes, err := GetClusterNodesBy(oc, "worker")
+	if err != nil {
+		return "", err
+	}
+	if len(workerNodes) == 0 {
+		return "", fmt.Errorf("no worker nodes found")
+	}
-	return workerNodes[0], err
+	return workerNodes[0], nil
 }

 func GetFirstMasterNode(oc *CLI) (string, error) {
 	masterNodes, err := GetClusterNodesBy(oc, "master")
+	if err != nil {
+		return "", err
+	}
+	if len(masterNodes) == 0 {
+		return "", fmt.Errorf("no master nodes found")
+	}
-	return masterNodes[0], err
+	return masterNodes[0], nil
 }
test/util/nodes.go-253-258 (1)

253-258: Same bounds check issue - panics on empty slice.

GetFirstWorkerNodeByNodePoolNameInHostedCluster accesses workerNodes[0] without verifying the slice is non-empty.

 func GetFirstWorkerNodeByNodePoolNameInHostedCluster(oc *CLI, nodePoolName string) (string, error) {
 	workerNodes, err := GetAllNodesByNodePoolNameInHostedCluster(oc, nodePoolName)
 	o.Expect(err).NotTo(o.HaveOccurred())
+	o.Expect(len(workerNodes)).To(o.BeNumerically(">", 0), "Expected at least one worker node in nodepool %s", nodePoolName)
 	return workerNodes[0], err
 }
test/util/clusters.go-372-380 (1)

372-380: Potential panic on empty node list - missing bounds check.

IsAKSCluster accesses nodeList.Items[0] without checking if the list is empty. This will cause a runtime panic on clusters with no nodes.

 func IsAKSCluster(ctx context.Context, oc *CLI) (bool, error) {
 	nodeList, err := oc.AdminKubeClient().CoreV1().Nodes().List(ctx, metav1.ListOptions{})
 	if err != nil {
 		return false, fmt.Errorf("failed to list nodes: %w", err)
 	}
+	if len(nodeList.Items) == 0 {
+		return false, nil
+	}
 	_, labelFound := nodeList.Items[0].Labels[AKSNodeLabel]
 	return labelFound, nil
 }
test/util/prometheus_monitoring.go-261-269 (1)

261-269: Potential panic on empty results - missing bounds check.

ExtractSpecifiedValueFromMetricData4MemRSS accesses Result[0] without checking if the Result array is empty. This will cause a runtime panic if the Prometheus query returns no data.

 func ExtractSpecifiedValueFromMetricData4MemRSS(oc *CLI, metricResult string) (string, int) {
 	var ramMetricsInfo ContainerMemoryRSSType
 	jsonErr := json.Unmarshal([]byte(metricResult), &ramMetricsInfo)
 	o.Expect(jsonErr).NotTo(o.HaveOccurred())
+	o.Expect(len(ramMetricsInfo.Data.Result)).To(o.BeNumerically(">", 0), "Expected at least one result from Prometheus query")
 	e2e.Logf("Node: [%v], Pod Name: [%v], Status: [%v], Metric Name: [%v], Value: [%v]", ramMetricsInfo.Data.Result[0].Metric.Node, ramMetricsInfo.Data.Result[0].Metric.Pod, ramMetricsInfo.Status, ramMetricsInfo.Data.Result[0].Metric.MetricName, ramMetricsInfo.Data.Result[0].Value[1])
 	metricValue, err := strconv.Atoi(ramMetricsInfo.Data.Result[0].Value[1].(string))
 	o.Expect(err).NotTo(o.HaveOccurred())
 	return ramMetricsInfo.Data.Result[0].Metric.MetricName, metricValue
 }
test/util/azure_client_v2.go-276-283 (1)

276-283: Ignored error from client construction.

The error from armresources.NewResourceGroupsClient is ignored. If client creation fails, subsequent operations will panic with a nil pointer dereference.

 func (cs *AzureClientSet) CreateResourceGroup(ctx context.Context, resourceGroupName, region string) (armresources.ResourceGroupsClientCreateOrUpdateResponse, error) {
-	rgClient, _ := armresources.NewResourceGroupsClient(cs.SubscriptionID, cs.tokenCredential, nil)
+	rgClient, err := armresources.NewResourceGroupsClient(cs.SubscriptionID, cs.tokenCredential, nil)
+	if err != nil {
+		return armresources.ResourceGroupsClientCreateOrUpdateResponse{}, fmt.Errorf("failed to create resource group client: %w", err)
+	}
 	param := armresources.ResourceGroup{
 		Location: azTo.Ptr(region),
 	}
 	return rgClient.CreateOrUpdate(ctx, resourceGroupName, param, nil)
 }
test/util/azure_client_v2.go-285-304 (1)

285-304: Ignored errors from client/poller construction.

CreateStorageAccount ignores errors from NewAccountsClient and BeginCreate. These silent failures will cause confusing panics or incorrect behavior downstream.

 func (cs *AzureClientSet) CreateStorageAccount(ctx context.Context, resourceGroupName, storageAccountName, region string) (armstorage.AccountsClientListKeysResponse, error) {
-	storageClient, _ := armstorage.NewAccountsClient(cs.SubscriptionID, cs.tokenCredential, nil)
-	result, _ := storageClient.BeginCreate(ctx, resourceGroupName, storageAccountName, armstorage.AccountCreateParameters{
+	storageClient, err := armstorage.NewAccountsClient(cs.SubscriptionID, cs.tokenCredential, nil)
+	if err != nil {
+		return armstorage.AccountsClientListKeysResponse{}, fmt.Errorf("failed to create storage client: %w", err)
+	}
+	result, err := storageClient.BeginCreate(ctx, resourceGroupName, storageAccountName, armstorage.AccountCreateParameters{
 		Location: azTo.Ptr(region),
 		SKU: &armstorage.SKU{
 			Name: azTo.Ptr(armstorage.SKUNameStandardLRS),
 		},
 		Kind: azTo.Ptr(armstorage.KindStorageV2),
 	}, nil)
+	if err != nil {
+		return armstorage.AccountsClientListKeysResponse{}, fmt.Errorf("failed to begin storage account creation: %w", err)
+	}
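
The rest of the function falls outside this comment's range; assuming it waits for the operation and then lists keys (which the AccountsClientListKeysResponse return type suggests), the continuation would propagate errors the same way:

	if _, err = result.PollUntilDone(ctx, nil); err != nil {
		return armstorage.AccountsClientListKeysResponse{}, fmt.Errorf("storage account creation did not complete: %w", err)
	}
	return storageClient.ListKeys(ctx, resourceGroupName, storageAccountName, nil)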
test/util/aws_client.go-601-605 (1)

601-605: Potential panic if SecurityGroups slice is empty.

GetDefaultSecurityGroupByVpcID and GetSecurityGroupByGroupID access SecurityGroups[0] without checking slice length first.

 	ep, err := a.svc.DescribeSecurityGroups(&input)
 	if err != nil {
 		return nil, err
 	}
+	if len(ep.SecurityGroups) == 0 {
+		return nil, fmt.Errorf("no security group found with ID: %s", groupID)
+	}
 
 	return ep.SecurityGroups[0], nil

Also applies to: 633-637

test/util/framework.go-173-180 (1)

173-180: Logic error: isOCMProgressing is set to true when it should indicate completion.

The variable is initialized to true, then on line 178 it's set to true again when progressing is complete. This appears to be a bug — it should likely be set to false.

 			if condition.Status != operatorv1.ConditionFalse {
 				e2e.Logf("OCM rollout still progressing or in error: %v", condition.Status)
 				return false, nil
 			}
 			e2e.Logf("OCM rollout progressing status reports complete")
-			isOCMProgressing = true
+			isOCMProgressing = false
 			return true, nil
test/util/aws_client.go-80-86 (1)

80-86: Potential nil pointer dereference if Reservations[0].Instances[0] is empty.

While the code checks len(instanceInfo.Reservations) < 1, it doesn't verify that Instances slice is non-empty before accessing Instances[0].

 	if len(instanceInfo.Reservations) < 1 {
 		return "", &AWSInstanceNotFound{instanceName}
 	}
+	if len(instanceInfo.Reservations[0].Instances) < 1 {
+		return "", &AWSInstanceNotFound{instanceName}
+	}
 
 	instanceID := instanceInfo.Reservations[0].Instances[0].InstanceId

This pattern should be applied to other similar functions like GetAwsIntIPs, GetAwsInstanceState, GetAwsInstanceVPCId, etc.
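
A small shared helper would make the repeated bounds checks harder to miss; a sketch, with the ec2 types and the AWSInstanceNotFound error assumed from the snippet above:

func firstAwsInstance(out *ec2.DescribeInstancesOutput, instanceName string) (*ec2.Instance, error) {
	// guard both levels of nesting before indexing
	if len(out.Reservations) < 1 || len(out.Reservations[0].Instances) < 1 {
		return nil, &AWSInstanceNotFound{instanceName}
	}
	return out.Reservations[0].Instances[0], nil
}

Each of the listed functions could then call firstAwsInstance(instanceInfo, instanceName) instead of indexing directly.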

test/util/gcloud_client.go-293-299 (1)

293-299: Potential index out of bounds panic in CreateExternalVPNGateway.

The function assumes vpnAddress slice has at least 4 elements. If a caller passes a shorter slice, this will panic at runtime.

 func (gcloud *Gcloud) CreateExternalVPNGateway(gatewayName string, vpnAddress []string) (createExternalVPNGateway []byte, err error) {
+	if len(vpnAddress) < 4 {
+		return nil, fmt.Errorf("vpnAddress must have at least 4 elements, got %d", len(vpnAddress))
+	}
 	createExternalVPNGateway, err = exec.Command("bash", "-c", fmt.Sprintf(`gcloud compute external-vpn-gateways create %s --interfaces 0=%s,1=%s,2=%s,3=%s`, gatewayName, vpnAddress[0], vpnAddress[1], vpnAddress[2], vpnAddress[3])).CombinedOutput()
test/util/framework.go-1026-1031 (1)

1026-1031: Potential resource leak: defer w.Stop() inside a loop.

In WaitForAnImageStream, the watcher is created inside an infinite loop with a deferred Stop(). If the outer loop iterates multiple times, watchers accumulate and are only stopped when the function returns.

 		w, err := client.Watch(context.Background(), metav1.ListOptions{...})
 		if err != nil {
 			return err
 		}
-		defer w.Stop()
 
 		for {
 			val, ok := <-w.ResultChan()
 			if !ok {
 				// reget and re-watch
+				w.Stop()
 				break
 			}
 			// ... handle events ...
 		}
+		w.Stop()

Consider restructuring to stop the watcher explicitly before restarting the loop.
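
One way to keep the defer semantics without accumulating watchers is to scope each watch attempt to a closure; a sketch only, with the ListOptions and event handling from the original function elided:

	for {
		done, err := func() (bool, error) {
			w, err := client.Watch(context.Background(), metav1.ListOptions{})
			if err != nil {
				return true, err
			}
			defer w.Stop() // stops exactly this attempt's watcher, even on early return
			for {
				val, ok := <-w.ResultChan()
				if !ok {
					return false, nil // channel closed: re-get and re-watch
				}
				_ = val // ... handle events; return (true, nil) once the stream reaches the desired state ...
			}
		}()
		if done {
			return err
		}
	}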

Committable suggestion skipped: line range outside the PR's diff.

test/util/hypeshift_cluster.go-171-177 (1)

171-177: Fix error handling in GetHyperShiftHostedClusterNameSpace — check err, not namespace, for “resource type” errors

Currently:

namespace, err := oc.AsAdmin().WithoutNamespace().Run("get").Args(
    "hostedcluster", "-A", "--ignore-not-found", "-ojsonpath={.items[*].metadata.namespace}").Output()

if err != nil && !strings.Contains(namespace, "the server doesn't have a resource type") {
    o.Expect(err).NotTo(o.HaveOccurred(), "get hostedcluster fail: %v", err)
}

The “the server doesn't have a resource type” message will be in err.Error(), not in namespace (stdout), so this condition never does what the comment promises (“if not exist, it will return empty string”) and will instead fail the test whenever the CRD is missing.

Refactor to inspect the error text directly and return "" for the “no such resource type” case:

- namespace, err := oc.AsAdmin().WithoutNamespace().Run("get").Args(
-     "hostedcluster", "-A", "--ignore-not-found", "-ojsonpath={.items[*].metadata.namespace}").Output()
-
- if err != nil && !strings.Contains(namespace, "the server doesn't have a resource type") {
-     o.Expect(err).NotTo(o.HaveOccurred(), "get hostedcluster fail: %v", err)
- }
+ namespace, err := oc.AsAdmin().WithoutNamespace().Run("get").Args(
+     "hostedcluster", "-A", "--ignore-not-found", "-ojsonpath={.items[*].metadata.namespace}").Output()
+
+ if err != nil {
+     if strings.Contains(err.Error(), "the server doesn't have a resource type") {
+         e2e.Logf("hostedcluster resource type not found: %v", err)
+         return ""
+     }
+     o.Expect(err).NotTo(o.HaveOccurred(), "get hostedcluster fail: %v", err)
+ }

This matches the function’s documentation and avoids unexpected hard failures when hypershift CRDs are not installed.

Comment on lines +186 to +192
var skipShutdown bool = true
// skipShutdown parameter is optional, we are taking its true value here
future, vmErr := vmClient.PowerOff(ctx, resourceGroupName, vmName, &skipShutdown)
if err != nil {
e2e.Logf("cannot power off vm: %v", vmErr)
return osr, vmErr
}

⚠️ Potential issue | 🔴 Critical

Bug: Wrong error variable checked in StopAzureVM.

Line 189 checks err (which is the function return parameter, currently nil) instead of vmErr from the PowerOff call.

 	var skipShutdown bool = true
 	// skipShutdown parameter is optional, we are taking its true value here
 	future, vmErr := vmClient.PowerOff(ctx, resourceGroupName, vmName, &skipShutdown)
-	if err != nil {
+	if vmErr != nil {
 		e2e.Logf("cannot power off vm: %v", vmErr)
 		return osr, vmErr
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
 	var skipShutdown bool = true
 	// skipShutdown parameter is optional, we are taking its true value here
 	future, vmErr := vmClient.PowerOff(ctx, resourceGroupName, vmName, &skipShutdown)
-	if err != nil {
+	if vmErr != nil {
 		e2e.Logf("cannot power off vm: %v", vmErr)
 		return osr, vmErr
 	}
🤖 Prompt for AI Agents
In test/util/azure_client.go around lines 186 to 192, the code checks the wrong
error variable after calling vmClient.PowerOff: it checks the function's named
return variable `err` instead of the `vmErr` returned by PowerOff. Change the
conditional to check `if vmErr != nil`, log `vmErr` in the e2e.Logf call, and
return `osr, vmErr` so the actual PowerOff error is propagated.

Comment on lines +464 to +492
// StopAzureStackVM stops the virtual machine with the given name in the specified resource group using Azure CLI
func StopAzureStackVM(resourceGroupName, vmName string) error {
cmd := fmt.Sprintf(`az vm stop --name %s --resource-group %s --no-wait`, vmName, resourceGroupName)
err := exec.Command("bash", "-c", cmd).Run()
if err != nil {
return fmt.Errorf("error stopping VM: %v", err)
}
return nil
}

// StartAzureStackVM starts the virtual machine with the given name in the specified resource group using Azure CLI
func StartAzureStackVM(resourceGroupName, vmName string) error {
cmd := fmt.Sprintf(`az vm start --name %s --resource-group %s`, vmName, resourceGroupName)
output, err := exec.Command("bash", "-c", cmd).Output()
if err != nil {
return fmt.Errorf("error starting VM: %v, output: %s", err, output)
}
return nil
}

// GetAzureStackVMStatus gets the status of the virtual machine with the given name in the specified resource group using Azure CLI
func GetAzureStackVMStatus(resourceGroupName, vmName string) (string, error) {
cmd := fmt.Sprintf(`az vm show --name %s --resource-group %s --query 'powerState' --show-details |awk '{print $2}' | cut -d '"' -f1`, vmName, resourceGroupName)
instanceState, err := exec.Command("bash", "-c", cmd).Output()
if string(instanceState) == "" || err != nil {
return "", fmt.Errorf("Not able to get vm instance state :: %s", err)
}
return strings.Trim(string(instanceState), "\n"), err
}

⚠️ Potential issue | 🔴 Critical

Shell command injection vulnerability in Azure Stack functions.

The resourceGroupName and vmName parameters are directly interpolated into shell commands without sanitization. An attacker-controlled name could execute arbitrary commands.

-func StopAzureStackVM(resourceGroupName, vmName string) error {
-	cmd := fmt.Sprintf(`az vm stop --name %s --resource-group %s --no-wait`, vmName, resourceGroupName)
-	err := exec.Command("bash", "-c", cmd).Run()
+func StopAzureStackVM(resourceGroupName, vmName string) error {
+	cmd := exec.Command("az", "vm", "stop", "--name", vmName, "--resource-group", resourceGroupName, "--no-wait")
+	err := cmd.Run()

Apply similar fixes to StartAzureStackVM and GetAzureStackVMStatus.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In test/util/azure_client.go around lines 464 to 492, the shell commands build a
single bash string with fmt.Sprintf and unsanitized resourceGroupName and
vmName, which allows shell injection; fix StartAzureStackVM and
GetAzureStackVMStatus the same way as StopAzureStackVM should be fixed: stop
using "bash -c" and string interpolation and instead call exec.Command with the
az binary and each argument as a separate parameter (e.g. exec.Command("az",
"vm", "start", "--name", vmName, "--resource-group", resourceGroupName) and
include "--no-wait" for stop), and for GetAzureStackVMStatus avoid piping to
awk/cut by using az's machine-readable output (e.g. --query "powerState" -o tsv)
so you can parse the result in Go; also trim output, check for empty output, and
return wrapped errors with context instead of raw shell error messages.
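
A sketch of GetAzureStackVMStatus along those lines, assuming the az CLI is on the PATH and that --query powerState -o tsv is supported as described above:

func GetAzureStackVMStatus(resourceGroupName, vmName string) (string, error) {
	// each argument is passed separately, so names are never shell-interpreted
	out, err := exec.Command("az", "vm", "show",
		"--name", vmName,
		"--resource-group", resourceGroupName,
		"--show-details",
		"--query", "powerState",
		"-o", "tsv").Output()
	if err != nil {
		return "", fmt.Errorf("unable to get VM instance state: %w", err)
	}
	state := strings.TrimSpace(string(out))
	if state == "" {
		return "", fmt.Errorf("empty power state returned for VM %s", vmName)
	}
	return state, nil
}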

Comment on lines +64 to +69
if state != "running" {
e2e.Logf("Boostrap instance's state: %s", state)
// If the found instance is not running, return a &InstanceNotFound error
// it can be used to skip the test case if no bootstrap is present
return nil, &InstanceNotFound{bootstrapInstanceID}
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# Search for GetAzureVMInstanceState function definition
rg "func GetAzureVMInstanceState" -A 30

Repository: openshift/cluster-version-operator

Length of output: 212


🌐 Web query:

Azure VM GetAzureVMInstanceState PowerState running state format

💡 Result:

Short answer: the InstanceView returns a status object where the PowerState appears as a code "PowerState/<state>" (e.g. "PowerState/running") and a human displayStatus "VM running". To get it in PowerShell use -Status and read the Statuses array.

Example:

  • JSON snippet: { "code": "PowerState/running", "displayStatus": "VM running" }. [1]
  • PowerShell to extract display string:
    Get-AzVM -ResourceGroupName RG -Name VM -Status | Select -ExpandProperty Statuses | Where-Object { $_.Code -like 'PowerState/*' } | Select -ExpandProperty DisplayStatus
    (returns "VM running"). [1][2]

Sources: Azure docs (States and billing / Instance View) [1]; community examples (Get-AzureRmVM / Get-AzVM -Status extraction) [2].


Fix Azure VM state comparison — "running" will not match actual Azure state formats.

The comparison if state != "running" at line 64 is incorrect. Azure VM states are returned as either "PowerState/running" (code format) or "VM running" (display format), not as a bare "running" string. This causes running VMs to be incorrectly identified as not running, breaking the bootstrap validation logic. Verify the exact format returned by GetAzureVMInstanceState and update the comparison accordingly (e.g., using strings.Contains(state, "running") or comparing against the appropriate Azure state format).
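
For example, normalizing before comparing would accept both formats (a sketch; the exact string depends on what GetAzureVMInstanceState returns):

	// matches "PowerState/running" and "VM running" alike
	if !strings.Contains(strings.ToLower(state), "running") {
		e2e.Logf("Bootstrap instance's state: %s", state)
		return nil, &InstanceNotFound{bootstrapInstanceID}
	}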

🤖 Prompt for AI Agents
In test/util/bootstrap/azure.go around lines 64 to 69, the code compares the VM
state to the literal "running", which doesn't match Azure's returned formats
(e.g. "PowerState/running" or "VM running"); update the check to detect running
state robustly by normalizing or pattern-matching the returned string (for
example use strings.Contains(state, "running") or compare against the exact
format returned by GetAzureVMInstanceState), and ensure any necessary import is
added and tests adjusted if they assert exact state values.

if err != nil {
return false, err
}
return contains(imageList, imageIndex), nil

⚠️ Potential issue | 🔴 Critical

Undefined function contains will cause compilation failure.

The function contains is called but not defined in this file or imported. This will result in a build error.

Add the missing helper function or use a standard approach:

+func contains(slice []string, item string) bool {
+	for _, s := range slice {
+		if strings.Contains(s, item) {
+			return true
+		}
+	}
+	return false
+}
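
If exact matching is the intent rather than the substring matching above, and the module targets Go 1.21+, the standard library slices package (import "slices") already provides it:

	// exact-match variant; no hand-rolled helper needed
	return slices.Contains(imageList, imageIndex), nil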

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In test/util/container/podman_client.go around line 189, the return statement
calls an undefined function `contains`, causing a compile error; replace it by
implementing a small helper that checks whether the given string slice contains
the target (or inline the check): add a private function like `func
contains(slice []string, s string) bool` that loops over slice and returns true
when an element equals s (false otherwise), or replace the call with an
equivalent loop at the call-site, and ensure the function is in the same package
and the return signature remains `(<bool>, nil)`.

Comment on lines +93 to +106
client := &http.Client{}
reqest, err := http.NewRequest("DELETE", endpoint, nil)
if strings.Compare(c.Authorization, "") != 0 {
reqest.Header.Add("Authorization", c.Authorization)
}

if err != nil {
return false, err
}
response, err := client.Do(reqest)
defer response.Body.Close()
if err != nil {
return false, err
}

⚠️ Potential issue | 🔴 Critical

Fix nil pointer dereference and resource leak.

The defer response.Body.Close() on line 103 executes before checking if the request succeeded. If client.Do(reqest) returns an error, response may be nil, causing a panic.

Apply this diff:

 	client := &http.Client{}
-	reqest, err := http.NewRequest("DELETE", endpoint, nil)
+	request, err := http.NewRequest("DELETE", endpoint, nil)
 	if strings.Compare(c.Authorization, "") != 0 {
-		reqest.Header.Add("Authorization", c.Authorization)
+		request.Header.Add("Authorization", c.Authorization)
 	}
 
 	if err != nil {
 		return false, err
 	}
-	response, err := client.Do(reqest)
-	defer response.Body.Close()
+	response, err := client.Do(request)
 	if err != nil {
 		return false, err
 	}
+	defer response.Body.Close()
📝 Committable suggestion


Suggested change
 	client := &http.Client{}
-	reqest, err := http.NewRequest("DELETE", endpoint, nil)
+	request, err := http.NewRequest("DELETE", endpoint, nil)
 	if strings.Compare(c.Authorization, "") != 0 {
-		reqest.Header.Add("Authorization", c.Authorization)
+		request.Header.Add("Authorization", c.Authorization)
 	}
 
 	if err != nil {
 		return false, err
 	}
-	response, err := client.Do(reqest)
-	defer response.Body.Close()
+	response, err := client.Do(request)
 	if err != nil {
 		return false, err
 	}
+	defer response.Body.Close()
🤖 Prompt for AI Agents
In test/util/container/quay_client.go around lines 93 to 106, the code calls
http.NewRequest and then adds headers before checking the returned error, and it
defers response.Body.Close() immediately after client.Do(reqest) which can panic
if client.Do returns an error and response is nil; fix by checking the error
from http.NewRequest immediately and returning on error before touching reqest,
then call client.Do(reqest), check its err first, and only if response != nil
and err == nil defer response.Body.Close() (or explicitly close in the non-nil
success path) to avoid nil dereference and resource leak.

Comment on lines +200 to +212
var filterCmd = ""
if len(filter) > 0 {
filterCmd = " | grep -i " + filter
}
var filteredLogs []byte
var errCmd error
if combined {
filteredLogs, errCmd = exec.Command("bash", "-c", "cat "+podLogs+filterCmd).CombinedOutput()
} else {
filteredLogs, errCmd = exec.Command("bash", "-c", "cat "+podLogs+filterCmd).Output()
}
return string(filteredLogs), errCmd
}

⚠️ Potential issue | 🔴 Critical

Shell command injection vulnerability via filter parameter.

The filter parameter is directly concatenated into a shell command without sanitization. A malicious filter like "; rm -rf /" could execute arbitrary commands.

Consider using Go's strings.Contains or regexp instead of shelling out to grep:

-	var filterCmd = ""
-	if len(filter) > 0 {
-		filterCmd = " | grep -i " + filter
-	}
-	var filteredLogs []byte
-	var errCmd error
-	if combined {
-		filteredLogs, errCmd = exec.Command("bash", "-c", "cat "+podLogs+filterCmd).CombinedOutput()
-	} else {
-		filteredLogs, errCmd = exec.Command("bash", "-c", "cat "+podLogs+filterCmd).Output()
-	}
-	return string(filteredLogs), errCmd
+	content, errCmd := os.ReadFile(podLogs)
+	if errCmd != nil {
+		return "", errCmd
+	}
+	if len(filter) == 0 {
+		return string(content), nil
+	}
+	var filtered []string
+	for _, line := range strings.Split(string(content), "\n") {
+		if strings.Contains(strings.ToLower(line), strings.ToLower(filter)) {
+			filtered = append(filtered, line)
+		}
+	}
+	return strings.Join(filtered, "\n"), nil
🤖 Prompt for AI Agents
In test/util/pods.go around lines 200 to 212, the code builds a bash -c command
by concatenating the user-provided filter into the shell string, which allows
shell injection; replace the shell-out approach with native Go file reading and
filtering: read the podLogs file into memory (os.ReadFile), and if filter is
non-empty apply safe filtering via strings.Contains (or regexp.Compile + Match)
on each line to build filteredLogs; handle the combined vs non-combined
semantics by returning either the full file or the filtered lines as a single
string and return any file/read/regexp errors directly instead of invoking bash.

Comment on lines +769 to +793
func IsOneMasterWithNWorkerNodes(oc *CLI) bool {

//Skip one master with 1-N worker nodes senario
topologyTypeStdOut, err := oc.AsAdmin().WithoutNamespace().Run("get").Args("infrastructure", "-ojsonpath={.items[*].status.infrastructureTopology}").Output()
o.Expect(err).NotTo(o.HaveOccurred())
o.Expect(topologyTypeStdOut).NotTo(o.BeEmpty())
topologyType := strings.ToLower(topologyTypeStdOut)

masterNodes, err := oc.AsAdmin().WithoutNamespace().Run("get").Args("nodes", "-l", "node-role.kubernetes.io/worker", "-oname").Output()
o.Expect(err).NotTo(o.HaveOccurred())
o.Expect(masterNodes).NotTo(o.BeEmpty())
masterNodesArr := strings.Split(masterNodes, "\n")

workerNodes, err := oc.AsAdmin().WithoutNamespace().Run("get").Args("nodes", "-l", "node-role.kubernetes.io/worker", "-oname").Output()
o.Expect(err).NotTo(o.HaveOccurred())
o.Expect(workerNodes).NotTo(o.BeEmpty())
workerNodesArr := strings.Split(workerNodes, "\n")
workerNums := len(workerNodesArr)

if workerNodesArr[0] == masterNodesArr[0] {
return topologyType == "singlereplica" && workerNums > 1
} else {
return topologyType == "singlereplica" && workerNums >= 1
}
}

⚠️ Potential issue | 🔴 Critical

Bug: Both queries fetch worker nodes instead of master/worker.

The function queries for worker nodes twice, assigning both results incorrectly. The first query should fetch master nodes to match the function's intended behavior.

 func IsOneMasterWithNWorkerNodes(oc *CLI) bool {

 	//Skip one master with 1-N worker nodes senario
 	topologyTypeStdOut, err := oc.AsAdmin().WithoutNamespace().Run("get").Args("infrastructure", "-ojsonpath={.items[*].status.infrastructureTopology}").Output()
 	o.Expect(err).NotTo(o.HaveOccurred())
 	o.Expect(topologyTypeStdOut).NotTo(o.BeEmpty())
 	topologyType := strings.ToLower(topologyTypeStdOut)

-	masterNodes, err := oc.AsAdmin().WithoutNamespace().Run("get").Args("nodes", "-l", "node-role.kubernetes.io/worker", "-oname").Output()
+	masterNodes, err := oc.AsAdmin().WithoutNamespace().Run("get").Args("nodes", "-l", "node-role.kubernetes.io/master", "-oname").Output()
 	o.Expect(err).NotTo(o.HaveOccurred())
 	o.Expect(masterNodes).NotTo(o.BeEmpty())
 	masterNodesArr := strings.Split(masterNodes, "\n")

 	workerNodes, err := oc.AsAdmin().WithoutNamespace().Run("get").Args("nodes", "-l", "node-role.kubernetes.io/worker", "-oname").Output()
📝 Committable suggestion


Suggested change
 func IsOneMasterWithNWorkerNodes(oc *CLI) bool {
 
 	//Skip one master with 1-N worker nodes senario
 	topologyTypeStdOut, err := oc.AsAdmin().WithoutNamespace().Run("get").Args("infrastructure", "-ojsonpath={.items[*].status.infrastructureTopology}").Output()
 	o.Expect(err).NotTo(o.HaveOccurred())
 	o.Expect(topologyTypeStdOut).NotTo(o.BeEmpty())
 	topologyType := strings.ToLower(topologyTypeStdOut)
 
-	masterNodes, err := oc.AsAdmin().WithoutNamespace().Run("get").Args("nodes", "-l", "node-role.kubernetes.io/worker", "-oname").Output()
+	masterNodes, err := oc.AsAdmin().WithoutNamespace().Run("get").Args("nodes", "-l", "node-role.kubernetes.io/master", "-oname").Output()
 	o.Expect(err).NotTo(o.HaveOccurred())
 	o.Expect(masterNodes).NotTo(o.BeEmpty())
 	masterNodesArr := strings.Split(masterNodes, "\n")
 
 	workerNodes, err := oc.AsAdmin().WithoutNamespace().Run("get").Args("nodes", "-l", "node-role.kubernetes.io/worker", "-oname").Output()
 	o.Expect(err).NotTo(o.HaveOccurred())
 	o.Expect(workerNodes).NotTo(o.BeEmpty())
 	workerNodesArr := strings.Split(workerNodes, "\n")
 	workerNums := len(workerNodesArr)
 
 	if workerNodesArr[0] == masterNodesArr[0] {
 		return topologyType == "singlereplica" && workerNums > 1
 	} else {
 		return topologyType == "singlereplica" && workerNums >= 1
 	}
 }
🤖 Prompt for AI Agents
In test/util/psap.go around lines 769 to 793, the first node query incorrectly
fetches worker nodes instead of master nodes; change that call to query master
nodes (use the master role label, e.g. "node-role.kubernetes.io/master") so
masterNodes contains masters, and then trim the command output
(strings.TrimSpace) before splitting into arrays to avoid empty trailing
elements; keep the second query as the worker nodes query and then use the
corrected masterNodesArr and workerNodesArr for the topology checks.
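
A sketch of the trimming the prompt describes, so a trailing newline does not produce an empty array element:

	masterNodesArr := strings.Split(strings.TrimSpace(masterNodes), "\n")
	workerNodesArr := strings.Split(strings.TrimSpace(workerNodes), "\n")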

Comment on lines +28 to +45
// RandStrCustomize This is a utility function for random strings n: random strings' length, s: Customizable String Sets
func RandStrCustomize(s string, n int) string {
var src = rand.NewSource(time.Now().UnixNano())
b := make([]byte, n)
// A rand.Int63() generates 63 random bits, enough for letterIdMax letters!
for i, cache, remain := n-1, src.Int63(), letterIDMax; i >= 0; {
if remain == 0 {
cache, remain = src.Int63(), letterIDMax
}
if idx := int(cache & letterIDMask); idx < len(s) {
b[i] = s[idx]
i--
}
cache >>= letterIDBits
remain--
}
return *(*string)(unsafe.Pointer(&b))
}

⚠️ Potential issue | 🔴 Critical

Critical bug: empty string parameter causes infinite loop.

If s is an empty string, the condition at line 37 (idx < len(s)) will never be true, causing an infinite loop. Add validation at the start of the function.

Apply this diff to add input validation:

 func RandStrCustomize(s string, n int) string {
+	if len(s) == 0 {
+		panic("RandStrCustomize: string set cannot be empty")
+	}
+	if n < 0 {
+		panic("RandStrCustomize: length cannot be negative")
+	}
 	var src = rand.NewSource(time.Now().UnixNano())

Thread-safety issue: rand.Source is not safe for concurrent use.

Line 30 creates a rand.Source which is explicitly documented as not safe for concurrent use. If multiple test goroutines call these functions simultaneously, race conditions will occur.

Apply this diff to use a thread-safe approach:

 func RandStrCustomize(s string, n int) string {
-	var src = rand.NewSource(time.Now().UnixNano())
+	rng := rand.New(rand.NewSource(time.Now().UnixNano()))
 	b := make([]byte, n)
 	// A rand.Int63() generates 63 random bits, enough for letterIdMax letters!
-	for i, cache, remain := n-1, src.Int63(), letterIDMax; i >= 0; {
+	for i, cache, remain := n-1, rng.Int63(), letterIDMax; i >= 0; {
 		if remain == 0 {
-			cache, remain = src.Int63(), letterIDMax
+			cache, remain = rng.Int63(), letterIDMax
 		}

Note: a *rand.Rand created via rand.New() is not itself synchronized, but because each call constructs its own generator here, no source is shared between goroutines.


Consider using standard string conversion instead of unsafe pointer.

Line 44 uses unsafe.Pointer to avoid copying the byte slice. For test utilities, the performance gain is negligible and the unsafe pattern is risky for future Go versions.

Apply this diff for safer code:

-	return *(*string)(unsafe.Pointer(&b))
+	return string(b)

This also allows removing the unsafe import.

📝 Committable suggestion


Suggested change
 // RandStrCustomize This is a utility function for random strings n: random strings' length, s: Customizable String Sets
 func RandStrCustomize(s string, n int) string {
-	var src = rand.NewSource(time.Now().UnixNano())
+	if len(s) == 0 {
+		panic("RandStrCustomize: string set cannot be empty")
+	}
+	if n < 0 {
+		panic("RandStrCustomize: length cannot be negative")
+	}
+	rng := rand.New(rand.NewSource(time.Now().UnixNano()))
 	b := make([]byte, n)
 	// A rand.Int63() generates 63 random bits, enough for letterIdMax letters!
-	for i, cache, remain := n-1, src.Int63(), letterIDMax; i >= 0; {
+	for i, cache, remain := n-1, rng.Int63(), letterIDMax; i >= 0; {
 		if remain == 0 {
-			cache, remain = src.Int63(), letterIDMax
+			cache, remain = rng.Int63(), letterIDMax
 		}
 		if idx := int(cache & letterIDMask); idx < len(s) {
 			b[i] = s[idx]
 			i--
 		}
 		cache >>= letterIDBits
 		remain--
 	}
-	return *(*string)(unsafe.Pointer(&b))
+	return string(b)
 }
🤖 Prompt for AI Agents
In test/util/randstr.go around lines 28 to 45, add validation to return an empty
string (or handle error) immediately if the input set s is empty to avoid the
infinite loop; replace the use of rand.NewSource() directly with a rand.Rand
instance (e.g. r := rand.New(rand.NewSource(time.Now().UnixNano()))) and use
r.Int63() so the generator is safe for concurrent use; and replace the unsafe
byte-slice-to-string conversion with a normal string(b) conversion and remove
the unsafe import.

Comment on lines +15 to +20
_, err := os.Stat(os.Getenv("SHARED_DIR") + "/cluster-id")
if err != nil {
if !os.IsExist(err) {
return false
}
}

⚠️ Potential issue | 🔴 Critical

Incorrect file existence check logic.

os.IsExist(err) is almost never the correct check. It returns true only for specific "file already exists" errors (like from os.Mkdir when file exists). For checking if a file doesn't exist, use os.IsNotExist(err).

Current logic: if Stat fails and the error is NOT "file exists", return false. This is inverted.

 	_, err := os.Stat(os.Getenv("SHARED_DIR") + "/cluster-id")
 	if err != nil {
-		if !os.IsExist(err) {
+		if os.IsNotExist(err) {
 			return false
 		}
+		// For other errors (permissions, etc.), log and return false
+		return false
 	}
📝 Committable suggestion


Suggested change
 	_, err := os.Stat(os.Getenv("SHARED_DIR") + "/cluster-id")
 	if err != nil {
-		if !os.IsExist(err) {
+		if os.IsNotExist(err) {
 			return false
 		}
+		// For other errors (permissions, etc.), log and return false
+		return false
 	}
🤖 Prompt for AI Agents
In test/util/rosa.go around lines 15–20, the file-existence check uses
os.IsExist(err) which is incorrect; change it to check os.IsNotExist(err) and
invert the logic so that if os.Stat returns an error and os.IsNotExist(err) is
true you return false (and for any other non-nil err either return false or
handle/log/propagate as appropriate). In short: replace the current nested check
with a direct os.IsNotExist(err) check and return false when the file does not
exist (and treat other errors consistently).

Comment on lines +14 to +27
func RetrieveDesiredComputeNodes(clusterDescription *ClusterDescription) (nodesNb int, err error) {
if clusterDescription.Nodes[0]["Compute (desired)"] != nil {
var isInt bool
nodesNb, isInt = clusterDescription.Nodes[0]["Compute (desired)"].(int)
if !isInt {
err = fmt.Errorf("'%v' is not an integer value")
}
} else {
// Try autoscale one
autoscaleInfo := clusterDescription.Nodes[0]["Compute (Autoscaled)"].(string)
nodesNb, err = strconv.Atoi(strings.Split(autoscaleInfo, "-")[0])
}
return
}

⚠️ Potential issue | 🔴 Critical

Multiple critical bugs in this function.

  1. Line 15: No bounds check on Nodes[0] - will panic if Nodes is nil or empty.
  2. Line 19: Format string '%v' is not an integer value has no argument for %v.
  3. Line 23: Uses literal "Compute (Autoscaled)" but constant ClusterDescriptionComputeAutoscaled is "Compute (autoscaled)" (case mismatch).
  4. Line 23: Type assertion .(string) without ok check will panic if type doesn't match.
 func RetrieveDesiredComputeNodes(clusterDescription *ClusterDescription) (nodesNb int, err error) {
+	if clusterDescription == nil || len(clusterDescription.Nodes) == 0 {
+		return 0, fmt.Errorf("cluster description has no nodes data")
+	}
-	if clusterDescription.Nodes[0]["Compute (desired)"] != nil {
-		var isInt bool
-		nodesNb, isInt = clusterDescription.Nodes[0]["Compute (desired)"].(int)
+	if val := clusterDescription.Nodes[0][ClusterDescriptionComputeDesired]; val != nil {
+		var ok bool
+		nodesNb, ok = val.(int)
-		if !isInt {
-			err = fmt.Errorf("'%v' is not an integer value")
+		if !ok {
+			err = fmt.Errorf("'%v' is not an integer value", val)
 		}
 	} else {
 		// Try autoscale one
-		autoscaleInfo := clusterDescription.Nodes[0]["Compute (Autoscaled)"].(string)
-		nodesNb, err = strconv.Atoi(strings.Split(autoscaleInfo, "-")[0])
+		autoscaleVal, ok := clusterDescription.Nodes[0][ClusterDescriptionComputeAutoscaled].(string)
+		if !ok {
+			return 0, fmt.Errorf("no compute nodes info found")
+		}
+		nodesNb, err = strconv.Atoi(strings.Split(autoscaleVal, "-")[0])
 	}
 	return
 }


Labels

lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.
