@hors hors commented Sep 12, 2025

K8SPSMDB-1466

CHANGE DESCRIPTION

Description

This PR fixes the issue where the Percona Server MongoDB Operator would crash when encountering errors during Multi-Cluster Services (MCS) discovery, specifically the "stale GroupVersion discovery" error.

Problem

The operator was crashing with a "stale GroupVersion discovery" error.

This occurred during MCS registration: when dc.ServerPreferredResources() failed, the entire operator pod crashed and restarted.

Solution

  • Graceful Error Handling: Modified the Register() function in pkg/mcs/register.go to handle any discovery error gracefully
  • Mark MCS Unavailable: Instead of crashing, the operator now marks MCS as unavailable and continues normal operation
  • Unit Tests: Added unit tests covering MCS availability status, scheme group version initialization, and ServiceExport/ServiceExportList object creation
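
The error-handling approach described above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the code in pkg/mcs/register.go: the types apiResourceList, resourceDiscoverer and fakeDiscovery, the package-level variables, and the Register signature without a return value are all assumptions made so the example runs without client-go.

```go
package main

import (
	"errors"
	"fmt"
)

// apiResourceList is a hypothetical, trimmed-down stand-in for one entry of a
// discovery result: a group/version plus the kinds it serves.
type apiResourceList struct {
	GroupVersion string
	Kinds        []string
}

// resourceDiscoverer is a hypothetical interface that mirrors only the
// ServerPreferredResources call named in the description.
type resourceDiscoverer interface {
	ServerPreferredResources() ([]apiResourceList, error)
}

// Package-level state, assumed here so that IsAvailable is a plain getter.
var (
	available       bool
	mcsGroupVersion string
)

// errStaleDiscovery imitates the failure quoted in the description.
var errStaleDiscovery = errors.New("stale GroupVersion discovery")

// Register probes the cluster for the ServiceExport kind. Any discovery error
// is logged and treated as "MCS unavailable" instead of being propagated.
func Register(dc resourceDiscoverer) {
	available = false
	mcsGroupVersion = ""

	resources, err := dc.ServerPreferredResources()
	if err != nil {
		fmt.Printf("MCS discovery failed, marking MCS unavailable: %v\n", err)
		return
	}
	for _, list := range resources {
		for _, kind := range list.Kinds {
			if kind == "ServiceExport" {
				mcsGroupVersion = list.GroupVersion
				available = true
				return
			}
		}
	}
}

// IsAvailable reports whether the last Register call found ServiceExport.
func IsAvailable() bool { return available }

// fakeDiscovery is a test double that returns canned results.
type fakeDiscovery struct {
	resources []apiResourceList
	err       error
}

func (f fakeDiscovery) ServerPreferredResources() ([]apiResourceList, error) {
	return f.resources, f.err
}

func main() {
	// Discovery fails: the process keeps running and MCS is reported unavailable.
	Register(fakeDiscovery{err: errStaleDiscovery})
	fmt.Println("available after failed discovery:", IsAvailable())

	// Discovery succeeds and ServiceExport is served by the cluster.
	Register(fakeDiscovery{resources: []apiResourceList{
		{GroupVersion: "multicluster.x-k8s.io/v1alpha1", Kinds: []string{"ServiceExport"}},
	}})
	fmt.Println("available after successful discovery:", IsAvailable(), mcsGroupVersion)
}
```

Because the sketch's Register has no error to return, a caller on the startup path has nothing to fail on; it only needs to consult IsAvailable() before touching MCS resources.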

New Unit Tests in pkg/mcs/register_test.go

  • TestIsAvailable(): Tests MCS availability status
  • TestMCSSchemeGroupVersion(): Tests scheme group version initialization
  • TestServiceExport(): Tests ServiceExport object creation
  • TestServiceExportList(): Tests ServiceExportList object creation
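
To illustrate the kind of object-creation checks listed above, here is a hedged sketch of what constructing a ServiceExport and a ServiceExportList from a discovered group version might look like. Everything in it is hypothetical: groupVersion, objectMeta, serviceExport, serviceExportList and both constructor functions are simplified stand-ins rather than the operator's or the MCS API's real types, and the multicluster.x-k8s.io/v1alpha1 value is only an example group version.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// groupVersion is a hypothetical stand-in for a Kubernetes group/version pair.
type groupVersion struct {
	Group   string
	Version string
}

// String renders the pair the way an apiVersion field is written.
func (gv groupVersion) String() string {
	if gv.Group == "" {
		return gv.Version
	}
	return gv.Group + "/" + gv.Version
}

// objectMeta carries only the metadata fields this sketch needs.
type objectMeta struct {
	Name      string            `json:"name"`
	Namespace string            `json:"namespace"`
	Labels    map[string]string `json:"labels,omitempty"`
}

// serviceExport is a simplified stand-in for the ServiceExport object.
type serviceExport struct {
	APIVersion string     `json:"apiVersion"`
	Kind       string     `json:"kind"`
	Metadata   objectMeta `json:"metadata"`
}

// serviceExportList is a simplified stand-in for the list type.
type serviceExportList struct {
	APIVersion string          `json:"apiVersion"`
	Kind       string          `json:"kind"`
	Items      []serviceExport `json:"items"`
}

// newServiceExport builds an export whose apiVersion comes from the group
// version that discovery reported, rather than from a hard-coded string.
func newServiceExport(gv groupVersion, namespace, name string, labels map[string]string) serviceExport {
	return serviceExport{
		APIVersion: gv.String(),
		Kind:       "ServiceExport",
		Metadata:   objectMeta{Name: name, Namespace: namespace, Labels: labels},
	}
}

// newServiceExportList builds an empty list object for the same group version.
func newServiceExportList(gv groupVersion) serviceExportList {
	return serviceExportList{
		APIVersion: gv.String(),
		Kind:       "ServiceExportList",
		Items:      []serviceExport{},
	}
}

func main() {
	gv := groupVersion{Group: "multicluster.x-k8s.io", Version: "v1alpha1"}
	se := newServiceExport(gv, "psmdb", "my-cluster-rs0", map[string]string{"app": "demo"})

	out, err := json.MarshalIndent(se, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
	fmt.Println(newServiceExportList(gv).Kind)
}
```

A unit test for helpers of this shape would typically assert the kind, the apiVersion derived from the group version, and the metadata fields, which matches the scope of the tests named in the list.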

Benefits

  • Prevents Operator Crashes: Operator no longer crashes on MCS discovery errors
  • Graceful Degradation: MCS functionality is marked as unavailable but operator continues working
  • Improved Reliability: Operator is more resilient to external service issues
  • Test Coverage: The MCS package now has unit tests in pkg/mcs/register_test.go
  • Backward Compatible: No breaking changes to existing functionality

Testing

  • Unit tests pass for MCS functionality
  • Operator starts successfully without MCS available
  • Operator continues normal operation when MCS discovery fails
  • No regression in existing functionality

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Code quality improvement
  • Test coverage improvement

Checklist

  • Code follows the project's coding standards
  • Self-review of the code has been performed
  • Code has been commented, particularly in hard-to-understand areas
  • Unit tests have been added/updated
  • No new warnings or errors introduced
  • Changes are backward compatible

Related Issues

Fixes the operator crash that occurs when MCS discovery fails with the "stale GroupVersion discovery" error.

Additional Notes

This fix ensures that MCS (Multi-Cluster Services) is treated as an optional feature. When MCS discovery fails for any reason, the operator gracefully marks MCS as unavailable and continues with normal MongoDB cluster operations. This makes the operator more resilient and prevents unnecessary pod restarts.

@pull-request-size pull-request-size bot added the size/M 30-99 lines label Sep 12, 2025
@JNKPercona

Test name Status
arbiter passed
balancer passed
cross-site-sharded passed
custom-replset-name passed
custom-tls passed
custom-users-roles passed
custom-users-roles-sharded passed
data-at-rest-encryption passed
data-sharded passed
demand-backup passed
demand-backup-eks-credentials-irsa passed
demand-backup-fs passed
demand-backup-incremental passed
demand-backup-incremental-sharded passed
demand-backup-physical-parallel passed
demand-backup-physical-aws passed
demand-backup-physical-azure passed
demand-backup-physical-gcp passed
demand-backup-physical-minio passed
demand-backup-physical-sharded-parallel passed
demand-backup-physical-sharded-aws passed
demand-backup-physical-sharded-azure passed
demand-backup-physical-sharded-gcp passed
demand-backup-physical-sharded-minio passed
demand-backup-sharded passed
expose-sharded passed
finalizer passed
ignore-labels-annotations passed
init-deploy passed
ldap passed
ldap-tls passed
limits passed
liveness passed
mongod-major-upgrade passed
mongod-major-upgrade-sharded passed
monitoring-2-0 passed
monitoring-pmm3 passed
multi-cluster-service passed
multi-storage passed
non-voting-and-hidden passed
one-pod passed
operator-self-healing-chaos passed
pitr passed
pitr-physical passed
pitr-sharded passed
pitr-physical-backup-source passed
preinit-updates passed
pvc-resize passed
recover-no-primary passed
replset-overrides passed
rs-shard-migration passed
scaling passed
scheduled-backup passed
security-context passed
self-healing-chaos passed
service-per-pod passed
serviceless-external-nodes passed
smart-update passed
split-horizon passed
stable-resource-version passed
storage passed
tls-issue-cert-manager passed
upgrade passed
upgrade-consistency passed
upgrade-consistency-sharded-tls failure
upgrade-sharded passed
users passed
version-service passed
We ran 68 out of 68 tests.

commit: e77ba0f
image: perconalab/percona-server-mongodb-operator:PR-2044-e77ba0f9
