GCPManagedMachinePool scaling via MachinePool.spec.replicas is silently ignored #1656

@jonathanrainer

Description

/kind bug

What steps did you take and what happened:

  1. Created a GCPManagedMachinePool without an explicit spec.scaling block.
  2. Waited for the node pool to reach RUNNING state and checked that the correct number of replicas was set in GCP.
  3. Edited MachinePool.spec.replicas to a different value.
  4. Observed that the controller reconciles repeatedly but the node pool size does not change.

The controller logs show the same four lines repeating in a loop:

"Reconciling node pool resources"
"Node pool running"
"Node pool config update required" request="...resource_labels:{labels:{key:\"capg-cluster-<name>\" value:\"owned\"}}"
"Node pool config updating in progress"

What did you expect to happen:
The GCP node pool would scale to the number of replicas set in MachinePool.spec.replicas.

Anything else you would like to add:
There seem to be three issues that compound on each other to cause this to happen:

  1. There is a perpetual diff on resource_labels. When the NodePool is created, the controller injects capg-cluster-<name> into NodeConfig.ResourceLabels. However, that label is set on the individual VMs, not on the NodePool object itself. As a result, checkDiffAndPrepareUpdateConfig reports a diff on every reconcile and returns early, and the controller calls updateNodePool. That call is futile because the NodePool object does not have any labels, so the loop never ends.
    • To fix this properly, the controller would need an additional API call to get the underlying InstanceTemplate and check the labels on that, rather than relying on the NodePool object itself.
    • In a related vein, ConvertToSdkLinuxNodeConfig, when called with a nil value, always produces a non-nil empty struct. This always differs from what the GKE API returns for pools that have no Linux node config set, which triggers the same perpetual diff.
  2. The three update checks in checkDiffAndPrepareUpdateConfig each return early if they find a difference, so the perpetual diff from the first point starves the autoscaling and size checks: they never run.
  3. ConvertSdkAutoscaling, when called with a nil value, produces a struct with Enabled: true. Any size update is then skipped silently because the autoscaling guard returns early. To scale manually, you are forced to set spec.scaling.enableAutoscaling: false explicitly.

All of this compounds to produce a constantly reconciling controller and effectively blocks manual scaling of a node pool.
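
To make the interaction easier to follow, here is a minimal, self-contained Go model of the behaviour described above. It is a sketch only: every type and function in it (NodePool, convertLinuxConfig, convertAutoscaling, firstUpdate) is a hypothetical stand-in invented for illustration, not the actual CAPG or GKE SDK code, and the check order is assumed from the description in this issue.

```go
package main

import (
	"fmt"
	"reflect"
)

// Hypothetical stand-ins for the SDK messages; not the real GKE types.
type LinuxNodeConfig struct{ Sysctls map[string]string }
type Autoscaling struct{ Enabled bool }

// NodePool models only the fields relevant to the diff checks.
type NodePool struct {
	ResourceLabels map[string]string
	LinuxConfig    *LinuxNodeConfig
	Autoscaling    *Autoscaling
	Size           int
}

// convertLinuxConfig mimics a converter that returns a non-nil empty struct for nil input.
func convertLinuxConfig(in *LinuxNodeConfig) *LinuxNodeConfig {
	if in == nil {
		return &LinuxNodeConfig{}
	}
	return in
}

// convertAutoscaling mimics a converter that defaults to Enabled: true for nil input.
func convertAutoscaling(in *Autoscaling) *Autoscaling {
	if in == nil {
		return &Autoscaling{Enabled: true}
	}
	return in
}

// firstUpdate mimics three ordered checks that each return early on a diff.
// It returns the name of the update that would be issued, or "none".
func firstUpdate(desired, observed NodePool) string {
	// Check 1: node config (labels and Linux node config).
	if !reflect.DeepEqual(desired.ResourceLabels, observed.ResourceLabels) ||
		!reflect.DeepEqual(convertLinuxConfig(desired.LinuxConfig), observed.LinuxConfig) {
		return "config"
	}
	// Check 2: autoscaling settings.
	want := convertAutoscaling(desired.Autoscaling)
	if !reflect.DeepEqual(want, observed.Autoscaling) {
		return "autoscaling"
	}
	// Guard: the size check is skipped when autoscaling is considered enabled.
	if want.Enabled {
		return "none"
	}
	// Check 3: node pool size.
	if desired.Size != observed.Size {
		return "size"
	}
	return "none"
}

func main() {
	labels := map[string]string{"capg-cluster-demo": "owned"}

	// Label is desired but never visible on the observed pool: config diff wins.
	fmt.Println(firstUpdate(NodePool{ResourceLabels: labels, Size: 5}, NodePool{Size: 3}))

	// No labels at all, but nil Linux config converts to an empty struct: still a config diff.
	fmt.Println(firstUpdate(NodePool{Size: 5}, NodePool{Size: 3}))

	// Config checks pass, autoscaling left nil: the size diff is skipped silently.
	observed := NodePool{LinuxConfig: &LinuxNodeConfig{}, Autoscaling: &Autoscaling{Enabled: true}, Size: 3}
	fmt.Println(firstUpdate(NodePool{Size: 5}, observed))

	// Autoscaling explicitly disabled on both sides: the size check finally runs.
	off := &Autoscaling{Enabled: false}
	observed.Autoscaling = off
	fmt.Println(firstUpdate(NodePool{Autoscaling: off, Size: 5}, observed))
}
```

In this model the four calls print config, config, none and size in that order. The size update is only reached once both perpetual config diffs are gone and autoscaling is explicitly disabled, which matches the workaround noted in point 3.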

Environment:

  • Cluster-api version: v0.26.0 (Operator)
  • Minikube/KIND version: N/A
  • Kubernetes version: (use kubectl version): 1.35.1
  • OS (e.g. from /etc/os-release): CoreOS
