@gmarciani
Contributor

@gmarciani gmarciani commented Oct 28, 2025

Description of changes

This PR mitigates the performance degradation reported in aws/aws-parallelcluster#6449

Add a new Chef attribute, cluster/in_place_update_on_fleet_enabled, that can disable cfn-hup on compute and login nodes to improve performance at scale.
When cfn-hup is disabled on compute and login nodes, the cluster readiness checks executed by the head node are also disabled.
This attribute takes effect at config time (cluster creation or update), not at image build time.
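
As a minimal sketch of how the attribute could be supplied, the snippet below builds the JSON payload that carries it. It assumes the attribute is passed through ExtraChefAttributes (mentioned in the Q&A below) and that its value is a JSON boolean; the helper name extra_chef_attributes is hypothetical and not part of the cookbook.

```python
import json


def extra_chef_attributes(in_place_update_on_fleet_enabled: bool = True) -> str:
    """Return a JSON string carrying cluster/in_place_update_on_fleet_enabled.

    Assumption: the value is a JSON boolean; the default mirrors the
    documented default of true.
    """
    payload = {
        "cluster": {
            "in_place_update_on_fleet_enabled": in_place_update_on_fleet_enabled
        }
    }
    return json.dumps(payload)


# Payload that would disable cfn-hup on compute and login nodes.
print(extra_chef_attributes(False))
```

The resulting string is what would be placed in the ExtraChefAttributes field of the cluster configuration, under the assumptions stated above.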

Q&A

  1. Why disable the cluster readiness checks?
    Cluster readiness checks verify that all running compute/login nodes have deployed the expected config version. If cfn-hup is disabled, those nodes cannot apply the config version carried by the cluster update, so the check would always fail.

  2. Why not call the attribute in_place_update_on_fleet_disabled?
    We named it in_place_update_on_fleet_enabled rather than in_place_update_on_fleet_disabled for consistency (we use positive attributes in the rest of the cookbook) and maintainability (positive attributes are less error prone, e.g. they avoid double negations).

  3. Why disable cfn-hup on login nodes if only compute nodes are the source of the performance degradation?
    To provide a consistent user experience and implementation. If we kept cfn-hup on login nodes, login nodes would support in-place updates while compute nodes would not, which could lead to confusion and complexity.

  4. Why not test updates of the new attribute?
    Updates to ExtraChefAttributes have never been supported, as per the update policy here
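
The reasoning in question 1 can be illustrated with a small simulation. This is only a sketch under stated assumptions: the Node class, apply_update, and readiness_check are hypothetical names that model the behavior described above, not the cookbook's actual implementation, and the sketch assumes at least one node is running.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    config_version: str
    cfn_hup_enabled: bool


def apply_update(nodes, new_version):
    """Model an in-place update: only nodes running cfn-hup pick up the new version."""
    for node in nodes:
        if node.cfn_hup_enabled:
            node.config_version = new_version


def readiness_check(nodes, expected_version):
    """Pass only if every running node reports the expected config version."""
    return all(node.config_version == expected_version for node in nodes)


# Fleet with cfn-hup disabled: nodes stay on the old version, so the check fails.
fleet = [Node("compute-1", "v1", False), Node("login-1", "v1", False)]
apply_update(fleet, "v2")
print(readiness_check(fleet, "v2"))  # False
```

With cfn-hup enabled on every node in the model, the same check passes, which is why the check is only skipped when the attribute is set to false.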

User Experience

By default the attribute is true, so cfn-hup is enabled on all cluster nodes.
When set to false, cfn-hup is disabled on both compute and login nodes. When this is the case, the cluster readiness checks are disabled because, without cfn-hup, compute/login nodes cannot start an in-place update, so the checks would always fail.

[UseCase 1] in-place updates enabled

This is the default behavior: cfn-hup is enabled on the head node, compute nodes, and login nodes. Because cfn-hup is enabled, compute/login nodes can execute in-place updates, so the head node executes the usual cluster readiness check at the end of the update.

[UseCase 2] in-place updates disabled

cfn-hup is enabled on the head node, but disabled on both compute nodes and login nodes. Because cfn-hup is disabled, compute/login nodes cannot execute in-place updates, so the head node does not execute the cluster readiness check at the end of the update.
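
The two use cases can be summarized as a small decision table. The sketch below is illustrative only: the function names and the node type labels ("head", "compute", "login") are assumptions chosen for readability and do not reflect the cookbook's identifiers.

```python
def cfn_hup_enabled(node_type: str, in_place_update_on_fleet_enabled: bool = True) -> bool:
    """cfn-hup always runs on the head node; on the fleet it follows the attribute."""
    if node_type == "head":
        return True
    return in_place_update_on_fleet_enabled


def readiness_check_enabled(in_place_update_on_fleet_enabled: bool = True) -> bool:
    """The head node runs the readiness check only when the fleet can update in place."""
    return in_place_update_on_fleet_enabled


for flag in (True, False):
    print(f"in_place_update_on_fleet_enabled={flag}")
    for node_type in ("head", "compute", "login"):
        print(f"  cfn-hup on {node_type}: {cfn_hup_enabled(node_type, flag)}")
    print(f"  readiness check: {readiness_check_enabled(flag)}")
```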

Tests

  • Unit tests (Existing and new ones)
  • Manually validated all the use cases reported in User Experience.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@gmarciani gmarciani added the 3.x label Oct 28, 2025
@gmarciani gmarciani marked this pull request as ready for review October 28, 2025 22:19
@gmarciani gmarciani requested review from a team as code owners October 28, 2025 22:19
@gmarciani gmarciani force-pushed the wip/mgiacomo/3150/performance-disable-cfnhup-on-compute-nodes-1024-1 branch from b4acceb to 7e49541 Compare October 29, 2025 17:07
@gmarciani gmarciani changed the title Add chef attribute cluster/cfnhup_on_fleet_enabled to disable cfn-hup on compute and login nodes. Add chef attribute cluster/in_place_update_on_fleet_enabled to disable cfn-hup on compute and login nodes and improve performance at scale Oct 29, 2025
@gmarciani gmarciani force-pushed the wip/mgiacomo/3150/performance-disable-cfnhup-on-compute-nodes-1024-1 branch 3 times, most recently from aed529e to dedb84e Compare October 29, 2025 18:28
…abled` to disable in-place updates on compute and login nodes by disabling cfn-hup on those nodes.

As a consequence, it also disables the cluster readiness checks executed by the head node on cluster update.

Disabling cfn-hup mitigates a significant performance degradation that may occur with tightly coupled workloads at scale.
@gmarciani gmarciani force-pushed the wip/mgiacomo/3150/performance-disable-cfnhup-on-compute-nodes-1024-1 branch from dedb84e to 61fab33 Compare October 29, 2025 18:30
@himani2411
Contributor

I would suggest that, when you describe the user experience, you refrain from mentioning [UseCase 1] cfn-hup enabled or [UseCase 1] cfn-hup disabled and instead explain what cfn-hup is used for, just as cluster_readiness_check is explained.

@gmarciani
Contributor Author

I would suggest that, when you describe the user experience, you refrain from mentioning [UseCase 1] cfn-hup enabled or [UseCase 1] cfn-hup disabled and instead explain what cfn-hup is used for, just as cluster_readiness_check is explained.

Done, both here and in the PR for the CLI aws/aws-parallelcluster#7071

@gmarciani gmarciani enabled auto-merge (rebase) October 29, 2025 21:35
Contributor

@himani2411 himani2411 left a comment

LGTM!

@gmarciani gmarciani merged commit 6eda378 into aws:develop Oct 30, 2025
26 of 30 checks passed
@gmarciani gmarciani deleted the wip/mgiacomo/3150/performance-disable-cfnhup-on-compute-nodes-1024-1 branch October 30, 2025 16:20