
AWS ParallelCluster v2.4.0

@lukeseawalker lukeseawalker released this 11 Jun 15:31
· 76 commits to master since this release
9dbff99

We're excited to announce the release of AWS ParallelCluster Node 2.4.0.

This is associated with AWS ParallelCluster v2.4.0.

Enhancements

  • Dynamically fetch compute instance type and cluster size in order to support updates
  • SGE:
    • process nodes added to or removed from the cluster in batches to speed up cluster scaling
    • scale up only if required slots/nodes can be satisfied
    • scale down if pending jobs have unsatisfiable CPU/nodes requirements
    • add support for jobs in hold/suspended state (this includes job dependencies)
    • automatically terminate and replace faulty or unresponsive compute nodes
    • add retries in case of failures when adding or removing nodes
  • Slurm:
    • scale up only if required slots/nodes can be satisfied
    • scale down if pending jobs have unsatisfiable CPU/nodes requirements
    • automatically terminate and replace faulty or unresponsive compute nodes
  • Dump logs of replaced failing compute nodes to shared home directory
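
The "scale up only if required slots/nodes can be satisfied" rule listed for SGE and Slurm can be pictured with a small sketch. This is not the code shipped in the release: `PendingJob`, `nodes_to_request` and all of their parameters are hypothetical names chosen for illustration, and the sketch assumes that running nodes are fully occupied and that every compute node offers the same number of slots.

```python
import math
from dataclasses import dataclass


@dataclass
class PendingJob:
    """Hypothetical view of a pending job's resource request."""
    slots: int  # total CPU slots requested
    nodes: int  # minimum number of nodes requested


def nodes_to_request(pending_jobs, slots_per_node, max_cluster_size, running_nodes):
    """Return how many nodes to add, counting only jobs the cluster could ever run."""
    required = 0
    for job in pending_jobs:
        # Nodes needed to cover both the node count and the slot count.
        needed = max(job.nodes, math.ceil(job.slots / slots_per_node))
        # A job that cannot fit even in a full-size cluster is unsatisfiable:
        # skip it so that it does not trigger a pointless scale-up.
        if needed > max_cluster_size:
            continue
        required += needed
    # Never ask for more capacity than the cluster limit allows.
    return max(0, min(required, max_cluster_size - running_nodes))
```

With 4 slots per node and a maximum size of 10, a job asking for 400 slots is ignored, while jobs asking for 8 slots on 1 node and 2 slots on 3 nodes together request 5 nodes.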

Changes

  • SQS messages that fail to be processed are re-queued at most 3 times instead of indefinitely
  • Reset idletime to 0 when the host becomes essential to the cluster (because of the minimum size of the ASG or
    because there are pending jobs in the scheduler queue)
  • SGE: a node is considered busy when it is in one of the following states: "u", "C", "s", "d", "D", "E", "P", "o".
    This allows a quick replacement of the node without waiting for the nodewatcher to terminate it.
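
The cap on re-queued SQS messages from the first item in this list can be sketched as follows. This is a minimal illustration rather than the actual daemon code: it uses an in-memory `deque` in place of SQS, and the `drain` function, the `handler` callback and the `requeues` counter are all hypothetical. It reads "re-queued only 3 times" as one initial attempt plus at most three retries.

```python
from collections import deque

MAX_REQUEUES = 3  # taken from the release note; everything else is hypothetical


def drain(queue, handler):
    """Process messages, re-queueing failures at most MAX_REQUEUES times.

    Returns the messages that were dropped after exhausting their retries.
    """
    dropped = []
    while queue:
        message = queue.popleft()
        try:
            handler(message["body"])
        except Exception:
            if message["requeues"] < MAX_REQUEUES:
                # Put the message back at the end of the queue and count the retry.
                message["requeues"] += 1
                queue.append(message)
            else:
                # Retry budget exhausted: give up instead of looping forever.
                dropped.append(message)
    return dropped
```

Each message is expected to be a dict such as `{"body": ..., "requeues": 0}`. A message whose handler always fails is attempted four times in total and then returned in the dropped list.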

Bug Fixes

  • Slurm: add "BeginTime", "NodeDown", "Priority" and "ReqNodeNotAvail" to the pending reasons that trigger
    cluster scaling
  • Add a timeout on remote command execution so that the daemons do not get stuck if a compute node is unresponsive
  • Fix an edge case that caused the nodewatcher to hang forever when the node became essential to the
    cluster during a call to self_terminate.
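
The idea behind the remote command timeout in this list can be shown with Python's standard `subprocess` module. This is a sketch under assumptions, not the fix that was shipped: `run_with_timeout` is a hypothetical helper, and a real caller would presumably pass something like an `ssh` invocation against the compute node as `command`.

```python
import subprocess


def run_with_timeout(command, timeout_seconds):
    """Run a command and return its exit code, or None if it timed out."""
    try:
        completed = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising,
        # so the caller is never left waiting on an unresponsive command.
        return None
    return completed.returncode
```

Returning `None` on timeout lets the caller tell an unresponsive node apart from a command that ran and failed with a non-zero exit code.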

Support

Need help / have a feature request?
AWS Support: https://console.aws.amazon.com/support/home
ParallelCluster issue tracker on GitHub: https://github.com/aws/aws-parallelcluster
The HPC Forum on the AWS Forums page: https://forums.aws.amazon.com/forum.jspa?forumID=192