Releases · aws/aws-parallelcluster-node · GitHub

29 Jul 10:32

demartinofra

AWS ParallelCluster v2.4.1

We're excited to announce the release of AWS ParallelCluster Node 2.4.1.

This is associated with AWS ParallelCluster v2.4.1.

Enhancements

Torque:
- process nodes added to or removed from the cluster in batches in order to speed up cluster scaling.
- scale up only if required slots/nodes can be satisfied
- scale down if pending jobs have unsatisfiable CPU/nodes requirements
- add support for jobs in hold/suspended state (this includes job dependencies)
- automatically terminate and replace faulty or unresponsive compute nodes
- add retries in case of failures when adding or removing nodes
- add support for ncpus reservation and multi-node resource allocation (e.g. -l nodes=2:ppn=3+3:ppn=6)
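
To make the multi-node request syntax above concrete, here is a minimal sketch of how a spec such as 2:ppn=3+3:ppn=6 could be broken down into node and slot totals. The function names are hypothetical and are not taken from the node daemons. The sketch assumes every chunk starts with a numeric node count (Torque also accepts host names there, which are not handled here).

```python
def parse_nodes_spec(spec):
    """Parse the value of a Torque '-l nodes=' request, e.g. '2:ppn=3+3:ppn=6'.

    Returns a list of (node_count, procs_per_node) tuples, one per '+' chunk.
    Assumption: each chunk starts with a numeric node count.
    """
    chunks = []
    for chunk in spec.split("+"):
        parts = chunk.split(":")
        count = int(parts[0])
        ppn = 1  # one processor per node when ppn is omitted
        for prop in parts[1:]:
            if prop.startswith("ppn="):
                ppn = int(prop[len("ppn="):])
        chunks.append((count, ppn))
    return chunks


def required_resources(spec):
    """Return (total_nodes, total_slots) requested by the spec."""
    chunks = parse_nodes_spec(spec)
    total_nodes = sum(count for count, _ in chunks)
    total_slots = sum(count * ppn for count, ppn in chunks)
    return total_nodes, total_slots
```

For the example in the list above, required_resources("2:ppn=3+3:ppn=6") yields 5 nodes and 24 slots: two nodes with 3 processors each plus three nodes with 6 processors each.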

Changes

Drop support for Python 2. Node daemons now support Python >= 3.5.
Torque: trigger a scheduling cycle every minute when there are pending jobs in the queue, in order to speed up job scheduling with a dynamic cluster size.

Bug Fixes

Restore the logic that automatically added the identity of compute nodes to the known_hosts file.
Slurm: fix an issue that caused the daemons to fail when the cluster is stopped and an empty compute nodes file is imported in the Slurm config.
Torque: fix command to disable hosts in the scheduler before termination.

Support

Need help / have a feature request?
AWS Support: https://console.aws.amazon.com/support/home
ParallelCluster Issues tracker on GitHub: https://github.com/aws/aws-parallelcluster
The HPC Forum on the AWS Forums page: https://forums.aws.amazon.com/forum.jspa?forumID=192


11 Jun 15:31

AWS ParallelCluster v2.4.0

We're excited to announce the release of AWS ParallelCluster Node 2.4.0.

This is associated with AWS ParallelCluster v2.4.0.

Enhancements

Dynamically fetch compute instance type and cluster size in order to support updates
SGE:
- process nodes added to or removed from the cluster in batches in order to speed up cluster scaling.
- scale up only if required slots/nodes can be satisfied
- scale down if pending jobs have unsatisfiable CPU/nodes requirements
- add support for jobs in hold/suspended state (this includes job dependencies)
- automatically terminate and replace faulty or unresponsive compute nodes
- add retries in case of failures when adding or removing nodes
Slurm:
- scale up only if required slots/nodes can be satisfied
- scale down if pending jobs have unsatisfiable CPU/nodes requirements
- automatically terminate and replace faulty or unresponsive compute nodes
Dump logs of replaced failing compute nodes to shared home directory

Changes

SQS messages that fail to be processed are re-queued at most 3 times instead of indefinitely
Reset idletime to 0 when the host becomes essential for the cluster (because of the min size of the ASG or because there are pending jobs in the scheduler queue)
SGE: a node is considered busy when it is in one of the following states: "u", "C", "s", "d", "D", "E", "P", "o". This allows a quick replacement of the node without waiting for the nodewatcher to terminate it.
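
As an illustration of the SGE busy-state rule above, the following sketch checks a node state against the listed states. It assumes the state is reported as a string of single-letter flags (for example "au"). The constant and function names are hypothetical and are not taken from the daemons' code.

```python
# States listed in the release note that mark a node as busy.
SGE_BUSY_STATES = frozenset(["u", "C", "s", "d", "D", "E", "P", "o"])


def is_node_busy(state_flags):
    """Return True if any flag in the state string is one of the busy states.

    Assumption: state_flags is a string of single-letter flags, e.g. "au".
    """
    return any(flag in SGE_BUSY_STATES for flag in state_flags)
```

With this check, a node in state "au" counts as busy because of the "u" flag, while a node with an empty state string does not.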

Bug Fixes

Slurm: add "BeginTime", "NodeDown", "Priority" and "ReqNodeNotAvail" to the pending reasons that trigger cluster scaling
Add a timeout on remote command execution so that the daemons do not get stuck if the compute node is unresponsive
Fix an edge case that caused the nodewatcher to hang forever if the node became essential to the cluster during a call to self_terminate.
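
The timeout on remote commands mentioned above can be sketched with the standard library alone. This is only an illustration of the general technique, not the daemons' actual implementation. The helper name is hypothetical, and the command is any argument list (for example an ssh invocation).

```python
import subprocess
import sys


def run_with_timeout(command, timeout_seconds):
    """Run a command and return its exit code, or None if it timed out.

    subprocess.run kills the child process and raises TimeoutExpired
    once the timeout elapses, so the caller never blocks indefinitely.
    """
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        return None
    return result.returncode
```

A caller can treat a None result as a sign that the compute node is unresponsive and move on, rather than waiting on it.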


03 Apr 09:00

enrico-usai

AWS ParallelCluster 2.3.1

We're excited to announce the release of AWS ParallelCluster Node 2.3.1.

This is associated with AWS ParallelCluster v2.3.1.

Changes

sqswatcher: Slurm - dynamically adjust max cluster size based on ASG settings
sqswatcher: Slurm - use FUTURE state for dummy nodes to prevent the Slurm daemon from contacting nonexistent nodes
sqswatcher: Slurm - dynamically change the number of configured FUTURE nodes based on the actual nodes that join the cluster. The max size of the cluster seen by the scheduler always matches the max capacity of the ASG.
sqswatcher: Slurm - process nodes added to or removed from the cluster in batches. This speeds up cluster scaling, which now reacts to variations in the ASG capacity with a delay of less than 1 minute.
sqswatcher: Slurm - add support for job dependencies and pending reasons. The cluster won't scale up if the job cannot start due to an unsatisfied dependency.
Slurm - set ReturnToService=1 in scheduler config in order to recover instances that were initially marked as down due to a transient issue.
sqswatcher: remove DynamoDB table creation
improve and standardize shell command execution
add retries on failures and exceptions
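
The batch processing of node events described above can be pictured with a small chunking helper. This is a generic sketch with hypothetical names. The release notes do not state how sqswatcher groups events or what batch size it uses.

```python
def batches(items, batch_size):
    """Yield consecutive slices of items, each at most batch_size long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

Applying one scheduler update per batch instead of one per node is what reduces the number of scheduler calls when many nodes join or leave at once.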

Bug Fixes

sqswatcher: Slurm - set compute nodes to DRAIN state before removing them from cluster. This prevents the scheduler from submitting a job to a node that is being terminated.
sqswatcher: Slurm - Fix host removal


28 Feb 13:47

demartinofra

AWS ParallelCluster 2.2.1

We're excited to announce the release of AWS ParallelCluster Node 2.2.1.

This is associated with AWS ParallelCluster v2.2.1.

Features

Support for FSx Lustre with CentOS 7
Check AWS EC2 account limits before starting cluster creation
Allow users to force job deletion with SGE scheduler

Changes

Set default value to compute for placement_group option
pcluster ssh: use private IP when the public one is not available
pcluster ssh: now also works when the stack is not completed, as long as the master IP is available

Bugfixes

awsbsub: fix file upload with absolute path
pcluster ssh: fix issue that was preventing the command from working correctly when stack status is UPDATE_ROLLBACK_COMPLETE
Fix block device conversion to correctly attach EBS NVMe volumes
Wait for Torque scheduler initialization before completing master node setup
pcluster version: now also works when no ParallelCluster config is present
Improve nodewatcher daemon logic to detect whether an SGE compute node has running jobs


08 Jan 14:38

AWS ParallelCluster 2.1.1

We're excited to announce the release of AWS ParallelCluster Node 2.1.1.

This is associated with AWS ParallelCluster v2.1.1.

Features

Support for AWS Beijing Region (cn-north-1) and Ningxia Region (cn-northwest-1)

Bugfixes

No longer schedule jobs on compute nodes that are terminating


18 Dec 01:14

sean-smith

AWS ParallelCluster v2.1.0

We're excited to announce the release of AWS ParallelCluster Node 2.1.0!

This is associated with AWS ParallelCluster v2.1.0.

Features

Support for Elastic File System (EFS)
AWS Batch Multinode Parallel support
Support for RAID 0 and 1 EBS Volumes
Support for AWS Stockholm Region (eu-north-1)

Bugfixes

No longer schedule jobs on compute nodes that are terminating

Support

Need help / have a feature request?
AWS Support: https://console.aws.amazon.com/support/home
ParallelCluster Issues tracker on GitHub: https://github.com/aws/aws-parallelcluster (note: we've moved node issues to the main package; please create new issues there)
The HPC Forum on the AWS Forums page: https://forums.aws.amazon.com/forum.jspa?forumID=192


20 Nov 00:40

sean-smith

AWS ParallelCluster v2.0.2

We're excited to announce the release of AWS ParallelCluster Node 2.0.2!

This is associated with AWS ParallelCluster v2.0.2.

Features

Support for new GovCloud region us-gov-east-1

Bugfixes

Fix regression with shared_dir parameter in the cluster configuration section.
Fixed issue with jq that prevented customers from using extra_json
Fixed issue with awscli version on ubuntu1404


20 Nov 00:38

sean-smith

AWS ParallelCluster v2.0.0

We're excited to announce the release of AWS ParallelCluster Node 2.0.0!

This is associated with AWS ParallelCluster v2.0.0.

Features

AWS Batch Integration
Support for creating custom AMIs
Multiple EBS Volume support


26 Oct 00:10

sean-smith

CfnCluster v1.6.0

Bug fixes/minor improvements:

Changed scaling functionality to scale up and scale down faster.


30 Aug 15:05

sean-smith

CfnCluster v1.5.4

Bug fixes/minor improvements:

Upgraded Boto2 to Boto3 package.
