Add support for latest-generation Google Cloud machine families and boot disk type configuration #6616
docs: update Google Batch documentation to include bootDiskType option
fix: ensure compatibility with machine families requiring Hyperdisk
test(nf-google): add unit tests for boot disk type configurations
chore: update .gitignore to exclude 'mise.toml'
build: update build-info properties for nextflow module
Signed-off-by: Sofiane Ihaddadene <[email protected]>
christopher-hakkaart
left a comment
I can't comment on the code. But the docs are well written and no objections from me.
jorgee
left a comment
I have done some tests and I suggest not allowing the setting of the boot disk type. It requires a strict instance-boot-disk-type validation, as any instance type has a different set of supported disks.
A user is likely to define a boot disk type without defining the instance type. In that case, jobs get stuck in the scheduled state because the disk type is not supported by the instance selected in GoogleBatchTaskHandler.findBestMachineType.
To have proper boot disk type support, the MachineTypeSelector must be aware of which disk types are supported per instance, and this is not the case in the current PR. If the goal of the PR is just to allow the use of the new instance types, it is better to just set Hyperdisk for those instances that do not support the default disk type.
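To illustrate the kind of validation described above, here is a minimal Groovy sketch. It is not code from this PR: the class name `DiskCompatSketch`, the helper names, and the compatibility table are all hypothetical, and the table entries are placeholders rather than an authoritative list of what each family supports. The idea is only that candidate machine types would need to be filtered by the requested boot disk type before one is selected.

```groovy
// Illustrative sketch only; none of these names exist in the nf-google plugin.
class DiskCompatSketch {

    // Hypothetical table: machine family -> boot disk types it accepts.
    // Entries are placeholders for illustration, not an authoritative list.
    static final Map<String, Set<String>> SUPPORTED_BOOT_DISKS = [
        'e2': ['pd-standard', 'pd-balanced', 'pd-ssd'] as Set,
        'c4': ['hyperdisk-balanced'] as Set
    ]

    // Family is taken as the text before the first '-' (e.g. 'c4' for 'c4-standard-4')
    static String familyOf(String machineType) {
        return machineType.tokenize('-')[0]
    }

    // Keep only the candidates whose family accepts the requested disk type.
    // When no disk type is requested, every candidate is kept.
    static List<String> compatible(List<String> candidates, String bootDiskType) {
        if( !bootDiskType )
            return candidates
        return candidates.findAll { mt ->
            SUPPORTED_BOOT_DISKS[familyOf(mt)]?.contains(bootDiskType)
        }
    }
}

def candidates = ['e2-standard-4', 'c4-standard-4']
assert DiskCompatSketch.compatible(candidates, 'pd-ssd') == ['e2-standard-4']
assert DiskCompatSketch.compatible(candidates, 'hyperdisk-balanced') == ['c4-standard-4']
```

Without a filter of this kind, a machine type could be chosen that cannot boot from the configured disk, which matches the stuck-job behaviour reported above.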
     * Families that only support Hyperdisk (no standard PD)
     * LAST UPDATE 2024-05-22
     */
    private static final List<String> GENERAL_PURPOSE_FAMILIES = ['c4-*', 'c4a-*', 'c4d-*', 'n4-*', 'n4d-*', 'n4a-*']
Maybe rename as HYPERDISK_ONLY_FAMILIES.
Thanks @jorgee for the feedback and testing. I agree—adding the boot disk type configuration adds too much complexity right now regarding validation.
I will remove the bootDiskType feature from this PR and stick to just enabling the new instance types (automatically setting Hyperdisk where needed). I'll handle the generic boot disk configuration in a separate, dedicated PR later to ensure the MachineTypeSelector logic is robust.
I will also rename the constant to HYPERDISK_ONLY_FAMILIES as suggested, as it is much more descriptive.
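The approach agreed in this thread (detect Hyperdisk-only families and set the disk type automatically) could look roughly like the Groovy sketch below. It is only an illustration under stated assumptions: the family patterns are copied from the excerpt reviewed above, `isHyperdiskOnly()` is named after the method listed in the PR description, and the `BootDiskSketch` class and `bootDiskTypeFor()` helper are hypothetical. The actual implementation in the PR may differ.

```groovy
// Illustrative sketch only; not the plugin's actual implementation.
class BootDiskSketch {

    // Family patterns taken from the reviewed excerpt, under the suggested new name
    static final List<String> HYPERDISK_ONLY_FAMILIES = ['c4-*', 'c4a-*', 'c4d-*', 'n4-*', 'n4d-*', 'n4a-*']

    // True when the machine type starts with one of the family prefixes.
    // Each pattern is a prefix followed by '*', so the '*' is dropped before matching.
    static boolean isHyperdiskOnly(String machineType) {
        if( !machineType )
            return false
        return HYPERDISK_ONLY_FAMILIES.any { pattern ->
            machineType.startsWith(pattern - '*')
        }
    }

    // Hypothetical helper: returns 'hyperdisk-balanced' for Hyperdisk-only families,
    // or null to leave the boot disk type unset so that the default applies.
    static String bootDiskTypeFor(String machineType) {
        return isHyperdiskOnly(machineType) ? 'hyperdisk-balanced' : null
    }
}

assert BootDiskSketch.bootDiskTypeFor('c4-standard-4') == 'hyperdisk-balanced'
assert BootDiskSketch.bootDiskTypeFor('e2-medium') == null
```

Because the disk type is derived from the machine type, users do not have to configure it, which avoids the validation problem raised in the review.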
Add support for latest-generation Google Cloud machine families and boot disk type configuration
Problem
Two critical limitations prevented full utilization of Google Cloud Batch capabilities:
1. Missing support for latest-generation machine families
Google Cloud has introduced several new general-purpose machine families that are not currently supported by Nextflow:
These families offer significant improvements in:
Without this support, users cannot leverage:
2. Inability to specify boot disk type
Currently, Nextflow only allows configuring the boot disk size via google.batch.bootDiskSize, but not the disk type. This creates several issues:
Compatibility problems:
- pd-balanced disks (the Google Cloud default)
- hyperdisk-balanced or pd-ssd
Performance optimization:
- pd-ssd (higher IOPS)
- pd-standard (lower cost)
Reference: Google Cloud Disk Types Documentation
Solution
This PR addresses both issues with a comprehensive solution:
1. Add support for latest-generation machine families
Machine type recognition:
- GENERAL_PURPOSE_FAMILIES
- -lssd suffix
Testing:
2. Add bootDiskType configuration option
New configuration parameter:
    google {
        project = 'your-project-id'
        location = 'us-central1'
        batch {
            bootDiskSize = '50 GB'
            bootDiskType = 'hyperdisk-balanced' // NEW: Specify disk type
        }
    }
Supported disk types (Google Cloud documentation):
- pd-standard: Standard persistent disk (HDD, lowest cost)
- pd-balanced: Balanced persistent disk (SSD, default for most instances)
- pd-ssd: SSD persistent disk (highest performance)
- hyperdisk-balanced: Hyperdisk balanced (required for C4/N4 families)
Key features:
- bootDiskImage when both are specified
Changes
Core Implementation
GoogleBatchMachineTypeSelector.groovy
- GENERAL_PURPOSE_FAMILIES constant for C4/N4 family detection
- isHyperdiskOnly() method to identify families requiring Hyperdisk
- findValidLocalSSDSize() to handle C4/C4D local SSD variants
BatchConfig.groovy
- bootDiskType field with @ConfigOption annotation
GoogleBatchTaskHandler.groovy
- bootDiskType when specified
- bootDiskType is used with instance templates
Tests
GoogleBatchMachineTypeSelectorTest.groovy
- isHyperdiskOnly() behavior for all new families
BatchConfigTest.groovy
- bootDiskType parsing from configuration
- bootDiskType combined with other boot disk options
GoogleBatchTaskHandlerTest.groovy
Documentation
docs/reference/config.md
- google.batch.bootDiskType to configuration reference
docs/google.md
Compatibility
Testing
All tests pass:
- bootDiskType configuration
Test coverage:
Use Cases Enabled
1. Using latest-generation machines:
    process myTask {
        machineType 'c4-standard-4' // Now works!
        memory '16 GB'

        script:
        """
        # High-performance workload on latest Intel Emerald Rapids
        """
    }
2. Optimizing for cost:
3. Optimizing for performance:
4. Using new machine families:
    process highPerf {
        machineType 'c4a-standard-8' // Google Axion (Arm)

        script:
        """
        # Requires hyperdisk-balanced or pd-ssd
        """
    }

    google.batch.bootDiskType = 'hyperdisk-balanced' // Compatible with C4A
References