ne0ARCTICne30x4 test fails with 1280 tasks and fixed CTSM, because of too few processors #1332

@ekluzek

What happened?

I was testing some CAM cases with a fixed CTSM version for ne0ARCTICne30x4 and noticed this test:

SMS_D_Ln9_P1280x1.ne0ARCTICne30x4_ne0ARCTICne30x4_mt12.FHIST.derecho_intel.cam-outfrq9s

still fails. It looks like it has too few processors: increasing the task count to 5120, or using the default PE layout for this grid, does work.

What are the steps to reproduce the bug?

Use what will be ctsm5.3.059 (ESCOMP/CTSM#2950) in cesm3_0_alpha07a, and run the above test.

It looks like a couple of tests in the testlist use 1280 tasks; they just need to be bumped up to a higher task count.
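As a minimal sketch of what "bumping up" means at the test-name level: the task count appears to be carried by the `_P<tasks>x<threads>` modifier in the first dot-separated field of the name. That reading is inferred only from the test names quoted in this issue, and the helper names `pe_layout` and `with_tasks` are hypothetical. The real fix would be an edit to the CAM testlist, not a script like this.

```python
import re

# Assumption: the PE layout is a "_P<tasks>x<threads>" modifier in the
# first dot-separated field of the test name, as in the names in this issue.
_PE_MODIFIER = re.compile(r"_P(\d+)x(\d+)")


def pe_layout(test_name):
    """Return (tasks, threads) from the test name, or None if no _P modifier is present."""
    prefix = test_name.split(".", 1)[0]
    match = _PE_MODIFIER.search(prefix)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def with_tasks(test_name, new_tasks):
    """Return the test name with the task count replaced; names without a _P modifier are unchanged."""
    prefix, sep, rest = test_name.partition(".")
    new_prefix = _PE_MODIFIER.sub(
        lambda m: f"_P{new_tasks}x{m.group(2)}", prefix, count=1
    )
    return new_prefix + sep + rest


failing = (
    "SMS_D_Ln9_P1280x1.ne0ARCTICne30x4_ne0ARCTICne30x4_mt12."
    "FHIST.derecho_intel.cam-outfrq9s"
)
print(pe_layout(failing))
print(with_tasks(failing, 5120))
```

Applied to the failing test, `with_tasks(failing, 5120)` yields the `_P5120x1` variant of the same test name.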

What CAM tag were you using?

cam6_4_089

What machine were you running CAM on?

CISL machine (e.g. cheyenne)

What compiler were you using?

Intel

Path to a case directory, if applicable

No response

Will you be addressing this bug yourself?

Any CAM SE can do this

Extra info

These two tests both work:

SMS_D_Ln9.ne0ARCTICne30x4_ne0ARCTICne30x4_mt12.FHIST.derecho_intel.cam-outfrq9s
SMS_D_Ln9_P5120x1.ne0ARCTICne30x4_ne0ARCTICne30x4_mt12.FHIST.derecho_intel.cam-outfrq9s

However, the one with 1280 tasks fails with the following in the CESM log files. It looks like it's failing in an ESMF regrid operation called from lnd_set_decomp_and_domain.F90 in CTSM, so it likely just ran out of memory.

cesm.log:

dec0225.hsn.de.hpc.ucar.edu 0:  (t_initf)       profile_single_file=      F
dec0225.hsn.de.hpc.ucar.edu 0:  (t_initf)       profile_global_stats=     T
dec0225.hsn.de.hpc.ucar.edu 0:  (t_initf)       profile_ovhd_measurement= F
dec0225.hsn.de.hpc.ucar.edu 0:  (t_initf)       profile_add_detail=       F
dec0225.hsn.de.hpc.ucar.edu 0:  (t_initf)       profile_papi_enable=      F
dec0233.hsn.de.hpc.ucar.edu 463: forrtl: error (65): floating invalid
dec0233.hsn.de.hpc.ucar.edu 463: Image              PC                Routine            Line        Source             
dec0233.hsn.de.hpc.ucar.edu 463: libpthread-2.31.s  00001489D4B598C0  Unknown               Unknown  Unknown
dec0233.hsn.de.hpc.ucar.edu 463: libesmf.so         00001489DC2C539D  exec_psssDstRra<d        6567  ESMCI_DELayout.C
dec0233.hsn.de.hpc.ucar.edu 463: libesmf.so         00001489DC2AAC5B  psssDstRra<double        6539  ESMCI_DELayout.C
dec0233.hsn.de.hpc.ucar.edu 463: libesmf.so         00001489DC2A6271  psssDstRra<double        6503  ESMCI_DELayout.C
dec0233.hsn.de.hpc.ucar.edu 463: libesmf.so         00001489DC2913CE  psssDstRra<double        6463  ESMCI_DELayout.C
dec0233.hsn.de.hpc.ucar.edu 463: libesmf.so         00001489DC258D2B  psssDstRra<int, i        6423  ESMCI_DELayout.C
dec0233.hsn.de.hpc.ucar.edu 463: libesmf.so         00001489DC247802  exec                     4842  ESMCI_DELayout.C
dec0233.hsn.de.hpc.ucar.edu 463: libesmf.so         00001489DC246955  exec                     4410  ESMCI_DELayout.C
dec0233.hsn.de.hpc.ucar.edu 463: libesmf.so         00001489DC0F989E  sparseMatMulStore       11399  ESMCI_Array.C
dec0233.hsn.de.hpc.ucar.edu 463: libesmf.so         00001489DC0EDEC1  tSparseMatMulStor        9603  ESMCI_Array.C
dec0233.hsn.de.hpc.ucar.edu 463: libesmf.so         00001489DC0EAD83  sparseMatMulStore        8896  ESMCI_Array.C
dec0233.hsn.de.hpc.ucar.edu 463: libesmf.so         00001489DC1B536F  c_esmc_arraysmmst        1105  ESMCI_Array_F.C
dec0233.hsn.de.hpc.ucar.edu 463: libesmf.so         00001489DC7E0402  ESMCI_regrid_crea         639  ESMCI_Mesh_Regrid_Glue.C
dec0233.hsn.de.hpc.ucar.edu 463: libesmf.so         00001489DC7280A4  regrid_create            1658  ESMCI_MeshCap.C
dec0233.hsn.de.hpc.ucar.edu 463: libesmf.so         00001489DC84A0E6  c_esmc_regrid_cre          93  ESMCI_Regrid_F.C
dec0233.hsn.de.hpc.ucar.edu 463: libesmf.so         00001489DDA2E211  c_esmc_regrid_cre           0  ESMF_Regrid.F90
dec0233.hsn.de.hpc.ucar.edu 463: libesmf.so         00001489DDA2C69B  esmf_regridstore          360  ESMF_Regrid.F90
dec0233.hsn.de.hpc.ucar.edu 463: libesmf.so         00001489DD32A747  esmf_fieldregrids        1238  ESMF_FieldRegrid.F90
dec0233.hsn.de.hpc.ucar.edu 463: cesm.exe           00000000080CE668  lnd_set_decomp_an         497  lnd_set_decomp_and_domain.F90
dec0233.hsn.de.hpc.ucar.edu 463: cesm.exe           00000000080C6E22  lnd_set_decomp_an         128  lnd_set_decomp_and_domain.F90
dec0233.hsn.de.hpc.ucar.edu 463: cesm.exe           0000000008098E13  lnd_comp_nuopc_mp         645  lnd_comp_nuopc.F90
dec0233.hsn.de.hpc.ucar.edu 463: libesmf.so         00001489DC450219  callVFuncPtr             2187  ESMCI_FTable.C
dec0233.hsn.de.hpc.ucar.edu 475: forrtl: error (65): floating invalid

lnd.log:

 Input land mesh file /glade/campaign/cesm/cesmdata/inputdata/share/meshes/ne0ARCTICne30x4_ESMFmesh_c20200727.nc
 Input mask mesh file /glade/campaign/cesm/cesmdata/inputdata/share/meshes/tx0.1v2_ESMFmesh_cd5_c20210105.nc
 Obtaining land mask and fraction from mask file /glade/campaign/cesm/cesmdata/inputdata/share/meshes/tx0.1v2_ESMFmesh_cd5_c20210105.nc
 
 Attempting to read global dimensions from surface dataset
(GETFIL): attempting to find local file surfdata_ne0np4.ARCTIC.ne30x4_hist_1979_78pfts_c240908.nc
(GETFIL): using /glade/campaign/cesm/cesmdata/inputdata/lnd/clm2/surfdata_esmf/ctsm5.3.0/surfdata_ne0np4.ARCTIC.ne30x4_hist_1979_78pfts_c240908.nc
global ni,nj =   117398         1
model grid is not 2-dimensional

Computing land fraction and land mask by mapping mask from mesh_mask file
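For a rough sense of scale, the `global ni,nj` line in lnd.log reports 117398 grid cells. The sketch below computes how many of those cells each task would hold if they were split evenly across all tasks. That even split is a simplifying assumption, not the decomposition the land model actually uses, and the function name `cells_per_task` is hypothetical. It says nothing about the size of the mask mesh being mapped, which the log does not report and which may matter more for the memory used by the regrid.

```python
def cells_per_task(n_cells, n_tasks):
    """Return (min, max) cells per task for an even split of n_cells over n_tasks."""
    base, remainder = divmod(n_cells, n_tasks)
    # If the split is not exact, some tasks carry one extra cell.
    return base, base + (1 if remainder else 0)


N_CELLS = 117398  # from "global ni,nj = 117398 1" in lnd.log

for n_tasks in (1280, 5120):
    low, high = cells_per_task(N_CELLS, n_tasks)
    print(f"{n_tasks} tasks: {low} to {high} cells per task")
```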

Labels

bug

Status

Done