Running the Comet code on Aurora with different particle structures, I observed the following:
- When using the `SCS` particle structure, the code crashed at the beginning of the simulation. Since `chunk_widths` appears in the log file, I suspect it crashed during construction of the `SCS` particle structure.
NUM_OF_NODES= 1 TOTAL_NUM_RANKS= 12 RANKS_PER_NODE= 12 THREADS_PER_RANK= 1
Successfully created directory for writing field results on rank 0
Successfully created directory for writing field results on rank 0
Successfully created directory for writing field results on rank 0
Successfully created directory for writing field results on rank 0
terminate called after throwing an instance of 'std::runtime_error'
terminate called after throwing an instance of 'std::runtime_error'
terminate called after throwing an instance of 'std::runtime_error'
terminate called after throwing an instance of 'std::runtime_error'
terminate called after throwing an instance of 'std::runtime_error'
terminate called after throwing an instance of 'std::runtime_error'
terminate called after throwing an instance of 'std::runtime_error'
terminate called after throwing an instance of 'std::runtime_error'
terminate called after throwing an instance of 'std::runtime_error'
terminate called after throwing an instance of 'std::runtime_error'
terminate called after throwing an instance of 'std::runtime_error'
terminate called after throwing an instance of 'std::runtime_error'
what(): Kokkos ERROR: SYCLDeviceUSM memory space failed to allocate 1.678e+07 TiB (label="chunk_widths").
what(): Kokkos ERROR: SYCLDeviceUSM memory space failed to allocate 1.678e+07 TiB (label="chunk_widths").
what(): Kokkos ERROR: SYCLDeviceUSM memory space failed to allocate 1.678e+07 TiB (label="chunk_widths").
what(): Kokkos ERROR: SYCLDeviceUSM memory space failed to allocate 1.678e+07 TiB (label="chunk_widths").
what(): Kokkos ERROR: SYCLDeviceUSM memory space failed to allocate 1.678e+07 TiB (label="chunk_widths").
what(): Kokkos ERROR: SYCLDeviceUSM memory space failed to allocate 1.678e+07 TiB (label="chunk_widths").
what(): Kokkos ERROR: SYCLDeviceUSM memory space failed to allocate 1.678e+07 TiB (label="chunk_widths").
what(): Kokkos ERROR: SYCLDeviceUSM memory space failed to allocate 1.678e+07 TiB (label="chunk_widths").
what(): Kokkos ERROR: SYCLDeviceUSM memory space failed to allocate 1.678e+07 TiB (label="chunk_widths").
what(): Kokkos ERROR: SYCLDeviceUSM memory space failed to allocate 1.678e+07 TiB (label="chunk_widths").
what(): Kokkos ERROR: SYCLDeviceUSM memory space failed to allocate 1.678e+07 TiB (label="chunk_widths").
what(): Kokkos ERROR: SYCLDeviceUSM memory space failed to allocate 1.678e+07 TiB (label="chunk_widths").
x4311c0s3b0n0.hsn.cm.aurora.alcf.anl.gov: rank 2 died from signal 6 and dumped core
- When using the `CabM` or `DPS` particle structure, the code runs fine with fewer than ~20 million particles per GPU, but crashed when each GPU had more than ~20 million particles:
Segmentation fault from GPU at 0xff00003aa2a20000, ctx_id: 1 (CCS) type: 0 (NotPresent), level: 2 (PDP), access: 1 (Write), banned: 1, aborting.
Abort was called at 274 line in file:
/home/ubit/rpmbuild/BUILD/intel-compute-runtime-25.05.32567.18/shared/source/os_interface/linux/drm_neo.cpp
Segmentation fault from GPU at 0xff00003aa2a20000, ctx_id: 1 (CCS) type: 0 (NotPresent), level: 2 (PDP), access: 1 (Write), banned: 1, aborting.
x4502c7s4b0n0.hsn.cm.aurora.alcf.anl.gov: rank 6 died from signal 6 and dumped core
x4502c7s4b0n0.hsn.cm.aurora.alcf.anl.gov: rank 1 died from signal 15
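A note on the `chunk_widths` failure above: the requested size of 1.678e+07 TiB is almost exactly 2^64 bytes, which is what a small negative byte count looks like after conversion to a 64-bit unsigned size. The sketch below only checks that arithmetic. The element count of -1 and the 4-byte element size are assumptions chosen for illustration, not values taken from PUMIPic, and `as_size_t` and `tib` are hypothetical helper names.

```python
TIB = 2**40  # bytes per TiB


def as_size_t(n: int) -> int:
    """Value of a signed integer after conversion to a 64-bit unsigned size."""
    return n % 2**64


def tib(nbytes: int) -> float:
    """Convert a byte count to TiB."""
    return nbytes / TIB


# Assumption for illustration: an element count of -1 with 4-byte elements.
requested = as_size_t(-1 * 4)

print(f"{tib(requested):.3e} TiB")  # size a wrapped negative count would request
print(f"{tib(2**64):.3e} TiB")      # 2^64 bytes, for comparison
```

Both values format to the same figure as in the Kokkos error message, so one possibility worth checking is whether the chunk count passed to the `chunk_widths` allocation is negative or uninitialized on this backend. This does not confirm the cause.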
Code versions used:
- kokkos 4.6.02: https://github.com/kokkos/kokkos
- omega_h master branch: https://github.com/SCOREC/omega_h, at commit: cbb705a
- EnGPar master branch: https://github.com/SCOREC/EnGPar, at commit: 0dc778f
- Cabana master branch: https://github.com/ECP-copa/Cabana, at commit: d9900ea
- PUMIPic `cz/search_3d_plus_yus_aurora-support` branch: https://github.com/SCOREC/pumi-pic, at commit: 1f654d8.
- Comet `cz/aurora` branch: https://github.com/ComputationalGasDynamicsLab/Comet.git, at commit: 38711a4
Modules used to build:
Currently Loaded Modules:
1) gcc-runtime/13.3.0-ghotoln (H) 11) mpich/opt/develop-git.6037a7a 21) libmd/1.0.4-q6tzwyj (H)
2) gmp/6.3.0-mtokfaw (H) 12) libfabric/1.22.0 22) libbsd/0.12.2-wxndujc (H)
3) mpfr/4.2.1-gkcdl5w (H) 13) cray-pals/1.4.0 23) expat/2.6.4-7j6nhb6 (H)
4) mpc/1.3.1-rdrlvsl (H) 14) cray-libpals/1.4.0 24) gdbm/1.23
5) gcc/13.3.0 15) bzip2/1.0.8 25) python/3.10.14
6) oneapi/release/2025.0.5 16) lz4/1.10.0 26) gdb/15.2
7) libiconv/1.17-jjpb4sl (H) 17) libarchive/3.7.6-brn5xb5 (H) 27) gmake/4.4.1
8) libxml2/2.13.5 18) libmicrohttpd/0.9.50-by7j3p7 (H) 28) cmake/3.30.5
9) hwloc/2.11.3-mpich-level-zero 19) sqlite/3.46.0-w5wc5lh (H)
10) yaksa/0.3-7ks5f26 (H) 20) elfutils/0.186-edvhjaw (H)