Solved: MKL's cluster FFT function gives out-of-memory error for large boxes

JR · ‎02-04-2023

I have written a very simple program that computes the 3D FFT of real gaussian noise in-place using MKL's Cluster FFT functions to do the computation concurrently, distributed over multiple processes in an HPC system.

PRECISION      = DFTI_DOUBLE
FORWARD DOMAIN = DFTI_REAL
DIMENSION      = 3
NUM_PROCESSES  = 2016
LATTICE SIZE   = 4032 x 4096 x 4096
LOCAL SIZE     =    2 x 4096 x 4098

It works fine for small to fairly large lattices. The problem appears when the lattice has 4032 × 4096 × 4096 sites (double-precision floating-point numbers). I distribute it over 21 nodes with 96 cores each, for a total of 2016 processes, so each process handles 2 × 4096 × 2×(4096/2+1) sites. That corresponds to about 256 MiB of data for a local box (plus another 256 MiB for the DFTI_WORKSPACE). I provide a maximum of 3800 MiB per core via SLURM. During the in-place FFT computation step, the program then crashes with a number of out-of-memory handler error messages:

slurmstepd: error: Detected 2 oom-kill event(s) in StepId=39148997.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: hpc-node666: task 1897: Out Of Memory
[hpc-node144:985710:0:985710] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x8)
==== backtrace (tid: 985715) ====
 0 0x0000000000012cf0 __funlockfile()  :0
 1 0x0000000000011b72 rxm_post_recv()  osd.c:0
 2 0x00000000000128f6 rxm_replace_rx_buf()  rxm_cq.c:0
 3 0x00000000000129a9 rxm_handle_rndv()  rxm_cq.c:0
 4 0x000000000000c646 rxm_ep_trecv_common()  rxm_ep.c:0
 5 0x000000000000c766 rxm_ep_trecv()  rxm_ep.c:0
 6 0x0000000000404320 fi_trecv()  /p/pdsd/scratch/Uploads/IMPI/other/software/libfabric/linux/v1.9.0/include/rdma/fi_tagged.h:91
 7 0x0000000000404320 MPIDI_OFI_do_irecv()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/include/../ofi/intel/ofi_recv.h:127
 8 0x0000000000404320 MPIDI_NM_mpi_irecv()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/include/../ofi/ofi_recv.h:377
 9 0x0000000000404320 MPIDI_irecv_handoff()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_recv.h:81
10 0x0000000000404320 MPIDI_irecv_unsafe()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_recv.h:238
11 0x0000000000404320 MPIDI_irecv_safe()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_recv.h:558
12 0x0000000000404320 MPID_Irecv()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_recv.h:791
13 0x0000000000404320 MPIC_Irecv()  /build/impi/_buildspace/release/../../src/mpi/coll/helper_fns.c:625
14 0x000000000014133b MPIR_Alltoallv_intra_scattered_impl()  /build/impi/_buildspace/release/../../src/mpi/coll/intel/alltoallv/alltoallv_intra_scattered.c:186
15 0x00000000001927a8 MPIDI_NM_mpi_alltoallv()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/include/../ofi/intel/ofi_coll.h:643
16 0x00000000001927a8 MPIDI_Alltoallv_intra_composition_alpha()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_impl.h:1794
17 0x00000000001927a8 MPID_Alltoallv_invoke()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_select_utils.c:2276
18 0x00000000001927a8 MPIDI_coll_invoke()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_select_utils.c:3335
19 0x00000000001717ec MPIDI_coll_select()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_globals_default.c:130
20 0x00000000002b44df MPID_Alltoallv()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll.h:240
21 0x0000000000142405 PMPI_Alltoallv()  /build/impi/_buildspace/release/../../src/mpi/coll/alltoallv/alltoallv.c:351
22 0x000000000002c3c2 MKLMPI_Alltoallv()  ???:0
23 0x000000000001d615 computend_simple_fwd()  bkd_nd_simple.c:0
24 0x000000000001c9ed compute_fwd()  bkd_nd_simple.c:0
25 0x000000000000253f DftiComputeForwardDM()  ???:0
26 0x000000000040230c main()  ???:0
27 0x000000000003ad85 __libc_start_main()  ???:0
28 0x00000000004018de _start()  ???:0
=================================
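
As a rough cross-check of the figures quoted above, the per-process footprint can be recomputed from the lattice dimensions. The following is a minimal sketch in C and is not taken from the attached reproducer. The helper names (`padded_last_dim`, `local_doubles`, `bytes_to_mib`) are hypothetical, and the sketch assumes that the lattice is split evenly along the first dimension and that the last dimension is padded to 2×(NZ/2+1) reals for the in-place real transform, as described in the post.

```c
#include <assert.h>
#include <stdint.h>

/* Padded length of the last dimension for an in-place real transform:
 * nz real values must leave room for nz/2 + 1 complex values. */
uint64_t padded_last_dim(uint64_t nz)
{
    return 2 * (nz / 2 + 1);
}

/* Number of doubles stored by one process when the first dimension is
 * split evenly over nprocs processes (assumption: nprocs divides nx). */
uint64_t local_doubles(uint64_t nx, uint64_t ny, uint64_t nz, uint64_t nprocs)
{
    assert(nprocs != 0 && nx % nprocs == 0);
    return (nx / nprocs) * ny * padded_last_dim(nz);
}

/* Convert a byte count to MiB (1 MiB = 1024 * 1024 bytes). */
double bytes_to_mib(uint64_t bytes)
{
    return (double)bytes / (1024.0 * 1024.0);
}
```

For NX = 4032, NY = 4096, NZ = 4096 and 2016 processes, `local_doubles` gives 33,570,816 doubles, which is 256.125 MiB per process. A workspace of the same size would bring the total to roughly 512 MiB, well below the 3800 MiB requested per core.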

Is there an upper limit to how large a box I can simulate for FFT computations? If not, what am I doing wrong? Kindly advise. Thank you.

JR · ‎02-06-2023

@SantoshY_Intel Can you or somebody else please provide some hints as to what I can do to resolve this issue?

To eliminate the possibility that this is an artefact of the lattice size not being a power of two, I tried doing a forward transform on a 4096^3 lattice distributed over 2048 processes. The crash and backtrace I get are exactly the same. Something is going wrong when I try to compute very large FFTs. Please help. I am clueless. Thank you.

ShanmukhS_Intel · ‎02-06-2023

Hi Jillur Rahman,

Thanks for posting on Intel Communities.

It would be a great help if you could share a sample reproducer and the steps to reproduce (if any). This will help us look into your issue and assist you further.

Best Regards,

Shanmukh.SS

JR · ‎02-06-2023

Please find attached the minimal working example which triggers this bug. I have tested this with Intel Parallel Studio XE 2022 Update 1 Cluster Edition (oneAPI) + the icx compiler as well as Intel Parallel Studio XE 2020 Update 4 Cluster Edition + the icc compiler. Both versions result in the crash when the lattice size is as large as 4096^3. The HPC system runs on RedHat 8.7, if that matters.

I appreciate any advice you can provide. Thank you very much.

ShanmukhS_Intel · ‎02-07-2023

Hi Jillur,

Thanks for sharing the sample reproducer.

We tried compiling the sample reproducer you shared. However, we ran into a few issues, as shown below.

mpiicc -cc=icx -c -O3 -qmkl -Wall -Wextra -Wno-unused-variable -Wno-unused-but-set-variable -Wno-sometimes-uninitialized -D NX=4032 -D NY=4096 -D NZ=4096 dft_gauss.c -o dft_gauss.o

mpiicc -cc=icx dft_gauss.o -L/global/panfs01/admin/opt/intel/oneAPI/2022.3.1.17310/mkl/2022.2.1/lib/intel64 -lmkl_cdft_core -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core -lmkl_blacs_intelmpi_ilp64 -lpthread -lm -ldl -o build

sbatch deploy.slurm build

sbatch: environment addon enabled

sbatch: error: Batch job submission failed: No partition specified or system default partition

make: *** [Makefile:21: deploy] Error 1

[sajjashx@eln8 gaussian]$ ls

build deploy.slurm dft_gauss.c dft_gauss.o Makefile

Could you let us know if we did anything incorrectly? In addition, could you please share the steps to reproduce?

Best Regards,

Shanmukh.SS

JR · ‎02-08-2023

Hi Shanmukh,

You would need to modify the SLURM script according to the specifics of your HPC system. Looks like you need to specify a `--partition` parameter. The command line tool `sinfo` can help you find what partitions are available in your system.

My HPC system has 96 cores per node, and I have allocated 21 exclusive nodes, making a total of 2016 cores or processes. Note that 2016 divides 4032. If you do not have 96 cores available per node, you may want to remove the `--exclusive` request. If you change the number of processes requested, please make sure that it divides the first dimension.
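
The divisibility requirement mentioned above can be checked before submitting a job. The following is a minimal sketch in C under stated assumptions: the function names `slab_thickness` and `largest_valid_node_count` are hypothetical and do not come from the reproducer, and the sketch assumes one MPI process per core and a decomposition along the first dimension only.

```c
#include <assert.h>
#include <stdint.h>

/* Planes of the first dimension handled by each process,
 * or 0 if nprocs does not divide nx evenly. */
uint64_t slab_thickness(uint64_t nx, uint64_t nprocs)
{
    if (nprocs == 0 || nx % nprocs != 0)
        return 0;
    return nx / nprocs;
}

/* Largest node count not exceeding max_nodes for which
 * nodes * cores_per_node divides nx; returns 0 if there is none. */
uint64_t largest_valid_node_count(uint64_t nx, uint64_t cores_per_node,
                                  uint64_t max_nodes)
{
    assert(cores_per_node > 0);
    for (uint64_t nodes = max_nodes; nodes >= 1; --nodes) {
        if (slab_thickness(nx, nodes * cores_per_node) != 0)
            return nodes;
    }
    return 0;
}
```

With NX = 4032 and 96 cores per node, `slab_thickness(4032, 21 * 96)` returns 2, matching the local size of 2 planes per process. If only 20 nodes were available, `largest_valid_node_count(4032, 96, 20)` would return 14, since 14 × 96 = 1344 divides 4032.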

Other than that, reproducing this bug is straightforward. Calling `make deploy` builds the code and deploys it on the compute nodes. Check the output file with `tail -F job_4096_whatever.out`. After a while, you will see the whole thing crash.

Best,
J. R.

ShanmukhS_Intel · ‎02-15-2023

Hi Jillur Rahman,

Thanks for sharing the inputs. We modified the SLURM parameters and ran the code. We removed the `--exclusive` request as suggested. We got the expected output, shown below.

Computing Forward FFT in-place.

Computing Inverse FFT in-place.

Best Regards,

Shanmukh.SS

ShanmukhS_Intel · ‎02-16-2023

Hi Jillur,

We tried running the code in our environment after changing the configuration, and the job produced the output below.

Running on hosts: eix[006-008,010-018,020-025,262-277]

Running on 34 nodes.

Running on 2448 processors.

Beginning the spectral simulator.

srun: environment addon enabled

Initialising random data that follows a Gaussian probability density function.

Computing Forward FFT in-place.

Computing Inverse FFT in-place.

--> Global RMS error between original and IFFT(FFT(...)) = 1.96e-21.

Maximum error between original and IFFT(FFT(...)) = 3.11e-15.

Kindly let us know if our understanding differs from yours. In addition, could you please share the results you expect?

Best Regards,

Shanmukh.SS

JR · ‎02-17-2023

Hi Shanmukh,

I am not sure what is going wrong in my case. The exact same code produces a bunch of memory errors on my HPC system and does not succeed.

May I ask (0) which version of Intel HPC Toolkit and Compiler you were using, (1) how many cores there are per node in your HPC system, (2) if you have used the --exclusive request, (3) what the maximum available memory per core is, and (4) how much --memory-per-cpu did you request? (5) Moreover, what dimensions did you choose for your lattice?

Best,
J.R.

ShanmukhS_Intel · ‎02-21-2023

Hi Jillur,

We can see that you are using Intel Parallel Studio XE 2020 Update 4, which is an unsupported version according to this page:

https://www.intel.com/content/www/us/en/developer/articles/release-notes/intel-parallel-studio-xe-supported-and-unsupported-product-versions.html

As noted in our previous response, we could not reproduce your issue. It worked fine with the latest version, oneAPI 2023.0, so we recommend that you upgrade. If you still face any issues with the latest version, please get back to us.

Please find the requested information below.

(0) Which version of the Intel HPC Toolkit and compiler were you using?
>> 2023.0.0

(1) How many cores are there per node in your HPC system?
>> 36 cores per socket

(2) Did you use the --exclusive request?
>> No, we removed the --exclusive request.

(3) What is the maximum available memory per core?
>> We haven't used this option.

(4) How much --memory-per-cpu did you request?
>> We haven't used this option.

(5) What dimensions did you choose for your lattice?
>> NX = 4032, NY = 4096, NZ = 4096

Best Regards,

Shanmukh.SS

ShanmukhS_Intel · ‎03-01-2023

Hi Jillur,

Thanks for accepting our solution. If you have any other queries, please post a new query as this thread will no longer be monitored by Intel.

Best Regards,

Shanmukh.SS