Memory Problem in OpenMP with MPI

Julio · 08-17-2017

Dear Community,

I am using a hybrid formulation (MPI + OpenMP). When I use MPI alone, with any number of processes, the computation runs well and so does the output subroutine (where I print all my data). I use a collection of gathers for my output/writing; not very efficient, I know!

However, when I use OpenMP + MPI, the output task of my code stops working with the message OOM (Out of Memory). The problem persists even if I set setenv OMP_NUM_THREADS 1.

I have also used setenv OMP_NUM_THREADS 2, and inside the code I have tried changing the number of threads with OMP_SET_NUM_THREADS before starting the output subroutine, but it still did not work.

I was wondering if there is an option or instruction to free the memory held by the threads. Even using OMP_SET_NUM_THREADS(1) did not work.

What do you recommend?

jimdempseyatthecove · 08-17-2017

First, does your program run directly? 1 thread, 2 threads, ...

OpenMP threads tend to default to having a relatively small stack size (1MB to 4MB) though this should not result in OOM. Can you tell us more about your environment?

Are you targeting a 32-bit or 64-bit application? Also: number of hardware threads per node, number of nodes available, number of OpenMP threads requested, linker options, etc.
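
Most of the details asked for here can be collected with a few shell commands on a compute node. The following is only a sketch, assuming a POSIX shell on Linux (under csh/tcsh, setenv NAME value replaces export NAME=value). The report_env helper and the values 2 and 16M are illustrative and not taken from the thread; OMP_NUM_THREADS and OMP_STACKSIZE are the standard OpenMP environment variables for the thread count and the per-thread stack size.

```shell
#!/bin/sh
# Sketch: collect basic environment details on a compute node.
# Assumption: POSIX shell on Linux; values below are illustrative only.

export OMP_NUM_THREADS=2     # standard name ends in S
export OMP_STACKSIZE=16M     # stack size for threads created by the OpenMP runtime

# report_env is a hypothetical helper name used only for this example.
report_env() {
    echo "word size: $(getconf LONG_BIT)-bit"
    echo "hardware threads online: $(getconf _NPROCESSORS_ONLN)"
    echo "stack limit (ulimit -s): $(ulimit -s)"
    echo "OMP_NUM_THREADS: ${OMP_NUM_THREADS:-unset}"
    echo "OMP_STACKSIZE: ${OMP_STACKSIZE:-unset}"
}

report_env
```

Running this inside the batch script, rather than on the login node, shows the settings the job actually sees.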

Jim Dempsey

Julio · 08-17-2017

Hi Jim, thanks for your kindness!!!

OK, each node has 16 cores: 2 sockets per node and 8 cores per socket. I am using SLURM and have not changed the default script used to build and submit jobs. I could use the task invocation controls, but I am not sure that would help; more info: https://slurm.schedmd.com/mc_support.html#srun_ntasks.

The implementation runs perfectly regardless of the number of threads; the problem arises in the output section. That is why I wanted to know if there is an instruction to free the memory that the threads allocate.

The cluster specifications are these: the system consists of one head node for remote login and approximately 4 terabytes (4 × 10¹² bytes) of memory, 30 terabytes of disk space, 6 CPU nodes with 32 eight-core Intel processors giving a total of 256 cores, plus 2 CPU/GPGPU nodes with a 10-core Intel processor and 4 K40 Tesla GPU accelerators, giving approximately 12.4 TFlops performance. The system is 64-bit.

I tried setenv OMP_NUM_THREADS 1 and the problem persists. The only way I can get my runs to complete successfully is to compile without the -openmp flag, because even with one thread the problem pops up.
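
One way to check whether the OpenMP build really holds more memory than the MPI-only build when the output routine starts is to compare the kernel's per-process memory counters for one rank of each build. This is a minimal sketch, assuming a Linux /proc filesystem; mem_report is a hypothetical helper name, and the sacct line in the comment assumes SLURM accounting is enabled on the cluster.

```shell
#!/bin/sh
# Sketch: print memory counters for a process from /proc (Linux only).
# VmHWM is the peak resident set size; VmRSS is the current one.

# mem_report is a hypothetical helper name used only for this example.
mem_report() {
    pid="$1"
    if [ -r "/proc/$pid/status" ]; then
        grep -E '^(VmPeak|VmSize|VmHWM|VmRSS|Threads):' "/proc/$pid/status"
    else
        echo "no /proc/$pid/status (not Linux, or the process has exited)"
    fi
}

mem_report "$$"    # this shell; pass the PID of an MPI rank instead

# After a job ends, SLURM accounting (if enabled) can show the same peak:
# sacct -j <jobid> --format=JobID,MaxRSS,ReqMem,State
```

If the peak resident size is close to the memory requested for the job, the OOM message is more likely a real memory limit than a thread-count issue.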

Thanks

jimdempseyatthecove · 08-18-2017

Many years ago (maybe 11), I had a simulation program that presented the OOM error condition. After a lengthy investigation that did not resolve the problem, I did a little more research into the triggers for this message. One cause is that your application runs out of memory (as the error message implies). This is a misnomer, because it typically means you ran out of page file capacity. The second reason for the OOM service to kill a job is a significantly long compute section, which makes your application appear hung to the OOM service. For a fix, use Google and search for:

oom killer linux

Look at either the configuration information or how to create exclusions.
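
As a starting point for that search, the Linux kernel exposes the OOM killer's view of each process under /proc. The sketch below only reads those values; it assumes a Linux system, and oom_info is a hypothetical helper name. oom_score_adj ranges from -1000 to 1000, and lowering it (making a process less likely to be killed) normally requires administrator privileges.

```shell
#!/bin/sh
# Sketch: show how the Linux OOM killer currently rates a process.
# Assumption: Linux /proc filesystem; the files are absent elsewhere.

# oom_info is a hypothetical helper name used only for this example.
oom_info() {
    pid="$1"
    for f in oom_score oom_score_adj; do
        if [ -r "/proc/$pid/$f" ]; then
            echo "$f: $(cat "/proc/$pid/$f")"
        else
            echo "$f: unavailable"
        fi
    done
}

oom_info "$$"    # this shell; pass the PID of an MPI rank instead

# Kernel log entries left by the OOM killer (may need admin rights):
# dmesg | grep -i -E 'out of memory|killed process'
```

A higher oom_score means the process is a more likely candidate when the kernel has to reclaim memory.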

Note, your system admin may have to get involved with this.

Jim Dempsey

Julio · 08-18-2017

Thanks Jim, I appreciate your help!!