Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Code appears to "hang" when using !$OMP PARALLEL block

Ioannis_K_
New Contributor I

Hello everybody,

 

I have a code with multiple DO loops, which has been verified to work fine when built and run on both Windows and Linux. I have tried to simplify the loop operations as much as possible, so that the instructions can be executed/vectorized as efficiently as possible.

 

I am now trying to explicitly introduce OpenMP directives in the code. For instance, I have the following block (please note that this is a piece of the code, not the whole code; all necessary variables are properly defined):

 

!================================================== begin code

 

!$OMP PARALLEL num_threads(nthrea2)
!$OMP DO
      do J = 1,Nvec
         U11(J) = (1.d0+aHHT)*U((LDPMends(1,iel+J)-1)*ndof+1)
     *          - aHHT*Uprev((LDPMends(1,iel+J)-1)*ndof+1)
      end do !J
!$OMP END DO

!$OMP DO
      do J = 1,Nvec
         epsN(iel+J) = nx(iel+J)*U21(J) + ny(iel+J)*U22(J) +
     *                 nz(iel+J)*U23(J)
      end do !J
!$OMP END DO

!$OMP DO
      do J = 1,Nvec
         epsN(iel+J) = epsN(iel+J)-nx(iel+J)*U11(J)-ny(iel+J)*U12(J)-
     *                 nz(iel+J)*U13(J)
      end do !J
!$OMP END DO

!$OMP DO
      do J = 1,Nvec
         epsM(iel+J) = mx(iel+J)*U21(J) + my(iel+J)*U22(J) +
     *                 mz(iel+J)*U23(J)
      end do !J
!$OMP END DO

!$OMP DO
      do J = 1,Nvec
         epsM(iel+J) = epsM(iel+J)-mx(iel+J)*U11(J)-my(iel+J)*U12(J)-
     *                 mz(iel+J)*U13(J)
      end do !J
!$OMP END DO

!$OMP DO
      do J = 1,Nvec
         epsM(iel+J) = epsM(iel+J) - A1(J)*U14(J) - A2(J)*U15(J) -
     *                 A3(J)*U16(J)
      end do !J
!$OMP END DO
!$OMP END PARALLEL

 

!===================================================== end code

 

When I build this code in Windows, it runs without any issue (for instance, when using 10 threads).

 

However, when I try to run a Linux build with 128 available threads, the program appears to "hang" and does not produce any output. I also tried running after commenting out all the !$OMP DO directives, but I still had the same issue. Once I commented out both the !$OMP PARALLEL and the !$OMP DO directives, the code ran normally in the Linux build.

 

I wanted to ask if there is anything in the form of my code that could lead to problems.

In case it matters, for my Linux run I set the environment variable OMP_PROC_BIND to TRUE and the environment variable OMP_PLACES to cores before running.

 

Thanks in advance for any help/advice.
Ron_Green
Moderator

Applications don't always scale well with threading.  Oftentimes too many threads completely trash your performance.  In your case, every variable reference is shared.

So start small on Linux.  Run it at 2, 4, 8, and then 10 threads.  How does that compare to Windows?

It is really rare for an application to scale linearly to 128 threads, with the exception of Monte Carlo algorithms.  Plot out the runtimes at 2, 4, 8, and 10 threads.  Any sign that the scaling is hitting an elbow?  If not, keep going and see where scaling just does not pay off any longer.

 

Also, what is the value of 'Nvec'?  If you divide Nvec by #threads, how big is the iteration set (the block or chunk of work) for each thread?

Nvec should be greater than roughly 32*#threads to make it worthwhile.  Roughly.  It depends a lot on memory accesses.
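For example, a minimal timing harness to capture those runtimes might look like the sketch below (the loop body is just a stand-in for your real region; omp_get_wtime is the standard OpenMP wall-clock timer):

      program scale_test
      use omp_lib
      implicit none
      integer, parameter :: Nvec = 2048
      integer :: J
      double precision :: t0, t1
      double precision :: A(Nvec), B(Nvec)

      A = 1.d0
      t0 = omp_get_wtime()
!$OMP PARALLEL
!$OMP DO
      do J = 1,Nvec
         B(J) = 2.d0*A(J)       ! stand-in for the real loop body
      end do !J
!$OMP END DO
!$OMP END PARALLEL
      t1 = omp_get_wtime()
      print *, 'threads =', omp_get_max_threads(), ' time =', t1-t0
      print *, B(1)      ! referenced so the loop is not optimized away
      end program scale_test

Run it with OMP_NUM_THREADS set to 2, 4, 8, and so on, and plot the times.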

Ioannis_K_
New Contributor I

Perhaps I did not explain it properly, but the issue is not slow performance or poor scaling.

The issue is that the program gets completely stuck when I add the !$OMP directives to the above code. It does nothing. So there must be some kind of "conflict" that makes the problem occur. The Windows build runs without any issue.

 

 

jimdempseyatthecove
Honored Contributor III
565 Views

I would suggest you check to ensure that the oneAPI OpenMP shared library from your current version of the oneAPI compiler is what is loaded at runtime. In other words, make sure that no prior version of the runtime library is being used.

 

I do not see anything wrong with your code, other than it might not be optimal. (The 2nd and 3rd loops can be combined, and the 4th, 5th, and 6th loops can be combined.)
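For illustration, here is a sketch of the 2nd and 3rd loops fused into one (same variables as in your first post; the epsM loops combine the same way):

!$OMP DO
      do J = 1,Nvec
         epsN(iel+J) = nx(iel+J)*U21(J) + ny(iel+J)*U22(J)
     *               + nz(iel+J)*U23(J) - nx(iel+J)*U11(J)
     *               - ny(iel+J)*U12(J) - nz(iel+J)*U13(J)
      end do !J
!$OMP END DO

One loop now makes one pass over the arrays instead of two, and there is one less worksharing construct (and implicit barrier) to synchronize.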

 

Does the program hang when you compile without optimizations?

Does a Debug build with runtime array bounds checking throw an error?

 

How large is Nvec?

 

Jim Dempsey

Ioannis_K_
New Contributor I

Thank you Jim. The problem was indeed due to an incompatibility of the library loaded at runtime. My code no longer hangs, but - as Ron had also expected - the execution with 128 threads on my Linux cluster is much slower than with, say, 10 threads on my desktop PC.

 

The value of Nvec is between 128 and 2048, and it is hardcoded in the source (I do not provide it as input at runtime).

I noticed that you said my code is not optimal. I had thought that having separate DO loops might make it easier for the compiler to optimize/vectorize etc.

 

So, if I have the following 2 loops:

 

!=====================================================  begin code

      DO J = 1,NVEC
         A(J) = .....        ! code defining each component of vector A()
      END DO !J

      DO J = 1,NVEC
         B(J) = A(J) + 5     ! code using the components of A()
      END DO !J

!====================================================== end code

 

Would it be better to merge all such loops together? I am asking because I have routines which involve dozens of such loops, and I was under the impression that it might be better to explicitly separate the loops from each other...
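(To be clear, by "merging" I mean something like the following:)

!=====================================================  begin code

      DO J = 1,NVEC
         A(J) = .....        ! code defining each component of vector A()
         B(J) = A(J) + 5     ! code using A(J) from the same iteration
      END DO !J

!====================================================== end code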

Ron_Green
Moderator

Your code looks functional to me.

There is no bug in the OpenMP runtime that would cause a hang.

So the next questions are:

What linux OS distro and version?

ifort or ifx?

What version of ifort or ifx?

 

And humor me and run the Linux version at 8 threads, 16 threads, and 32 threads.  Capture the runtimes for each.

jimdempseyatthecove
Honored Contributor III

At your smallest size (Nvec=128) and a thread count of 128, each thread executes only 1 iteration of each loop in the parallel region.

In order to experience any improvement with threading, the computational load distributed to each thread must overcome the thread-management overhead.

Each loop body has one statement with, say, 5 operations (some variables require 2 indexing operations). Hardly any work.

At Nvec of 2048 and a thread count of 128, each thread would get 16 iterations of the loop (2048 / 128 = 16). This is still hardly enough to overcome the loop overhead.

 

So, as designed and as used (Nvec = 128:2048), 128 threads will be too many. As @Ron_Green recommended, run a test using different numbers of threads to locate the best count (for the respective sizes).

 

By reworking your code (fusing the loops) you should get better performance.

 

You can also experiment with creating arrays of size = thread count (0:nThreads-1), named say iBegin and iEnd, and pre-computing the Nvec ranges for each thread. In other words, do the partitioning outside your (first post) parallel region. Note that, depending on Nvec, the per-thread ranges may not all be the same size:

!$omp parallel private(iThread, i) num_threads(whatWorksBestForProblemSize)
      iThread = omp_get_thread_num()      ! requires: use omp_lib
      do i = iBegin(iThread), iEnd(iThread)   ! this thread's pre-computed slice
         ...
      end do
!$omp barrier
      do i = iBegin(iThread), iEnd(iThread)
         ...
      end do
!$omp barrier
      ...
!$omp end parallel
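A sketch of that pre-computation (chunk, nrem, and ipos are illustrative names; iBegin and iEnd are dimensioned 0:nThreads-1 as described above):

      chunk = Nvec / nThreads          ! base iterations per thread
      nrem  = mod(Nvec, nThreads)      ! leftover iterations
      ipos  = 1
      do iThread = 0, nThreads-1
         iBegin(iThread) = ipos
         iEnd(iThread)   = ipos + chunk - 1
         if (iThread < nrem) iEnd(iThread) = iEnd(iThread) + 1
         ipos = iEnd(iThread) + 1
      end do

The first mod(Nvec,nThreads) threads each take one extra iteration, so the slices cover 1:Nvec exactly.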

Jim
