Hi Folks - I've been working on getting a high-performance scientific application working with the Intel compiler suite (icc (ICC) 11.1 20100806) on a 20-core Intel box running RedHat 6. The program has run successfully for years using the Gnu compiler suite. The code is already heavily threaded, but I thought I'd add the -parallel compiler option and see what else could be had. However, this has caused the program to hang at random iterations. It's always in the same function. This function is executed by multiple threads simultaneously via a thread pool. No synchronization occurs between the simultaneous instances of the function. When the program hangs, all but one of the instances of this function return, and top shows 100% CPU for the process, indicating that that thread is still running. The code is a large loop, and the hang never occurs at the same iteration of the loop.
If I reduce the size of the thread pool to 1, the run always completes; with 2, it usually completes. Increasing the size of the thread pool leads to more consistent and earlier hangs. And finally, after systematically removing compiler flags, I've found that removing the -parallel flag appears to eliminate the problem.
So it seems like the optimizations introduced by the -parallel flag are interacting badly with this manually threaded code. Any suggestions on what I might do to resolve this? It's not a show-stopper, since I can just leave out that flag. But I'm curious what's going on and what this option could do for the performance.
Compiler command line now looks like this:
icc -xHOST -O3 -no-prec-div -c -O3 -Werror -DNDEBUG <snip lots of includes> -MMD -MP -MF build/Release_intel/GNU-Linux-x86/_ext/1472/sieve.o.d -o build/Release_intel/GNU-Linux-x86/_ext/1472/sieve.o ../sieve.cc
Link looks like this:
icc -xHOST -O3 -no-prec-div -o dist/myprog <snip lots of object files> -L/home/user/lib -L/usr/lib/mysql -L/usr/lib64/mysql -lmysqlclient -lpthread -lz -lgfortran -ltoollib_nopvm_i -lboost_iostreams -lifport -lifcoremt
There is one small bit of fortran in there - just noticed that I'm still linking against the Gnu fortran library - will go take that out now.
The -parallel option asks the compiler to find opportunities for adding OpenMP parallelism. If you wish it to avoid threading that conflicts with yours, you would need #pragma noparallel to prevent auto-parallelization of your parallel blocks. You could use -par-report or -opt-report to check where the compiler has attempted parallelization.
While OpenMP on Linux uses pthreads to set up its own thread pool, the best chance for it to coexist with your previously set up pthreads pool would be to call omp_get_num_threads() before reaching any OpenMP or auto-parallelized region.
Sergey Kostrov wrote:
>>...It's always in the same function. This function is executed by multiple threads simultaneously via a thread pool.
>>No synchronization occurs between the simultaneous instances of the function...
It is really hard to tell what could be wrong, but because synchronization objects are not used ( I've asked myself how that is possible in a multi-threaded application?.. ) it could be one of the reasons for your problems. I don't think that the Intel C++ compiler option -parallel will do synchronization for manually created and managed threads.
The function in question doesn't have to synchronize access to shared resources within the function - the threads are basically independently accessing different parts of a large array, so the separation of the data is explicit. Of course, the next level up waits on the return of calls, and other portions of the code use mutexes to control access to shared memory.
edit - ok, on remembering a bit more, that's not quite true. There is a shared queue of result objects with a mutex wrapped around it.
I have had luck using Intel Inspector to find threading errors in a parallel program. There's documentation available here: http://software.intel.com/en-us/articles/intel-parallel-studio-xe-2011-for-linux-documentation#inspector
In his first reply, TimP mentions "-parallel option asks the compiler to find opportunities for adding OpenMP parallelism."
The following are unfounded assertions:
The -parallel option is (was) intended for use on a single-threaded application, to automatically introduce a measure of parallelism. The code generated may (emphasis on may) assume the application is otherwise void of threading, and in particular of OpenMP threading. As such, a potential implementation may omit the formal OpenMP enter-parallel-region and exit-parallel-region steps. In particular, it may omit an omp_in_parallel() test to determine whether it should bypass (when nested parallelism is off) or create a new parallel region. What may be happening here is that multiples of your application threads (non-auto-parallelized threads) are entering an auto-parallelized loop, and all are attempting to use the same auto-parallelized thread pool under the assumption that all threads of that pool are available. IOW, some portion (slice) of the auto-parallelized loop never completes.
You may have better success using #pragma omp... from within the main thread and/or from OMP tasks spawned from there. Then restrict your pthreads from using #pragma omp.... Depending on the activity of your pthreads, you may have to tune the number of OpenMP threads. While you can make a hybrid application, it will take a little more care than adding a switch.