Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

nested parallelism with OpenMP - very good speedup, but with warning

dexter
Beginner
I have a sequential program implementing a raytracing algorithm. The calculation of every pixel in the output image is independent, and the work is done in two nested loops: the outer one going through all lines of the image and the inner one going through each pixel in that line.

I am trying to parallelize this code to get the best possible speedup. The obvious options are to parallelize one or both of those loops, or to combine them into a single loop and parallelize that.

The speedups on 16 cores (SMP) for the versions with the inner, the outer, or the combined loop parallelized and "dynamic,1" scheduling are all similar, about 15.6 (never more than 15.8).

Surprisingly, the best speedup - about or even above 16 - comes with the following (slightly strange) settings:
Both loops parallelized (omp pragmas inserted) with "dynamic,1" scheduling, but the total number of threads limited to 16 (with OMP_THREAD_LIMIT=16, which is the number of available cores). As the output image is 1000x1000 pixels, the outer loop alone uses all 16 threads, so the inner loop is not really executed in parallel. A warning is printed during execution, indicating that there are not enough threads available for the inner loop and that it will be executed with one thread:

OMP: Warning #103: Cannot form a team with 16 threads, using 1 instead.

OMP: Hint: Consider unsetting KMP_ALL_THREADS (if it is set).

OMP: Hint: This could also be due to a system-related limit on the number of threads.
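
For reference, here is a minimal sketch of the setup described above (loop bounds, names, and render_pixel() are placeholders, not the actual code); nesting has to be enabled (e.g. OMP_NESTED=1) and OMP_THREAD_LIMIT=16 set in the environment:

#include <omp.h>

void render_pixel(int y, int x);                 // hypothetical per-pixel work

void render_image(int height, int width)         // e.g. 1000 x 1000
{
#pragma omp parallel for schedule(dynamic, 1)    // outer loop: image lines
    for (int y = 0; y < height; ++y)
    {
#pragma omp parallel for schedule(dynamic, 1)    // inner loop: pixels in a line
        for (int x = 0; x < width; ++x)
            render_pixel(y, x);
    }
}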


So this version seems to be equivalent to parallelisation of the outer loop only, but the speedup is different (better in this case).


Why is that? Does it mean that the compiler somehow uses the information that the iterations of the inner loop can be executed in parallel? What exactly does the compiler do with this information when only one thread is available? Is it possible to obtain the same speedup, i.e. to give the compiler the same information in some other way, without the warning?


4 Replies
Om_S_Intel
Employee

The number of threads should be the same as the number of cores for optimum performance. If you use more threads than the available cores, the overhead increases.

You can analyze the application performance using Intel Thread Profiler to see the details.
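
As a minimal sketch of that advice (note that omp_get_num_procs() reports logical processors, which may differ from the number of physical cores on a hyper-threaded machine):

#include <omp.h>

void set_one_thread_per_core()
{
    // request exactly as many OpenMP threads as the runtime reports processors
    omp_set_num_threads(omp_get_num_procs());
}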

dexter
Beginner

The number of threads should be the same as the number of cores for optimum performance. If you use more threads than the available cores, the overhead increases.



The number of threads is the same as the number of cores; it is limited with OMP_THREAD_LIMIT=16.
Thus the outer loop gets 16 threads and the inner loop is always executed with a single thread (as the warning says).

Hence the speedup should be the same as for parallelisation of the outer loop only, but it is not: it is better with nested parallelism and a limited number of threads.
TimP
Honored Contributor III

You can analyze the application performance using Intel Thread Profiler to see the details.

Is it true that -openmp-profile is to be changed so as not to work without Thread Profiler? Or is this a promotional gimmick?
With a normal Linux x86_64 build, mixed ifort/icc, dynamically linked OpenMP, I can still do
export LD_PRELOAD=/libiompprof5.so
and the guide.gvs profile data are produced automatically. It's not even necessary to relink unless the -openmp-link static option is in use. This also works when a single-source-file gnu compiler build is linked against libiomp5, although the regions aren't labeled by source lines.
Maybe, like Windows VTune, Thread Profiler knows how to plot part of guide.gvs?

So, as to the original question, what does guide.gvs show about the comparison of parallelization methods chosen? What about increasing the chunk size, or using static or guided scheduling and setting KMP_AFFINITY?
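
As a rough illustration of those variations (the chunk size and render_pixel() are my choices, not something from the original post), one could parallelize only the outer loop with a coarser guided schedule and set KMP_AFFINITY in the environment, e.g. KMP_AFFINITY=compact or KMP_AFFINITY=scatter:

#include <omp.h>

void render_pixel(int y, int x);                 // hypothetical per-pixel work

void render_image(int height, int width)
{
    // a larger chunk than "dynamic,1" reduces scheduling overhead, while
    // guided scheduling still balances uneven per-line costs
#pragma omp parallel for schedule(guided, 8)
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            render_pixel(y, x);
}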

I didn't check whether the LD_PRELOAD works on Windows, but it would be easy to try if there's any interest. I suppose VC8/VC9 don't label parallel regions as ifort/ICL do.
jimdempseyatthecove
Honored Contributor III

Dexter,

You should not introduce more threads than cores (sometimes this is OK, but more often than not it is a bad idea). Thread context switching introduces unnecessary overhead. Using OpenMP, and assuming all 16 cores will be available, I would suggest you set the number of threads to 16 and pin each thread to a core. Depending on the version of OpenMP, you may have an environment variable to do the pinning. If not, you will have to add a startup function call to do the pinning (this is quite easily done; see the sketch just below). Once pinned, the idea is to get the same threads to work on the same pixels on each iteration.
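
As a rough Linux-only illustration of such a startup call (the one-thread-per-core mapping is an assumption; recent OpenMP runtimes can do the same thing through KMP_AFFINITY or OMP_PROC_BIND instead):

#ifndef _GNU_SOURCE
#define _GNU_SOURCE                              // for pthread_setaffinity_np
#endif
#include <pthread.h>
#include <sched.h>
#include <omp.h>

void pin_openmp_threads()
{
#pragma omp parallel
    {
        cpu_set_t cpus;
        CPU_ZERO(&cpus);
        CPU_SET(omp_get_thread_num(), &cpus);    // thread i -> core i (assumed mapping)
        pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpus);
    }
}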

Use the "#pragma omp" without "for" to start the thread pool (16 threads or n threads). Obtain the number of threads with a omp_... function call and divide up the array into tiles. Run each tile using seperate threads. And run the same tile number in each respective team member number (with non-nested parallelism in OpenMP the OpenMP team member number can be assumed to be bound to the same thread on each entry into the parallel region).

Should you find that, when running your distribution, not all 16 cores are usually available - for example, a display thread pushing data out to the GPU, or other non-application threads (ripping a CD, email, Twitter) - then you might want to use 15 threads (n - 1 threads), or cut the work into smaller tiles and use an interlocked increment to pick the next tile from whichever threads are running (see the sketch below).
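
A minimal sketch of the interlocked-increment variant (do_tile() and the tile count are hypothetical; the atomic capture form needs OpenMP 3.1 or later, and an older runtime could use a compiler intrinsic such as _InterlockedIncrement instead):

#include <omp.h>

void do_tile(int tile);                          // hypothetical per-tile work

void render_tiles(int num_tiles)
{
    int next_tile = 0;                           // shared tile counter
#pragma omp parallel
    {
        for (;;)
        {
            int my_tile;
#pragma omp atomic capture
            my_tile = next_tile++;               // the interlocked increment
            if (my_tile >= num_tiles)
                break;
            do_tile(my_tile);                    // whichever thread is free takes the next tile
        }
    }
}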

If you are running on Windows and elect to use the (soon to be available) QuickThread, you could quite easily tile within tiles:

parallel_distribute(L3$, TileSocket, &Frame);  // distribute Frame to sockets

void TileSocket(size_t TeamMemberNumber, size_t MembersInTeam, Frame_t* Frame)
{
    // within the socket, distribute to the current thread and any other waiting threads
    parallel_for(Waiting_L3$, SubTile, 0, MembersInTeam, Frame, TeamMemberNumber);
}

If you attend tomorrow's webinar, you can get additional information on QuickThread.

Jim Dempsey
