Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

TBB vs OpenMP on single threads

akhal
Beginner
1,082 Views

I have parallelized matrix multiplication and image convolution algorithms using OpenMP and TBB, and I was trying to check the scalability of these models as the number of cores goes from one to eight. I used "omp_set_num_threads(n)" for OpenMP and "task_scheduler_init TBBinit(n)" for TBB to control the number of threads. I am using the Intel Compiler. For n=1, in the convolution case, OpenMP shows no overhead and performs as well as the serial version (to my surprise), while TBB performs worse and only starts getting better once I choose n>1, which is natural.

The weird thing is with matrix multiplication. When I use the optimization flag "-O0", i.e. disable optimizations, TBB performs slightly worse than the serial version with n=1, which is natural overhead, but OpenMP performs exactly equal to the serial version, which means it incurs no overhead at all. For the same n=1, when I use the compiler flag "-O1", OpenMP performs better than even the serial version, while TBB still performs worse than serial for one thread. And with "-O3" optimizations, TBB is still worse than serial for n=1, but now OpenMP performs twice as fast as the serial version :) What is happening there? I am using schedule(static) in OpenMP; does that mean OpenMP programs with static scheduling have NO OVERHEAD at all, or how can this be explained?
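For reference, a minimal sketch of how the two thread-count controls named above might be wired up (the matrix kernel, array names, and sizes are placeholders, not the code in question; the TBB side uses the legacy tbb::task_scheduler_init interface mentioned above):

#include <omp.h>
#include <tbb/task_scheduler_init.h>
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <cstddef>
#include <vector>

// Placeholder kernel: compute row i of C = A * B for n x n row-major matrices.
static void matmul_row(const std::vector<double>& A, const std::vector<double>& B,
                       std::vector<double>& C, std::size_t n, std::size_t i) {
    for (std::size_t j = 0; j < n; ++j) {
        double sum = 0.0;
        for (std::size_t k = 0; k < n; ++k)
            sum += A[i * n + k] * B[k * n + j];
        C[i * n + j] = sum;
    }
}

static void run_both(std::size_t n, int threads,
                     const std::vector<double>& A, const std::vector<double>& B,
                     std::vector<double>& C_omp, std::vector<double>& C_tbb) {
    // OpenMP: cap the team at 'threads' threads, static schedule as in the post.
    omp_set_num_threads(threads);
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < static_cast<long>(n); ++i)
        matmul_row(A, B, C_omp, n, static_cast<std::size_t>(i));

    // TBB (legacy API): cap the worker pool at 'threads' threads for this scope.
    tbb::task_scheduler_init TBBinit(threads);
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, n),
        [&](const tbb::blocked_range<std::size_t>& r) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                matmul_row(A, B, C_tbb, n, i);
        });
}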

0 Kudos
1 Solution
7 Replies
ARCH_R_Intel
Employee
1,082 Views

What OpenMP implementation are you using? Some OpenMP implementations have a special code path for a single thread, which effectively adds a single branch. Furthermore, OpenMP pragmas give the compiler more information about the lack of dependences between loop iterations, which enables some serial optimizations (notably automatic vectorization). Alas, compilers don't recognize the same implications from TBB code.

Are you using the Intel compiler 12.0 or later? If so, the desired vectorization can be hinted at by using "#pragma simd" on the loop to be vectorized. It might help to post your TBB code so people can comment on the subtle issues that apply.
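For illustration, a minimal sketch of the kind of hint being suggested, assuming a simple inner-product loop (the function and array names are made up, not the code under discussion):

// Intel compiler 12.0+: "#pragma simd" asks for vectorization of the loop
// below regardless of how the surrounding TBB body is structured (the pragma
// is Intel-specific; later compilers also accept "#pragma omp simd").
static double dot(const double* a, const double* b, int n) {
    double sum = 0.0;
    #pragma simd reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}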

0 Kudos
akhal
Beginner
1,081 Views
I have simple independent loop iterations where I use static scheduling in OpenMP. When I use static scheduling with the Intel compiler at the -O3 optimization level, there is zero overhead for one OpenMP thread and it behaves exactly like the sequential code. This is surprising to me, and I wonder whether it is really possible and, if so, what the specific reasons could be. (I haven't used #pragma simd anywhere.)
0 Kudos
akhal
Beginner
1,081 Views
Hello

Can you please elaborate a little on the statement "Some OpenMP implementations have a special code path for a single thread, which effectively adds a single branch"? Looking forward to your valuable comments, thanks.
0 Kudos
jimdempseyatthecove
Honored Contributor III
1,083 Views
Pseudo code

iterationSpace = (0,n)
if(SystemHasMoreThanOneThread) then
    StartThreadPool(parallelRegion, iterationSpace)
else
    parallelRegion(iterationSpace)
endif


Where StartThreadPool initiates a team, splits the iterationSpace into slices, and then calls parallelRegion as a function (or, in some implementations, jumps to its entry point).

Where, on a single-thread system, parallelRegion is simply called with the complete iteration space (or, in some implementations, simply jumped/branched to at its entry point).

So the above may condense to (pseudo code):

iterationSpace = (0,n)
if(!multipleThreads) goto single
ConstructThreadTeam()
iterationSpace = Split(iterationSpace, nThreads, myThreadNum)
single:
iterate on iterationSpace
if(myThreadNum .ne. 0) return
...

The total overhead becomes two if/branch statements on a single-thread system.
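A rough, self-contained C++ rendering of that fast path (the helper names are illustrative stand-ins, not an actual OpenMP runtime API):

#include <cstdio>

static int RuntimeNumThreads() { return 1; }   // stand-in: single-thread system
static int MyThreadNum()       { return 0; }   // stand-in: this thread's index in the team
static void DoWork(long i)     { std::printf("%ld\n", i); }  // the original loop body

static void parallel_loop(long n) {
    long begin = 0, end = n;
    int nThreads = RuntimeNumThreads();
    if (nThreads > 1) {                              // branch 1: skip team setup on one thread
        int me = MyThreadNum();
        long chunk = (n + nThreads - 1) / nThreads;  // static split into contiguous slices
        begin = chunk * me;
        end   = (begin + chunk < n) ? (begin + chunk) : n;
    }
    for (long i = begin; i < end; ++i)               // iterate on the (possibly whole) slice
        DoWork(i);
    if (MyThreadNum() != 0) return;                  // branch 2: non-master threads leave here
    // ... code after the parallel region runs only on the master ...
}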

Jim
0 Kudos
akhal
Beginner
1,081 Views
Thank you so much.... it was really informative!!!
0 Kudos
robert-reed
Valued Contributor II
1,081 Views
I doubt your code experiences zero overhead when using OpenMP with only the master in the thread team, but the overhead should be very small. With static scheduling, OpenMP divides the range by the number of threads in the team (and may have code to bypass the divide if the divisor is 1) and then runs the range as a single thread dispatch, so there's a little more work than with a plain serial implementation. Intel TBB with one thread, by contrast, still proceeds with its standard divide-and-conquer strategy, splitting the range in half, and then in half again, continuing down to a threshold where it switches to performing the work of the loop. Intel TBB expects the chunks of range left behind to be available for stealing by other threads, but without other threads in the pool to take up that work, all this strategy does for a single-thread pool is add overhead to the actual computation.
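To make that splitting threshold concrete, a hedged sketch (the array names, loop body, and grainsize value are assumptions): with an explicit grainsize and a simple_partitioner, TBB stops subdividing once a chunk is no larger than the grainsize, so a coarser grainsize means fewer splits and less of the overhead described above.

#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <tbb/partitioner.h>
#include <cstddef>

static void scale(const double* in, double* out, std::size_t n, double factor) {
    const std::size_t grainsize = 10000;  // assumed value; tune per kernel
    tbb::parallel_for(
        tbb::blocked_range<std::size_t>(0, n, grainsize),
        [&](const tbb::blocked_range<std::size_t>& r) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                out[i] = in[i] * factor;   // leaf work once splitting stops
        },
        tbb::simple_partitioner());
}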

The other unexpected performance results, as previously suggested, are likely due to compiler optimizations, including vectorization, that are still available in single-threaded configurations. If you are using the Intel compiler, it would be interesting to add -vec-report2 to your compiler invocation and determine which of your variants get their loops vectorized and which don't.
0 Kudos
Saint-Martin__Jerome
1,081 Views

Is OpenMP thread-local safe? I.e., is a local variable unique to each thread?
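For reference, a minimal sketch of the case being asked about (the variable names are illustrative); under standard OpenMP scoping rules, a variable declared inside the parallel region lives on each thread's own stack, so each thread gets its own copy:

#include <omp.h>
#include <cstdio>

int main() {
    #pragma omp parallel
    {
        // 'local' is declared inside the parallel region, so it is private:
        // every thread has its own independent copy.
        int local = omp_get_thread_num();
        std::printf("thread %d sees local = %d\n", omp_get_thread_num(), local);
    }
    return 0;
}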

0 Kudos
Reply