I have parallelized matrix multiplication and image convolution algorithms using OpenMP and TBB, and I am trying to check how well these models scale as the number of cores goes from 1 to 8. To control the number of threads I call "omp_set_num_threads(n)" for OpenMP and construct "task_scheduler_init TBBinit(n)" for TBB. I am using the Intel compiler. For n=1 with convolution, OpenMP shows no overhead and performs as well as the serial version (to my surprise), while TBB performs worse than serial and only starts to do better when I choose n>1, which is what I would expect.
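To make the convolution comparison concrete, here is a minimal sketch of a serial loop next to an OpenMP version whose thread count is set with "omp_set_num_threads(n)". It is not the actual benchmark code: the 1D "valid" convolution, the function names convolve_serial and convolve_omp, and the float data type are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical 1D "valid" convolution: out[i] = sum over k of in[i + k] * kernel[k].
std::vector<float> convolve_serial(const std::vector<float>& in,
                                   const std::vector<float>& kernel) {
    assert(!kernel.empty() && in.size() >= kernel.size());
    std::vector<float> out(in.size() - kernel.size() + 1, 0.0f);
    for (std::size_t i = 0; i < out.size(); ++i) {
        float acc = 0.0f;
        for (std::size_t k = 0; k < kernel.size(); ++k)
            acc += in[i + k] * kernel[k];
        out[i] = acc;
    }
    return out;
}

// Same loop nest, with the outer loop split statically across n_threads threads.
std::vector<float> convolve_omp(const std::vector<float>& in,
                                const std::vector<float>& kernel,
                                int n_threads) {
    assert(!kernel.empty() && in.size() >= kernel.size());
#ifdef _OPENMP
    omp_set_num_threads(n_threads);  // thread count for subsequent parallel regions
#else
    (void)n_threads;                 // built without OpenMP: the pragma below is ignored
#endif
    std::vector<float> out(in.size() - kernel.size() + 1, 0.0f);
    const long n = static_cast<long>(out.size());
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i) {
        float acc = 0.0f;
        for (std::size_t k = 0; k < kernel.size(); ++k)
            acc += in[static_cast<std::size_t>(i) + k] * kernel[k];
        out[static_cast<std::size_t>(i)] = acc;
    }
    return out;
}
```

Calling convolve_omp(in, kernel, 1) runs the same arithmetic as convolve_serial(in, kernel) inside a one-thread parallel region, which is the n=1 case being compared here.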
The weird thing is matrix multiplication. With the optimization flag "-O0" (i.e. optimizations disabled), TBB with n=1 is slightly slower than the serial version, which is the natural overhead, but OpenMP takes exactly as long as the serial version, which suggests it incurs no overhead. For the same n=1 with the compiler flag "-O1", OpenMP is even faster than the serial version, while TBB is still slower than serial on one thread. With "-O3", TBB is still slower than serial for n=1, but now OpenMP runs twice as fast as the serial version :) What is happening here? I am using static scheduling, i.e. schedule(static), in OpenMP. Does that mean OpenMP programs with static scheduling have NO OVERHEAD at all? Or how else can this be explained?
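For reference, the kind of comparison described above can be sketched as follows. This is an illustration under stated assumptions, not the code that produced the numbers: the row-major Matrix alias, the functions matmul_serial, matmul_omp and elapsed_ms, and the plain i-j-k loop order are all hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

using Matrix = std::vector<double>;  // hypothetical layout: row-major n x n

// Flat index of element (row, col) in a row-major n x n matrix.
inline std::size_t idx(int row, int col, int n) {
    return static_cast<std::size_t>(row) * static_cast<std::size_t>(n) +
           static_cast<std::size_t>(col);
}

// Serial baseline: c = a * b.
void matmul_serial(const Matrix& a, const Matrix& b, Matrix& c, int n) {
    assert(a.size() == idx(n, 0, n) && b.size() == a.size() && c.size() == a.size());
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int k = 0; k < n; ++k) sum += a[idx(i, k, n)] * b[idx(k, j, n)];
            c[idx(i, j, n)] = sum;
        }
}

// Same loop nest with the rows divided statically among n_threads threads.
void matmul_omp(const Matrix& a, const Matrix& b, Matrix& c, int n, int n_threads) {
    assert(a.size() == idx(n, 0, n) && b.size() == a.size() && c.size() == a.size());
#ifdef _OPENMP
    omp_set_num_threads(n_threads);
#else
    (void)n_threads;  // built without OpenMP: the pragma below is ignored
#endif
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int k = 0; k < n; ++k) sum += a[idx(i, k, n)] * b[idx(k, j, n)];
            c[idx(i, j, n)] = sum;
        }
}

// Wall-clock time of one call to work(), in milliseconds.
template <typename F>
double elapsed_ms(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

A measurement for n=1 would then time matmul_serial and matmul_omp(..., 1) on the same inputs at each optimization level, for example with elapsed_ms([&] { matmul_omp(a, b, c, n, 1); }), and confirm that both produce the same result matrix before comparing times.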