Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

nested omp parallelization

pilot117
Beginner
600 Views
Hi,
If one process launches two threads (A1 and A2), can one of them (say A2) launch 8 more threads (B1, ..., B8), so that a total of 9 threads run in parallel?
Currently, my simple test code shows that executing A1, waiting for it to finish, and then executing A2 (which launches 8 threads) is much faster than launching A1 and A2 simultaneously. But I am not sure whether my code does this correctly, or how to use nested OpenMP efficiently.
Thanks,
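A minimal sketch of the pattern being asked about, assuming nested parallelism has to be switched on explicitly (many OpenMP runtimes disable it by default, in which case the inner region gets a team of one). The function name `count_leaf_threads` is hypothetical; `omp_set_max_active_levels` is the standard call (older code uses `omp_set_nested(1)` or `OMP_NESTED=true`). The serial stand-ins only exist so the sketch also builds without OpenMP.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-ins so the sketch also builds without OpenMP. */
static int omp_get_thread_num(void) { return 0; }
static void omp_set_max_active_levels(int n) { (void)n; }
#endif

/* Counts the threads that take part in the leaf work: outer thread 0 (A1)
   plus every member of the team forked by outer thread 1 (A2). */
static int count_leaf_threads(int inner_threads)
{
    assert(inner_threads > 0);
    int leaves = 0;
    omp_set_max_active_levels(2);          /* allow one nested level */
    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0) {
            #pragma omp atomic
            leaves += 1;                   /* A1 */
        } else {
            /* A2 forks the inner team and becomes its thread 0 (B1). */
            #pragma omp parallel num_threads(inner_threads)
            {
                #pragma omp atomic
                leaves += 1;               /* B1 .. Bn */
            }
        }
    }
    return leaves;
}
```

Called as `count_leaf_threads(8)` and built with OpenMP enabled, the result can be as high as 9 (A1 plus B1..B8), but the runtime is free to grant fewer threads than requested, so the count should be checked rather than assumed.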
3 Replies
jimdempseyatthecove
Honored Contributor III
How many hardware threads are available on your system?

Can you provide a code sketch or sample program?

Are you timing the first execution of the nested calls, or multiple executions?
(Discard the first run, then average or take the smallest of the next 5 runs.)

Jim Dempsey
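The timing advice above (discard the first run, then take the smallest of the next few) can be sketched as follows. The names `now`, `best_of_runs`, `sample_work` and `call_count` are hypothetical, and `sample_work` is only a placeholder for the code being measured. Without OpenMP the sketch falls back to `clock()`, which measures CPU time rather than wall time.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
static double now(void) { return omp_get_wtime(); }   /* wall-clock seconds */
#else
#include <time.h>
/* Fallback so the sketch also builds without OpenMP; measures CPU time. */
static double now(void) { return (double)clock() / CLOCKS_PER_SEC; }
#endif

static int call_count = 0;   /* hypothetical: counts calls to the placeholder */

/* Hypothetical placeholder for the code being measured. */
static void sample_work(void)
{
    volatile double x = 0.0;
    for (long i = 0; i < 200000; ++i)
        x += (double)i * 0.5;
    ++call_count;
}

/* Runs fn once untimed (warm-up), then `runs` timed repetitions,
   and returns the smallest elapsed time in seconds. */
static double best_of_runs(void (*fn)(void), int runs)
{
    assert(runs > 0);
    fn();                                  /* first run is discarded */
    double best = -1.0;
    for (int i = 0; i < runs; ++i) {
        double t0 = now();
        fn();
        double dt = now() - t0;
        if (best < 0.0 || dt < best)
            best = dt;
    }
    return best;
}
```

For example, `best_of_runs(sample_work, 5)` calls the placeholder six times in total and reports the fastest of the last five. The warm-up matters for OpenMP because the first parallel region typically pays for creating the thread pool.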
pilot117
Beginner
Hi,

I have a four-core CPU. The test code looks like this:

sequential call:
---------------------------------------------------------------------------------------
double start=omp_get_wtime();
myHeavyFunc();                    // runs to completion before the parallel region starts
omp_set_num_threads(4);
#pragma omp parallel
{
    unsigned int thread_id = omp_get_thread_num();
    if(thread_id==0)
        func();
    if(thread_id==1)
        func();
    if(thread_id==2)
        func();
    if(thread_id==3)
        func();
}
double finish=omp_get_wtime();    // end time, taken after the region's implicit barrier
printf("test time is %e\n",finish-start);
---------------------------------------------------------------------------------------

nested call:

---------------------------------------------------------------------------------------
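double start=omp_get_wtime();
omp_set_num_threads(2);
#pragma omp parallel
{
    unsigned int thread_id = omp_get_thread_num();
    if(thread_id==0)
        myHeavyFunc();
    if(thread_id==1){
        // the inner team is only created if nested parallelism is enabled
        // (e.g. OMP_NESTED=true or omp_set_nested(1))
        omp_set_num_threads(4);
        #pragma omp parallel
        {
            // thread numbers restart at 0 inside the inner team
            unsigned int thread_id = omp_get_thread_num();
            if(thread_id==0)
                func();
            if(thread_id==1)
                func();
            if(thread_id==2)
                func();
            if(thread_id==3)
                func();
        }
    }
}
double finish=omp_get_wtime();    // end time, taken after the outer region's implicit barrier
printf("test time is %e\n",finish-start);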
---------------------------------------------------------------------------------------

The sequential call is faster according to the timing in the pseudocode above. But when I use 3 threads in the inner OpenMP region of the nested call, it gets faster, as expected (assuming that four threads occupying four cores is the best case).

In my real code, myHeavyFunc() in fact does nothing but launch a GPU kernel. So although it is "heavy", the work is done on the GPU side, and that thread is not supposed to occupy any CPU resources. I don't know whether the OS will put that thread back in the pool and allocate the hardware resources to the other CPU compute threads.

I hope this gives you a rough idea of what I am doing. Thanks for the help!
jimdempseyatthecove
Honored Contributor III
Two things:

1) How many cores (or HT hardware threads) are on your system?

2) Add the following in front of your timed section of code:

---------------------------------------------------------------------------------------
// warm-up: same nesting shape as the timed code, but with empty work,
// so the thread teams already exist when the timed section starts
omp_set_num_threads(2);
#pragma omp parallel
{
    unsigned int thread_id = omp_get_thread_num();
    if(thread_id==0)
        doNothing();
    if(thread_id==1){
        omp_set_num_threads(4);
        #pragma omp parallel
        {
            unsigned int thread_id = omp_get_thread_num();
            if(thread_id==0)
                doNothing();
            if(thread_id==1)
                doNothing();
            if(thread_id==2)
                doNothing();
            if(thread_id==3)
                doNothing();
        }
    }
}
---------------------------------------------------------------------

Now run your timed section of code.

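The next thing to do is to time each thread individually. Use an array to hold the timings, and be wary of the reuse of thread_id: thread numbers restart at 0 inside the inner region, so thread_id alone is not a unique array index.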

Jim Dempsey
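One possible way to follow the per-thread timing suggestion, as a sketch under stated assumptions: each thread writes into its own slot of an array, and inner thread numbers are offset by one so they cannot collide with the outer thread 0. The names `spin`, `time_nested`, `INNER_THREADS` and `SLOTS` are hypothetical, `spin` is a placeholder workload, and the serial stand-ins only exist so the sketch also builds without OpenMP.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#else
#include <time.h>
/* Serial stand-ins so the sketch also builds without OpenMP. */
static int omp_get_thread_num(void) { return 0; }
static void omp_set_max_active_levels(int n) { (void)n; }
static double omp_get_wtime(void) { return (double)clock() / CLOCKS_PER_SEC; }
#endif

#define INNER_THREADS 4
#define SLOTS (1 + INNER_THREADS)  /* slot 0: outer thread 0; slots 1..4: inner team */

/* Hypothetical placeholder workload; the volatile local keeps the loop alive. */
static double spin(long n)
{
    volatile double x = 0.0;
    for (long i = 0; i < n; ++i)
        x += (double)i * 0.5;
    return x;
}

/* Fills elapsed[] with per-thread seconds; slots of threads that never ran stay at -1. */
static void time_nested(double elapsed[SLOTS])
{
    assert(elapsed != 0);
    for (int i = 0; i < SLOTS; ++i)
        elapsed[i] = -1.0;
    omp_set_max_active_levels(2);              /* allow one nested level */
    #pragma omp parallel num_threads(2)
    {
        int outer_id = omp_get_thread_num();
        if (outer_id == 0) {
            double t0 = omp_get_wtime();
            spin(2000000);                     /* stands in for the heavy function */
            elapsed[0] = omp_get_wtime() - t0;
        } else {
            #pragma omp parallel num_threads(INNER_THREADS)
            {
                int inner_id = omp_get_thread_num();   /* restarts at 0 here */
                double t0 = omp_get_wtime();
                spin(200000);                  /* stands in for func() */
                elapsed[1 + inner_id] = omp_get_wtime() - t0;
            }
        }
    }
}
```

After `time_nested(elapsed)` returns, comparing `elapsed[0]` with the inner slots shows which thread is holding up the outer region. A slot that is still -1 means the runtime did not create that thread, which is itself a useful hint that nesting is disabled or the team was trimmed.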