If one process launches two threads (A1 and A2), can one of those threads (say A2) then launch 8 more threads (B1, ..., B8), so that a total of 9 threads run in parallel?
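To make the layout concrete, here is a minimal sketch of the nesting described, assuming OpenMP in C++. The names `run_nested` and `NestedResult` are hypothetical, and this is not the timed test code. It only records how many inner threads the runtime actually grants, since the inner team silently falls back to one thread if nested parallelism is not enabled.

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical result record, used only for this illustration.
struct NestedResult {
    int outer_tasks;  // outer sections that ran (A1 and A2, so 2)
    int inner_tasks;  // inner loop iterations that ran (B1..B8, so 8)
    int inner_team;   // team size the runtime actually granted inside A2
};

NestedResult run_nested(int inner_threads) {
    assert(inner_threads >= 1);
#ifdef _OPENMP
    // Allow two active levels of parallelism; without this (or the
    // equivalent environment setting) the inner region gets one thread.
    omp_set_max_active_levels(2);
#endif
    int outer_tasks = 0, inner_tasks = 0, inner_team = 1;

    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        {   // A1: a single outer task
            #pragma omp atomic
            ++outer_tasks;
        }
        #pragma omp section
        {   // A2: opens the inner team B1..B8
            #pragma omp atomic
            ++outer_tasks;

            #pragma omp parallel num_threads(inner_threads)
            {
#ifdef _OPENMP
                #pragma omp single
                inner_team = omp_get_num_threads();
#endif
                #pragma omp for
                for (int b = 0; b < 8; ++b) {
                    #pragma omp atomic
                    ++inner_tasks;
                }
            }
        }
    }
    return {outer_tasks, inner_tasks, inner_team};
}
```

Calling `run_nested(8)` and inspecting `inner_team` shows whether 8 inner threads were really created. If it reports fewer, the nested and sequential timings are not comparing the same thing.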
Currently, my simple test code shows that executing A1, letting it finish, and then executing A2 (which launches 8 threads) is much faster than launching A1 and A2 simultaneously. But I am not sure whether my code does this correctly, or how to use nested OpenMP efficiently.
The sequential call is faster according to the timing given in the pseudocode above. But when I use 3 threads in the inner OpenMP region of the nested call, it gets faster, as expected (assuming that four cores occupied by four threads is the best case).
In my real code, myHeavyFunc() in fact does nothing but launch a GPU kernel. So although it is "heavy", the work is done on the GPU side, and that thread is not supposed to occupy any CPU resources. I don't know whether the OS will keep that thread in the pool but allocate the hardware resources to the other, CPU-bound threads.
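The situation above can be sketched without a GPU, under one loud assumption: that the thread waiting on the kernel really blocks instead of spin-waiting. Whether that holds depends on how the GPU runtime is configured to wait, so treat this as an illustration only. `fake_gpu_call`, `cpu_chunk`, and `run_overlap` are hypothetical names, and a sleep stands in for the kernel launch and wait inside myHeavyFunc().

```cpp
#include <cassert>
#include <chrono>
#include <thread>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical stand-in for a GPU launch plus wait. Sleeping models a
// blocking wait, during which the OS can give the core to other threads.
static int fake_gpu_call(int wait_ms) {
    std::this_thread::sleep_for(std::chrono::milliseconds(wait_ms));
    return 1;
}

// Hypothetical CPU-bound work for inner thread `id`.
static long cpu_chunk(int id) {
    long s = 0;
    for (long i = 0; i < 1000; ++i) s += i * (id + 1);
    return s;
}

struct OverlapResult {
    int gpu_done;  // 1 once the simulated GPU call has returned
    long cpu_sum;  // combined result of the 8 CPU chunks
};

OverlapResult run_overlap(int cpu_threads, int wait_ms) {
    assert(cpu_threads >= 1 && wait_ms >= 0);
#ifdef _OPENMP
    omp_set_max_active_levels(2);  // enable the nested inner region
#endif
    int gpu_done = 0;
    long cpu_sum = 0;

    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        {   // A1: mostly idle on the CPU while the "GPU" works
            gpu_done = fake_gpu_call(wait_ms);
        }
        #pragma omp section
        {   // A2: CPU-bound inner team
            #pragma omp parallel for num_threads(cpu_threads) reduction(+:cpu_sum)
            for (int b = 0; b < 8; ++b) cpu_sum += cpu_chunk(b);
        }
    }
    return {gpu_done, cpu_sum};
}
```

If the waiting thread really sleeps, the inner team can use every core. If the runtime spin-waits instead, the waiting thread occupies a core, which would be consistent with 3 inner threads performing better on four cores.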
I hope this gives you a rough idea of what I am doing. Thanks for the help!