Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

a couple of questions on parallel_invoke

Petros_Mamales
Beginner
344 Views
Hi,
I would like to clarify a couple of points with parallel_invoke :
a) From what I see it is "limited" (with MIC in mind) to 10 arguments (functors). Is there a real limitation (other than the overloads simply not being provided) to using more? Will it be extended?
Could something like this, for example, work? (Please be patient, I am way out of my league here.) Schematically:
#include <vector>
#include <tbb/task_group.h>

// FuncBase is assumed to be a base class with a virtual operator()().
void my_parallel_invoke( std::vector< FuncBase * > const & v ) {
    tbb::task_group grp;
    for ( std::size_t i = 0; i < v.size(); ++i )
        grp.run( [&v, i] { (*v[i])(); } );  // one task per functor
    grp.wait();
}
(Again, this is very much pseudo-code, please don't get outraged ;-).) (The silly vector of pointers to a base functor can easily be replaced by a boost fusion vector, etc.)
Is there anything more to parallel_invoke that I am missing ?

b) Can I call MKL from parallel_invoke? (MKL is OpenMP-based.) I understand from the documentation that it will not be optimal, but I really have no other option here, since MKL comes to me as a binary, and calling a multithreaded MKL function should still be better than shutting down OpenMP, no?
The documentation provides some coverage of the situation where TBB is called from OpenMP. Here the situation is the opposite, and the impression I got is that this should be innocuous. Is my understanding correct?
(The same question applies to parallel_for.)
TYVMIA,
Best Regards,
P-
5 Replies
RafSchietekat
Valued Contributor III
One of the limitations on scalability is the distribution of work, and even with its limited number of arguments parallel_invoke() does its best to avoid that problem by using recursive parallelism. If you want to make your own implementation with an unlimited number of functions, you could leverage parallel_for()... but then you don't really need a my_parallel_invoke() layer at all.

Composability of TBB with OpenMP is a murky area, and I would prefer somebody else to comment on it.
Petros_Mamales
Beginner
Raf,
Thank you very much for the reply.
If I understand you properly, there is more to parallel_invoke than a simple fork-join. Using parallel_for is not, conceptually, a natural choice, as my functors can be doing entirely different things.
But then, on a MIC machine, how is one to run many (more than 10) tasks at once using TBB?
Or would something as naive as the "glorified" fork-and-join above probably be OK? (Although the lack of scalability really suggests it would not be ;-).)
Finally, can someone pick up the OpenMP/TBB interoperability question for the situation where TBB calls OpenMP (MKL FFT routines and linear equation solvers in particular; assume no memory allocators are called from the MKL routines)?
Thank you very much,
Petros
RafSchietekat
Valued Contributor III
"Using parallel_for is not, conceptually, a natural choice, as my functors can be doing entirely different things."
The only criterion should be efficiency in (programming and) execution. You'll need some glue, but with a lambda it should be almost trivial. Unless the functors are known to be roughly equally expensive, do consider simple_partitioner with an appropriate grainsize.

parallel_invoke() conceptually does just that, except that, with a limited number of cases each having a fixed number of separate arguments, it is easier to just hardcode each case; but I doubt you'll notice much if any efficiency improvement.
Petros_Mamales
Beginner
Raf,
Thank you very much for responding. OK then, parallel_for it is.
Is this the solution TBB offers on the MIC architecture as well?
And, more importantly, can I call multithreaded (OpenMP) LAPACK solvers (MKL) from, say, parallel_for/parallel_invoke?
Can somebody respond to this, please. This is very important.
TYVMIA
Petros

Update: Can someone please answer on TBB calling OpenMP threads (MKL)?
Anton_M_Intel
Employee

Hi,

parallel_invoke and the other TBB algorithms work the same way on all platforms; we don't make any platform-specific changes.

For the MKL part, the MKL User's Guide recommends that if MKL is called from other threaded code (e.g., pthreads, TBB, etc.), MKL be run sequentially, i.e., without OpenMP parallelism turned on. The TBB team supports that recommendation as well, for a number of reasons, mostly related to the semantics of OpenMP and how they cause composability issues (with other threading models) for nested parallelism.
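One common way to follow this recommendation is to link the application against MKL's sequential threading layer instead of the OpenMP one, so MKL calls made from inside TBB tasks never spawn their own OpenMP teams. A hedged sketch of a Linux link line follows; the exact library names and order depend on your MKL version, platform, and interface (consult the MKL Link Line Advisor for your configuration):

```shell
# Illustrative only: link against mkl_sequential rather than an
# mkl_*_thread library, so MKL runs single-threaded inside TBB tasks.
g++ my_app.cpp -o my_app \
    -ltbb \
    -lmkl_intel_lp64 -lmkl_sequential -lmkl_core \
    -lpthread -lm
```

If relinking is not an option (MKL chosen at runtime), limiting MKL's thread count to 1 (for example via the MKL_NUM_THREADS environment variable) achieves a similar effect.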

Thanks
