Attribution of OS threads

Fan_Y_ · ‎04-16-2013

Suppose I have to mix different programming models. In a single program, if I call "omp_set_num_threads(nb_threads_omp)", "tbb::task_scheduler_init(nb_threads_tbb); init.initialize(nb_threads)" and "__cilkrts_set_param("nworkers", nb_threads_cilk)" together, how many OS threads will the system allocate? The max of the three, or the sum?

My second question: do the different runtimes know how to manage these independent sets of OS threads, or does each runtime simply use whatever OS threads are free, without any distinction?

Thanks in advance for your help!

Barry_T_Intel · ‎04-16-2013

Cilk and TBB both maintain global state with global pools of threads, so they play (reasonably) well together. If you use both Cilk and TBB in a program, you should get at most nb_threads + nb_threads_cilk threads.

OpenMP isn't so well behaved. If multiple Cilk workers call a function which uses OpenMP, you'll get nb_threads_cilk * nb_threads_omp threads.
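To make the arithmetic in the two cases above concrete, here is a minimal sketch. The function names are hypothetical and the results are only upper-bound estimates taken from the description above; nothing here queries the actual runtimes.

```cpp
#include <cstddef>

// Hypothetical helpers: worst-case OS thread estimates for the two
// situations described above. They do not call OpenMP, TBB or Cilk.

// Cilk and TBB each keep one global pool, so the two pools simply add up.
constexpr std::size_t pooled_upper_bound(std::size_t nb_threads_tbb,
                                         std::size_t nb_threads_cilk) {
    return nb_threads_tbb + nb_threads_cilk;
}

// Every Cilk worker that calls into an OpenMP region can get its own
// OpenMP team, so the counts multiply.
constexpr std::size_t nested_omp_upper_bound(std::size_t nb_threads_cilk,
                                             std::size_t nb_threads_omp) {
    return nb_threads_cilk * nb_threads_omp;
}
```

For example, with 8 Cilk workers and 8 TBB threads the estimate is at most 16 threads, while 8 Cilk workers each entering an OpenMP region with 4 threads could reach 32.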

There is a package called irml which attempts to limit and reuse threads between packages, but none of the packages give up threads casually under the theory that it's much faster to resume a suspended thread than to start up a new one. To use irml, you need to ensure that irml is in the path. All of the packages should find it and use it automatically if it's present. I'll pass your question on to someone who knows more about RML.

- Barry

ARCH_R_Intel · ‎04-16-2013

About the iRML (Intel Resource Management Layer): It's disabled by default. To enable it, add its location to your PATH (Windows) or LD_LIBRARY_PATH (Linux) or DYLD_LIBRARY_PATH (Mac). In the Intel compiler distributions, it's a dynamic library in a directory tbb/lib/..../irml/. To build it from the open-source distribution of TBB, run "make rml" in the top-level directory of the distribution.

The iRML attempts to tell each package (TBB, OpenMP, Cilk) how many threads to use. However, getting any benefit out of the iRML is tricky, because OpenMP has many semantic restrictions that require a certain number of threads to be allocated to it, even if that number is suboptimal. Furthermore, because OpenMP does static work distribution by default, it is much more sensitive to oversubscription issues than TBB or Cilk. If you (or anyone else) are interested in the details, let me know and I can send the User Guide for the Resource Management Layer. It's mostly about how to write OpenMP so that the semantic restrictions are lifted.

jimdempseyatthecove · ‎04-16-2013

You might find following these rules of thumb helpful for mixed threading models:

o Construct your application such that only the main thread uses/calls anything with OpenMP
o When using MKL, use the single-threaded version of MKL (each OpenMP and/or Cilk/TBB thread runs a separate instance in parallel)
o Inhibit Cilk/TBB from using OpenMP
o Experiment with KMP_BLOCKTIME, 0 may be more effective (sometimes not).
o Reduce the number of threads for each threading model. Depending on workload distribution, the total number of threads can exceed the number of HW threads. However, if both models have no idle time, then set the total to the number of HW threads, i.e. "over-subscription" only matters when the number of working threads (or spin-waiting threads) exceeds the number of hardware threads.
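The last rule of thumb can be sketched as a small budget check. This is only an illustration under my own assumptions: the helper names are hypothetical, and "busy" threads means working or spin-waiting threads as described above, which you would have to estimate or measure yourself.

```cpp
#include <cstddef>
#include <thread>

// Hypothetical helper: number of hardware threads, falling back to 1
// because std::thread::hardware_concurrency() may return 0 when the
// value cannot be determined.
inline std::size_t hardware_threads() {
    return std::thread::hardware_concurrency() != 0
               ? std::thread::hardware_concurrency()
               : 1;
}

// Over-subscription only matters when the busy (working or spin-waiting)
// threads outnumber the hardware threads.
constexpr bool oversubscribed(std::size_t busy_threads,
                              std::size_t hw_threads) {
    return busy_threads > hw_threads;
}

// Threads left for a second threading model once the first model has
// taken `used` threads; 0 if the budget is already exhausted.
constexpr std::size_t remaining_budget(std::size_t hw_threads,
                                       std::size_t used) {
    return hw_threads > used ? hw_threads - used : 0;
}
```

For instance, on a machine with 8 hardware threads where OpenMP keeps 3 threads busy, remaining_budget(8, 3) leaves 5 for Cilk/TBB. If one model is idle much of the time, a larger total may still be fine, which is why benchmarking matters.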

iRML can help, but it is not a fortune teller (cannot predetermine workloads). Benchmark tuning may be more effective.

Jim Dempsey