
No speedup with TBB and Cilk Plus sorting algorithms

Daniel_L_1
Beginner

I cannot get any speedup with TBB and Cilk Plus sorting algorithms on Xeon Phi, namely tbb::parallel_sort(), cilkpub::cilk_sort_in_place(), and cilkpub::cilk_sort(). I have tried 2, 4, 16, 61, and 122 threads. With the very same program, the speedups on the 16-core Xeon host are excellent. The compiler is the same (Intel 15.0.2); the only differences are the -mmic command-line argument and linking against the MIC libraries. Any ideas?

 

jimdempseyatthecove
Honored Contributor III

Are you sure the number of threads you are trying to use is in fact the number of threads actually in use?
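
One quick way to check is to ask the runtimes themselves. A minimal sketch (assuming the Cilk Plus runtime API's __cilkrts_get_nworkers() and TBB's task_scheduler_init::default_num_threads(); verify the exact calls against your headers):

#include <iostream>
#include <cilk/cilk_api.h>            // Cilk Plus runtime API
#include <tbb/task_scheduler_init.h>  // TBB scheduler

int main()
{
    // Number of workers the Cilk Plus runtime will actually use.
    std::cout << "Cilk workers: " << __cilkrts_get_nworkers() << std::endl;
    // Number of threads TBB creates by default on this machine.
    std::cout << "TBB default threads: "
              << tbb::task_scheduler_init::default_num_threads() << std::endl;
}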

Is this an offload or native application? (-mmic would seem to indicate native)

When ssh-ing to the mic, in preparation for running your application, what does "export" report for your environment variables?

Jim Dempsey

Daniel_L_1
Beginner

I used the top command to monitor the number of running threads, and it seemed to match the requested numbers. It's a native application. Export says:

declare -x EDITOR="/bin/vi"
declare -x HOME="/home/langrd"
declare -x LD_LIBRARY_PATH="/home/langrd/lib:"
declare -x LOGNAME="langrd"
declare -x MAIL="/var/mail/langrd"
declare -x OLDPWD
declare -x OPIEDIR
declare -x PATH="/usr/local/bin:/usr/bin:/bin"
declare -x PS1="\\u@\\h:\\w\\\$ "
declare -x PWD="/home/langrd"
declare -x QPEDIR
declare -x QTDIR
declare -x SHELL="/bin/bash"
declare -x SHLVL="1"
declare -x SSH_CLIENT="172.31.1.254 42547 22"
declare -x SSH_CONNECTION="172.31.1.254 42547 172.31.1.1 22"
declare -x SSH_TTY="/dev/pts/0"
declare -x TERM="xterm"
declare -x USER="langrd"

where in the lib directory I have libcilkrts.so.5 and libiomp5.so (both copied from the host's intel/lib/mic/), and libtbb.so.2 (copied from the host's tbb/lib/mic/).

jimdempseyatthecove
Honored Contributor III

Why is there a trailing ":" on LD_LIBRARY_PATH="/home/langrd/lib:"? (An empty entry in LD_LIBRARY_PATH is treated as the current directory.)

What happens when you run something simple like the Cilk Plus Fibonacci sample program? (Run micsmc on the host to check thread activity on the MIC.)

Jim Dempsey

Daniel_L_1
Beginner

Jim,

Fibonacci scales well. I use the very same settings in my sorting program, i.e., __cilkrts_set_param("nworkers", ...). However, even if I remove this call, neither tbb::parallel_sort() nor cilk_sort() scales at all.

I used micsmc: for Fibonacci, the desired number of cores was utilized, but for TBB and Cilk Plus, only a single core was utilized during sorting. I can't figure out why. As I wrote, the same code scales well on the host machine.

The LD_LIBRARY_PATH was set to the TBB, Cilk Plus, and Intel OpenMP shared libraries because the MIC was not fully installed at the time and the file system was not shared with the host. Now it is shared, so I no longer use this setting.

jimdempseyatthecove
Honored Contributor III

What is being sorted?

What are your compare and swap functions?

Jim Dempsey

Daniel_L_1
Beginner

The sorted data are randomly generated 64-bit integers. I do not specify custom compare/swap functions. The code is quite simple, something like:

std::vector<uint64_t> v(n);
... // generate random numbers
#ifdef TBB
tbb::task_scheduler_init init(sort_num_threads);  // limit TBB to sort_num_threads threads
tbb::parallel_sort(v.begin(), v.end());
#else
std::string s_temp = std::to_string(sort_num_threads);
__cilkrts_set_param("nworkers", s_temp.c_str());  // limit Cilk Plus workers
cilkpub::cilk_sort_in_place(v.begin(), v.end());
// cilkpub::cilk_sort(v.begin(), v.end());
#endif

jimdempseyatthecove
Honored Contributor III

Does your timed interval include the generation of the random numbers?

Random number generators typically contain a critical section. There are parallel versions of RNGs, but the standard ("vanilla flavor") ones contain a critical section, which serializes the random number generation. IOW, nearly all of your wall-clock time may be consumed generating the random numbers. BTW, the more threads you have, the longer it takes to get through each call to the random number generator.
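
For illustration, each thread can own a private generator instead of sharing the hidden global state behind std::rand. A minimal sketch (fill_chunk is a hypothetical helper, not from this thread; C++11 <random> engines keep their state in the object, so no critical section is needed):

#include <cstdint>
#include <random>
#include <vector>

// Fill one chunk of v from a private generator; threads calling this
// concurrently on disjoint chunks share no RNG state and never serialize.
void fill_chunk(std::vector<uint64_t>& v, std::size_t begin, std::size_t end,
                uint64_t seed)
{
    std::mt19937_64 gen(seed);  // per-thread state, no lock
    std::uniform_int_distribution<uint64_t> dist;
    for (std::size_t i = begin; i < end; ++i)
        v[i] = dist(gen);
}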

Consider timing just the sort portion of the program.

Jim Dempsey

Daniel_L_1
Beginner

I measure the runtime of the sorting functions only. Moreover, as I mentioned, for some unknown reason both the TBB and Cilk Plus sorting algorithms utilize only a single core on the Xeon Phi (use only a single thread, as observed with micsmc). On the host they use as many threads as I want.

jimdempseyatthecove
Honored Contributor III

Strange. I suggest you use the same source file and makefile, and copy them to the MIC, just as you do for the failing sort test program, with this modification:

std::vector<uint64_t> v(n);
... // generate random numbers
std::cout << "sort_num_threads " << sort_num_threads << std::endl; // add
#ifdef TBB
tbb::task_scheduler_init init(sort_num_threads);
// tbb::parallel_sort(v.begin(), v.end());                  // comment out
YourParallelFibButWithout_task_scheduler_init();            // add (add function too)
#else 
std::string s_temp = std::to_string(sort_num_threads);
__cilkrts_set_param("nworkers", s_temp.c_str());
cilkpub::cilk_sort_in_place(v.begin(), v.end());
// cilkpub::cilk_sort(v.begin(), v.end());
#endif

And compile for TBB.

If the above test fails to use more than 1 thread, then you know something is wrong with your configuration (environment variables, copied libraries, etc.).

It is odd that both TBB and Cilk are exhibiting this behavior.

Jim Dempsey

Daniel_L_1
Beginner

Jim, thanks for the hint; it seems I have finally found the cause of the problem. For performance reasons, I generated the random numbers for sorting in parallel using OpenMP. Whenever an OpenMP parallel region precedes Cilk Plus or TBB code on (our) Xeon Phi, something goes wrong. Consider the following code:

#include <algorithm>   // std::generate
#include <chrono>
#include <cstdlib>     // std::rand
#include <iostream>
#include <vector>
#include <cilk/cilk_api.h>  // __cilkrts_set_param()
#include <cilkpub/sort.h>   // Cilkpub sort; header path may vary by installation

int main(int argc, char* argv[])
{
/*
#pragma omp parallel num_threads(1)
    std::cout << "Parallel region..." << std::endl;
*/
    std::vector<unsigned long> v(100000000);
    std::generate(v.begin(), v.end(), std::rand);

    __cilkrts_set_param("nworkers", argv[1]);

    auto start = std::chrono::system_clock::now();
    cilkpub::cilk_sort_in_place(v.begin(), v.end());
    std::chrono::duration<double> duration = std::chrono::system_clock::now() - start;
    std::cout << "Duration: " << duration.count() << " s" << std::endl;
}

The measured sorting times are 45 seconds for 1 thread and 12 seconds for 61 threads. But when I uncomment the OpenMP directive, the sorting time becomes 201 seconds for 61 threads! (Note that only 1 thread is used in the OpenMP region.)

For TBB, the problem is similar: 40 seconds for 1 thread and 4 seconds for 61 threads without OpenMP; with OpenMP, 102 seconds for 61 threads.

What is strange is that the behaviour is different on the CPU-based host machine. Is this a compiler/library bug, is something wrong with our Xeon Phi, or is it generally not a good idea to combine Cilk Plus/TBB with OpenMP?

Daniel Langr

TimP
Honored Contributor III

Could you be incurring more blocktime delays on the coprocessor? The KMP_BLOCKTIME default is the same on both, but OpenMP holds the logical processors for this interval; a Cilk worker can't use them until the blocktime expires.

I limit OpenMP threads and Cilk workers to 118 each, so the effect of varying blocktime may not be as strong as it could be.

jimdempseyatthecove
Honored Contributor III

>>or is it generally not a good idea to combine Cilk Plus/TBB with OpenMP?

The problem is that each threading paradigm creates its own thread pool (with the possible exception that Cilk and TBB might create a shared thread pool). This leads to oversubscription (meaning more software threads than hardware threads).

To complicate the issue of oversubscription, a typical (near-universal) design of pooled threading systems is that, at the end of a parallel region (or barrier), the additional threads of the thread team go into a compute-intensive spin-wait for up to some set number of iterations (typically derived from iterations per millisecond). If, during the spin-wait, the master thread of the team enters a new parallel region, the spin-waiting threads of the team immediately exit the spin-wait. This eliminates (reduces) the need for the master thread to make a system call to the O/S to wake up the other threads of the team. Should the spin-wait time expire, the other threads suspend themselves and wait for the master thread to call the O/S to wake them up.

The "easy" practice is to stick with only one threading paradigm.

That said, if you are careful with your programming, it is sometimes advantageous to use multiple threading paradigms. The trick is in how best to manage the spin-wait time. As Tim suggests, one method is to set KMP_BLOCKTIME to 0. Caution: while your test program will indicate that setting KMP_BLOCKTIME to 0 "fixes" the problem, you will invariably find this is not a universal solution (or you will unknowingly miss optimization opportunities).
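
For example, a minimal sketch assuming the Intel OpenMP runtime (kmp_set_blocktime() is an Intel extension declared in omp.h; the equivalent outside the code is to export KMP_BLOCKTIME=0 in the environment):

#include <iostream>
#include <omp.h>

int main()
{
    // Ask the OpenMP runtime to release its workers immediately after a
    // parallel region instead of spin-waiting.
    kmp_set_blocktime(0);

    #pragma omp parallel num_threads(1)
    std::cout << "Parallel region..." << std::endl;

    // Cilk Plus / TBB work that follows no longer competes with
    // spin-waiting OpenMP worker threads for the cores.
}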

Daniel, this is not to criticize you in particular, but rather a comment for other "new to parallel programming" programmers reading this forum....

Not telling the forum that you were mixing OpenMP with Cilk in your first post led to a significant number of fruitless replies, and delays for you. In any initial post, it is extremely important to provide enough information for experienced people to work out the solution to the problem.

There is the relatively recent colloquial phrase "going sideways". This is typically represented in the movies by a scene framed inside a car driving along a street, when a truck runs through an intersection and crashes into the car. Your post #11 is that truck.

Jim Dempsey

Daniel_L_1
Beginner

Tim, Jim,

thank you very much for the explanation; I didn't know anything about this spin-wait issue, and it is quite important for me to become familiar with it for future development. The code discussed here is only a benchmark for comparing different sorting implementations, so I simply resolved the issue by removing the parallel random number generation.

Sorry for not mentioning the mixing of threading paradigms earlier; I was confused by the different behavior on the MIC and on the host, so it didn't occur to me at first that this could be the problem (I expected it to be in some MIC settings/setup).

Daniel
