
cilk reducer_min_index 4x slower than serial program

strategeme
Beginner
Hello,

I have a problem with the performance of reducer_min_index. The Cilk version is 4 times slower than its serial counterpart. At the same time, and on the same machine, using reducer_opadd in a similar program results in the expected speed-up.

I provide the code I used below. Could you please help me?

Best regards


#include <iostream>
#include <cstdlib>
#include <ctime>

#include <cilk/cilk.h>

#include <cilk/reducer_min.h>

using namespace std;
#define CILK_USE 1

int compute(int i)
{
    srand(i);
    return (rand()%100 - (rand()%50)); // return a value computed from i
}

int main() {
    long programStart = time(0);
    int size = 10000000;

#if (CILK_USE)
    cilk::reducer_min_index<int, int> best_cost;
    best_cost.set_value(-1, 100);
#else
    int best_cost = 100;
    int best_index = -1;
#endif

#if (CILK_USE)
    cilk_for (int index = 1; index < size; index++)
#else
    for (int index = 1; index < size; index++)
#endif
    {
#if (CILK_USE)
        int val = compute(index);
        if (val < best_cost.get_value())
            best_cost.set_value(index, val);
#else
        int val = compute(index);
        if (val < best_cost)
        {
            best_cost = val;
            best_index = index;
        }
#endif
    }

#if (CILK_USE)
    cout << "cost " << best_cost.get_value() << " ind " << best_cost.get_index() << endl;
    cout << "cilk " << endl;
#else
    cout << "best cost " << best_cost << " ind " << best_index << endl;
    cout << "no cilk " << endl;
#endif

    cout << "!!!Hello World!!!" << endl; // prints !!!Hello World!!!
    cout << "PROGRAM RUNNING TIME : " << time(0) - programStart << endl;

    return 0;
}
Barry_T_Intel
Employee

To start, you're using reducer_min_index incorrectly. Instead of using

           if(val < best_cost.get_value())
              best_cost.set_value(index, val);

you should be using

           best_cost.calc_min(index, val);

which will do the comparison for you and you'll only need one reducer lookup.

Also, instead of initializing the reducer, you should trust it. reducer_min_index has a flag to indicate whether it has been set. So you can remove the line

  best_cost.set_value(-1,100);

If you want to check whether the reducer has a valid value, you can call best_cost.is_set() at the end of the calculation.
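
Putting those two changes together, the parallel part of your program would look roughly like this (just a sketch against the same reducer_min_index<int, int> declaration, not a tested drop-in replacement):

    cilk::reducer_min_index<int, int> best_cost;   // no initial set_value() needed

    cilk_for (int index = 1; index < size; index++)
    {
        int val = compute(index);
        best_cost.calc_min(index, val);   // one reducer operation per iteration
    }

    if (best_cost.is_set())
        cout << "cost " << best_cost.get_value()
             << " ind " << best_cost.get_index() << endl;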

- Barry

jimdempseyatthecove
Honored Contributor III
Do not use rand in your dummy compute routine. It contains a critical section. This will impact your multi-threaded timings (rand essentially becoming serialized).

Consider filling an array[65536] of int with return values from rand. Then have each thread use a thread-local variable initialized to a logical thread number, 0-based.

Do this outside of your timed region.

Inside the timed region, have each thread fetch its next random number using the index (i*nThreads + iThread)%65536.

This will eliminate the critical section and produce reproducible results (for any given set of threads).
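
A rough sketch of that scheme (the table size matches the suggestion above; the function and variable names are only illustrative):

    const int N = 65536;
    int randTable[N];                 // filled once, outside the timed region

    void fill_table(unsigned int seed)
    {
        srand(seed);
        for (int i = 0; i < N; i++)
            randTable[i] = rand();
    }

    // Inside the timed region each thread keeps its own fetch counter in
    // thread-local storage and indexes the shared table; rand() is never
    // called here, so there is no critical section.
    __thread int myFetch = 0;

    int next_random(int iThread, int nThreads)   // iThread is the 0-based thread number
    {
        return randTable[(myFetch++ * nThreads + iThread) % N];
    }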

Jim Dempsey
strategeme
Beginner
I would like to thank Barry for the accurate answer.

Jim, you raised an important question for me as well. I have also noticed that my (randomized) algorithms are not really reproducible (using the same random seeds). For example, the code

cilk_spawn print_rand(0, 5, seedValue0);
print_rand(1, 5, seedValue1);
cilk_sync;

with
void print_rand(int orig, int times, int seed)
{
    srand(seed);
    for(int i = 0; i < times; i++)
        cout << orig << " " << rand()%100 << endl;
}

shows exactly, it seems to me, what you mentioned (serialization of random numbers).


On the other hand the following code

#pragma omp parallel private(nthreads, tid)
{
    tid = omp_get_thread_num();
    if (tid == 0)
    {
        srand(seedValue1);
        print_rand(tid, 5);
    }
    if (tid == 1)
    {
        srand(seedValue1);
        print_rand(tid, 5);
    }
}

with

void print_rand(int orig, int times)
{
    for(int i = 0; i < times; i++)
        cout << orig << " " << rand()%100 << endl;
}

seems to always give the same sequence of numbers (still the same randomized algorithm).

Could you please point me to some resource where I can find out what really happens with the random number generator when using Cilk and/or OpenMP?

Thank you very much, Haris
Jim_S_Intel
Employee

The exact interaction of a random number generator with Cilk will depend on which random number generator you are using and how it is implemented.

A typical serial random number generator (RNG) produces a stream of random numbers serially. To produce a new random number, the RNG applies some function to transform the current state of the generator, and then derives a random number from that state. Because the generator state is transformed serially, to keep the generator thread-safe in a parallel program an RNG often uses a lock or some other kind of synchronization, which results in serialization. Setting the seed of the RNG corresponds to setting the initial state. Thus, by seeding two RNGs the same, you get the same random numbers. In Cilk, if you are using srand() and rand(), you may be using one serial RNG for the entire program.

[I'm not quite sure whether OpenMP does something different with srand(). Incidentally, your OpenMP code above is seeding both RNGs with seedValue1, while your Cilk code is seeding one with seedValue0 and the other with seedValue1. Is that intentional? Note also that some documentation for rand() explicitly states that rand() is not thread-safe, so you should double-check the OpenMP documentation that your code above is legal.]


Alternatively, you can eliminate the problem of serialization in a parallel program by creating a separate serial RNG for each thread. Then, when you want to get a random number while executing on a particular thread, you can get numbers from that thread's RNG. See the rand_r function from <stdlib.h> for one way of working with a particular generator.

If both your program and the scheduler for the parallel-programming platform you are using are structured in a way that guarantees that each call to generate a random number is always executed by the same thread at the same time relative to other RNG calls on the same thread (e.g., you know that this RNG call is always the 5th call to the RNG from thread 3), then you can get deterministic results.

In Cilk, however, the underlying scheduler will not always execute the same code on the same thread every time. For example, in the following cilk_for loop, f(1) may execute on thread 1 in one execution, but thread 3 in the next execution. Thus, your program may see different random numbers generated between different executions if each thread is maintaining a separate RNG.

cilk_for(i = 0; i < 10; ++i) {
    f(i);
}

In this particular example, what you could do is create 10 random number generators and pass the ith generator to f(i). This solution is OK if the width of your cilk_for is only 10, but it doesn't work as well when your cilk_for has n iterations, where n is large. In Cilk, you want to pick n large enough to expose sufficient parallelism to the scheduler. Unfortunately, a large n would be bad from the perspective of having a separate RNG for each iteration.
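
For the width-10 case, one simple way to express "the ith generator" is to keep one rand_r() state per iteration (a sketch only; the seed derivation is arbitrary, and f is assumed to be changed to accept the generator state it should draw from):

    #include <stdlib.h>   // rand_r

    unsigned int seeds[10];
    for (int i = 0; i < 10; ++i)
        seeds[i] = base_seed + i;      // base_seed: any fixed starting value

    cilk_for (int i = 0; i < 10; ++i) {
        // rand_r() only touches seeds[i], so the numbers iteration i sees do
        // not depend on which worker thread happens to execute it.
        f(i, &seeds[i]);
    }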

Finally, at the risk of shameless self-promotion, I will refer you to the following paper and say that there are good solutions to the problem of deterministic parallel random-number generation in platforms such as Cilk, even when n above is large. :)

http://web.mit.edu/~neboat/www/papers/dprng/dprng-ppopp12.pdf

If deterministic random-number generation is a serious problem for your Cilk application, I'd be happy to discuss the problem with you further. There are a variety of solutions (including the one described in the paper), and the simplest / best one for you may depend on your particular application.

Cheers,

Jim

jimdempseyatthecove
Honored Contributor III
Your "On the other hand" code accidentally gave the results you expected.
There is a race window between each thread's call to srand and the call to rand in print_rand.

The other Jim responding gives you some good advice.

The crux of the matter is: "What do you require of rand in a multi-threaded environment?"

It is generally required that a pseudo random number generator be predictable.

In a serial application, this means (using the same initial seed) successive calls to rand (from the seed) produce the same sequence of pseudo-random numbers. Most random number generators use something along the lines of:

    seed = next(seed);
    return partial(seed);

where the prior seed is the last produced random number (or the initial seed), next() advances the seed, and partial() returns a portion of the seed as a pseudo-random number.

The seed value is stored in static (shared) memory. As the other Jim suggested, you can make your own random number generator which holds the seed in thread-local memory.
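
A minimal sketch of such a generator, following the next()/partial() outline above (the constants are common LCG values and are only illustrative):

    static __thread unsigned int tls_seed = 1;    // one seed per thread

    void my_srand(unsigned int s) { tls_seed = s; }

    int my_rand()
    {
        tls_seed = tls_seed * 1103515245u + 12345u;   // seed = next(seed)
        return (tls_seed >> 16) & 0x7fff;             // return partial(seed)
    }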

*** However "What do you require of rand in a multi-threaded environment?"

Same sequence each time called - all threads.
Same sequence each time called - on a thread-by-thread basis.
Same sequence used until final disposition of fetched random number.

The last case would require a critical section starting just before the call to rand and held until just after the final disposition of the fetched random number, essentially serializing the parallel program for that duration.

When your code permits, an alternate method is to

enter critical section
obtain random number
mySequence = randSequence++; // mySequence in TLS
exit critical section
... do your code here...
while(nextWriteSequence != mySequence)
_mm_pause();
dispose of data
++nextWriteSequence;

Essentially you are serializing the disposition in the sequence order of that of fetching the random while permitting ... do your code here... to run in parallel.
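
Sketched a bit more concretely (the lock, do_work, and dispose_of are illustrative names, and the shared counters would need proper visibility guarantees in production code):

    #include <pthread.h>
    #include <xmmintrin.h>                        // _mm_pause

    pthread_mutex_t randLock = PTHREAD_MUTEX_INITIALIZER;
    volatile long randSequence = 0;               // ticket handed out with each random number
    volatile long nextWriteSequence = 0;          // next ticket allowed to dispose of its result

    cilk_for (int i = 0; i < size; i++)
    {
        pthread_mutex_lock(&randLock);            // enter critical section
        int r = rand();                           // obtain random number
        long mySequence = randSequence++;         // take a ticket
        pthread_mutex_unlock(&randLock);          // exit critical section

        int result = do_work(r);                  // ... do your code here, in parallel ...

        while (nextWriteSequence != mySequence)
            _mm_pause();                          // wait for my turn
        dispose_of(result);                       // dispose of data, in fetch order
        nextWriteSequence = mySequence + 1;
    }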

You will have to decide for yourself how you need to handle the random numbers and then choose the appropriate technique.

Jim Dempsey



Barry_T_Intel
Employee

"Deterministic parallel random-number generation for dynamic-multithreading platforms" (the paper linked above) discusses a technique for generating repeatable random numbers in a Cilk application using pedigrees. It is supported by Intel Cilk Plus in the 12.1 compiler.

- Barry

TimP
Honored Contributor III
I've been trying to gain an understanding of the relative role of Cilk reducers. My current understanding is that the Cilk reducers, particularly the _index ones, incur a significant overhead in order to be thread-safe at all levels of optimization. You would probably not choose a Cilk reducer unless you are likely to use it in a Cilk parallel region, such as cilk_for, where the thread safety is required and its overhead may be overcome. I don't think the reducers invoke parallelism by themselves at present, although that's not necessarily excluded.
You might compare __sec_reduce_min_ind against min_element() or open-coded C, which depend on private variables and compiler-accomplished register optimizations for thread safety. It took me a while to figure out that the Cilk reducer returns a 1-based result, analogous to Fortran, even though CEAN intentionally breaks other analogies to Algol and Fortran.
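
For anyone wanting to run that comparison, a small sketch (assuming an array a of n ints that is already filled; printing both results is an easy way to see which index convention each one uses):

    int ind_cean = __sec_reduce_min_ind(a[0:n]);       // Cilk Plus array notation
    int ind_stl  = std::min_element(a, a + n) - a;     // <algorithm>, 0-based
    std::cout << ind_cean << " " << ind_stl << std::endl;
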
An annoying factor is that the extra code for thread safety in the Cilk reducers may inhibit vectorization. If you're lucky, your own open-coded version (or inner_product and the like) may come out better optimized. STL reductions ought to come out thread-safe, since the reduction variables should be inside a private scope.
Jim_S_Intel
Employee

To understand Cilk reducers, I believe it is important to keep in mind one of their main uses, as a convenient replacement for global variables when you are trying to parallelize an existing serial program.

If you had a large serial program which was updating a global variable "x" to calculate a min, then to avoid races on x in a parallel program, you can replace x with a reducer, and change the places where you are updating x.
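
As a minimal illustration of that kind of change (the names are hypothetical; only the declaration and the update sites differ):

    // Serial original: a global minimum updated from many call sites.
    int x = INT_MAX;                      // INT_MAX from <climits>
    void visit(int v) { if (v < x) x = v; }

    // Parallelized: the global becomes a reducer (from <cilk/reducer_min.h>);
    // the functions in between do not need any new parameters.
    cilk::reducer_min<int> xr;
    void visit(int v) { xr.calc_min(v); }
    // ... after the parallel work, read the result with xr.get_value()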

There are, of course, other ways to achieve the same effect without a reducer, which might have better performance. For example, you might be able to return intermediate values up the call stack, and perform the reduction manually. But to make this change in a large application, you might have to change the prototypes for a lot of functions, just to pass this parameter up.

Intuitively, the original global variable "x" might in retrospect have been a bad idea, especially if you wanted to parallelize the code. But the pain of removing the global variable may not be worth the cost, especially if the accesses to that global variable are not where the majority of the time in the program is being spent. Reducers provide a way of easily eliminating the race on "x", without introducing the additional synchronization overhead of acquiring a lock, and without requiring you to restructure a bunch of functions in your program.

That is not to say that the performance of Cilk reducers is not important. It sounds like there are still several opportunities to further improve performance, and it would be nice to do so. But it may be useful to keep in mind that there were other considerations aside from absolute performance that contributed to the original reducer design.
Cheers,

Jim

