how to use cilk efficiently within shared variables

Ivan_M_ · ‎03-29-2013

Hello,

I have the problem with using cilk plus. could you help me, please ?

G class contains the matrix (as a vector) GenoMatrix[nIndividuals * nMarkers], where nInidividuals = 5000 and nMarkers = 50000

there is also the int * Criteria::solution[nIndividuals] variable of the Criteria class

I try to use cilk plus to accelerate the code but I have a data race problem with shared vectors (solution, GenoMatrix) and I don't know how to solve it correctly.

Thank you for your aid !

[cpp]

int Criteria::UncoveredMarkers(int alleleType)

{
cilk::reducer_opadd<int> presented(0);

cilk_for (int j=0; j<G->nMarkers; ++j)

{

for (int i=0; i<G->nIndividuals; ++i)
if ( (solution == alleleType) && (G->getMarker(i,j) == alleleType) )
{
++presented;
break;
}
}

return G->nMarkers - presented.get_value();
}

[/cpp]

Jim_S_Intel · ‎03-29-2013

Hi, I don't see anything out of the ordinary about your use of the reducer, but I can't see the complete program, so I'm not sure. Having some more information would help in tracking down a problem. Why do you suspect that there is a race in this loop? Can you tell us more about the behavior you are seeing when you run the program? Are you running the Cilk screen race detector, and is it pointing you at this loop? If Cilk screen detects a race, it should usually point you at two parallel accesses that cause the race. At a quick glance, the loop seems to be ok, unless there is something strange going on inside the definition of G->get_marker()? There might also be a race between the accesses in this parallel loop and other code running elsewhere? But I think that we need a little bit more information. Is it possible to construct a small program that compiles that reproduces the problem? Thanks, Jim

Ivan_M_ · ‎03-29-2013

please, find the VS project in attachments. hotspot is "int UncoveredMarkers(int)" (criteria.cpp) function

I've tried cilk and openmp the both do not give valide results:
- the computations are not valide
- the computing time is more than a sequential version of the code

I guess that there are data race and a big overhead but I don't understand how to solve it.

in the current code I delete all cilk and openmp options, in order to have a numerically valide start point.

in the "calibration.cpp" file, for the test purposes one can generate a random data matrix
[cpp]

Genotype* G = new Genotype(n_Individuals, n_Markers, NA_ratio);
//G->Save("Genotype.txt");
//Genotype* G = new Genotype(genotypeFile, n_Individuals, n_Markers);

[/cpp]

[cpp]

int Criteria::UncoveredMarkers(int alleleType)
{
int presented = 0;
for (int j=0; j<G->nMarkers; ++j)
for (int i=0; i<G->nIndividuals; ++i)
if ( (solution == alleleType) && (G->getMarker(i,j) == alleleType) )
{ ++presented; break; }
return G->nMarkers - presented;
}

[/cpp]

Jim_S_Intel · ‎04-03-2013

Hi, Sorry for the delay in responding to the post. I seem to not be getting emails about new forum posts recently. I tried running your compiling your test program on my Ubuntu desktop, (just because I am more familiar with coding on Linux), and noticed a few different items that might explain some of the nondeterminism you are seeing? I wanted to check whether these items are consistent with what is happening on your machine, or if I might have changed something by mistake in my running of the program. In trying to get the code to work, I ran into a few minor issues. I'm not sure any of these are actually contributing to the behavior you are seeing, but I'll report them just in case: 1. When I compiled the program on my 64-bit machine, in random.h, I get "integer operation result is out of range" warnings in the calculation of "RANDMAX + 1", possibly because it is interpreting the denominator as an int instead of a floating point value? I cast the denominator to get rid of the warning. 2. The readGenotype method and saveGenotype methods on Linux complain about passing a string object to file.open, instead of a const char*. I changed it to pass in fileName.c_str(). 3. I couldn't find a "Genotype.txt" file in the download package. Thus, in the "readGenotype" method, I think the program was reading in a garbage value for "allele" instead of something from file. When I ran the code through a memory checker (valgrind in my case), it was complaining about reading this uninitialized value in criteria.cpp. I initialized "allele" with a dummy value for now (i + j). 4. In random.h, I noticed in the constructor, the random number generator is being seeded with the time, instead of a deterministic seed. That was causing the serial program to execute nondeterministically for me. After I changed #4, I seemed to get repeatable results in running your code. I changed "presented" to a reducer in UncoveredMarkers, and still seemed to be seeing repeatable results. I tried running the code through cilkscreen as well, and it did not find any races. To do so, I needed to cut down the problem size considerably, and I'm not quite sure I necessarily did that correctly for your application. Is it possible that the nondeterministic seeding of the random number generator explains the problem you are seeing? On a different note, is it possible to parallelize the 3 calls to UncoveredMarkers in fObj by spawning them, instead of (or in addition to) parallellizing within UncoveredMarkers? If you can, it is generally more efficient to exploit task parallelism at coarser rather than finer granularities. Cheers, Jim

jimdempseyatthecove · ‎04-04-2013

Can you show your member funciton getMarker(i,j)?

The way your loop is written, you want to assure getMarker(i,j) and getMarker(i+1,j) are in adjacent memory locations for good cache locality.

Jim Dempsey

Ivan_M_ · ‎04-08-2013

[cpp]

int *GenoMatrix;

void Genotype::generateGenotype()
{
int size = nIndividuals * nMarkers;
GenoMatrix = new int[size];
Random rng;

//cilk_for(int i=0; i<size; ++i)
for(int i=0; i<size; ++i)
{
if (rng.randFloat01() > NA_ratio)
{
float r = rng.randFloat01();
if (r < 0.33f)
GenoMatrix = AA;
else if (r>= 0.66f)
GenoMatrix = BB;
else
GenoMatrix = AB;
}
else
{
GenoMatrix = NA;
}

//printf("%d\n", GenoMatrix);
}
return;
}

Genotype::Genotype(int nIndividuals, int nMarkers, float NA_ratio)
{
// set sizes
Genotype::nIndividuals = nIndividuals;
Genotype::nMarkers = nMarkers;
Genotype::NA_ratio = NA_ratio;

// init alleles
AA = 0;
AB = 1;
BB = 2;
NA = 999;

// generate matrix
generateGenotype();
}

int Genotype::getMarker(int const index, int const position)
{
return GenoMatrix[nMarkers*index + position];
}

[/cpp]

for the test purpose genotype matrix is generated randomly, buy also it can be read from txt file.

jimdempseyatthecove · ‎04-09-2013

[cpp]

// outer loop
cilk_for (int j=0; j<G->nMarkers; ++j)
{
// inner loop
for (int i=0; i<G->nIndividuals; ++i) // v inner lcv on left
if ( (solution == alleleType) && (G->getMarker(i,j) == alleleType) )
...
// inner lcv ----------------------v outer lcv -----v
int Genotype::getMarker(int const index, int const position)
{
// inner lcv taking large stride
return GenoMatrix[nMarkers*index + position];
}
[/cpp]

At issue is as the inner loop iterates it is taking a large stride through GenoMatrix.
Consider swapping the index method by using:

[cpp]
return GenoMatrix[nIndividuals * position + index];
[/cpp]

Jim Dempsey

Jim_S_Intel · ‎04-09-2013

Jim is right, that you may see a performance benefit in swapping your index method.

I tried it quickly though, and I did not notice much of an impact in terms of getting speedup from parallelization. With regards to parallelization, I think the bigger issue in this case is that each call to the UncoveredMarkers function does not do enough work to be parallelized effectively.

When I tried running with the following parameters: "Genotype.txt 30 250 0.85 0.75 100 100", I noticed that "nMarkers = 250" and "nIndividuals=30". There seemed to be little work being done in each call to UncoveredMarkers. When I timed each call in the serial code, each call was taking < 20 us to finish. In this case, I would expect the overheads of distributing such fine-grained work are going to prevent you from seeing any gains for parallelizing within this call.

With Cilk Plus, you generally want to parallelize the "outermost" functions in the code, i.e., the ones closest to main, because those are the ones that usually correspond to the largest tasks, and parallelizing them adds the least overhead. Here are a few other suggestions:

1. I noticed that in algorithm.cpp, in the Run() method, you have a for loop over each of the NP individuals. I couldn't quite tell whether there was a depedency between iterations, but if you can parallelize that loop, that would probably have the biggest effect. (I'm assuming that there is a dependency between generations, and there is no way to restructure your problem to parallelize that loop.) You might need to create a separate "trial" array for each individual, and perhaps the "fnc" variable could be a max reducer?

2. Another thing to watch out for if you parallelize that loop, is that the random-number generator (RNG) should not be shared between iterations. Sharing the RNG would be (a) nondeterministic, because there is a race on accessing the RNG state, and the random numbers you get might depend on how the execution is interleaved, and (b) it may hurt performance because of the sharing.

In your case, it is might not be too bad to create a RNG for each individual at the beginning of the program?
But as a side note, I might as well mention that I have actually written a recent article about deterministic random-number generation in Cilk Plus. :) If you are using a recent version of icc, we have posted some library code online that you can use as well.
http://software.intel.com/en-us/articles/a-dprng-for-cilk-plus

The library code online is intended to make it easier to take a normal RNG, which might be declared as a global variable and accessed in many places in a program, and replace that RNG with a deterministic parallel RNG. This approach avoids the sharing problems without requiring the user to figure out how manually to duplicate the RNGs to avoid the sharing.

3. You might also try parallelizing the fObj method:

int xAA = cilk_spawn UncoveredMarkers(G->AA);
int xAB = cilk_spawn UncoveredMarkers(G->AB);
int xBB = cilk_spawn UncoveredMarkers(G->BB);
cilk_sync;
return xAA + xAB + xBB + ksi*(nAA + nAB + nBB);

It might help a little bit, if you are running on a small number of workers (e.g., less than 4), but parallelizing the loop over individuals should be much more beneficial.

Ivan_M_ · ‎04-10-2013

Dear Jim S. thank you for your detailed explanations and advices. I'll try all this.

Here is I'll publish the explanations of Jim D. about outer/inner loops

Your GenoMatrix is a single block of memory with size = nIndividuals * nMarkers

You must make a design decision as to which (marker or individual) is to lie in adjacent physical locations
(marker,individual), (marker+1, individual), (marker+2, individual), ...
.OR.
(marker, individual), (marker, individual+1), (marker, individual+1), ...

Whichever you choose to use as adjacent in memory, then that index should be the inner loop control variable.

Change your indexing for getMarker(int const index, int const position) and putMarker(int const index, int const position) to both use the same indexing scheme [nMarkers*index + position] with inner loop varying position .OR. [nIndividuals * position + index] with inner loop varying index. This may necessitate swapping loop orders.

In your original code snip, the inner and outer loop control variables were reversed. One of the two loops is non-optimal.

in 2nd follow-up with getMarker, it illustrated that you used the outer loop control variable to make adjacent steps. And subsiquently the inner loop control variable was making large strides. This is not cache line friendly.

Second point, never include code containing critical sections within your benchmarking (unless that code is required for your applicaiton). In your sample code you were using a random number generator. These typically have a critical section, and thus are not representative of parallel code without a critical section.

Jim Dempsey

Ivan_M_ · ‎04-11-2013

I resume what I did.
I've changed indexing in getMarker and putMarker (see Jim Dempsey advice). It helps to accelerate.
I tested cilk_spawn in fObj, it increases the speed, but will not suitable on server with core >4 (my case)
Now I'm trying to parallize the second loop for in algorithm.cpp Run() function.

To have a rapid solution with random numbers, I used NP array of random class.
I created two arrays current(pop) and new(newpop) to store new solutions into newpop, and make swap of pop and newpop and the end of each generation g.
A add cilk::min_index reducer (b) in order to save the best solution.
I tried tbb::mutex for the line
[cpp]

if (fnc <= fpop[best]) best = i; // update best

[/cpp]
it gave compilation error.
the final result is not deterministic! and I cannot understand why.

I attached the algorthm.cpp file

I have some other questions to Jim S. not concerning this algorithm but more general ones. Jim, I read your paper, as well as its scientific version with the ppt presentation, it is very interesting and I will try your DPRNG a little bit later. I'm very curious what kind of scheduling algorithm is used for dynamic multitreading? would Cilk technology applicable to XeonPhi co-processor? thank you!

Jim_S_Intel · ‎04-11-2013

I took a quick scan through your code (but didn't try running or compile anything). I noticed a few things which might explain some of the behavior you are seeing.

1. The use of the reducer_min_index reducer "b" is not quite right. There are calls to b.get_value() and b.set_value() within the cilk_for loop, which is a race. You should only call the "get_value / get_index / set_value / set_index" methods on the reducer outside the parallel region, after all the updates to the reducer are complete. Instead, inside the parallel region (the cilk_for), you should call "calc_min" to update the value.

The "get_value" method is really intended to be used only to get the value out when you are done. When the get_value() method is called inside the cilk_for, you may not get the right answer.

I have attached a simple code example that uses reducer_min_index, which may help.

2. When you created the multiple RNG objects, did you also remember to modify your random.h, to use "rand_r" instead of just "rand"? If you are still using rand, then you still have multiple RNG objects wrapping the same underlying rand(), instead of creating separate random number generators.

3. I'm not quite sure what error you are getting using tbb::mutex, but once you have the reducer working correctly, you should not need locks.

4. If you try out the DPRNG code, I'd be interested to know how it works, if you run into any problems, or have suggestions for improvements! :)

5. The different variants of Cilk (including Cilk Plus) are instances of dynamic multithreading languages / platforms. Many of these platforms use a randomized work-stealing scheduler for load balancing. There are a few entries in our FAQ about how the Cilk Plus runtime does task scheduling.

http://www.cilkplus.org/faq/20

6. Cilk Plus does work on the Xeon Phi co-processor. We also talk a little bit more about it in our FAQ.
http://www.cilkplus.org/faq/26

Cheers,

Jim S.

Jim_S_Intel · ‎04-12-2013

Oops, it looks like I forgot to attach the example program correctly. It should be attached now...
Cheers,

Jim

Ivan_M_ · ‎04-16-2013

Jim Sukha (Intel) wrote:
If you try out the DPRNG code, I'd be interested to know how it works, if you run into any problems, or have suggestions for improvements! :)

Jim, when you compared your DPRNG with MT algorithm, which implementation of the MT you used ?

I asked this question because there are so many implementations: SFPMT, SSE2_MT,... and many more

Jim_S_Intel · ‎04-16-2013

When we ran our experiments, we were comparing against the implementation in the GNU scientific library. (gsl_rng_mt19937)

http://www.gnu.org/software/gsl/manual/html_node/Random-number-generator-algorithms.html

It has been a while, but I believe we chose that one mostly out of convenience, rather than for any significant technical or performance reasons. Cheers,

Jim

Ivan_M_ · ‎04-16-2013

is it possible to run DPRNG with v11 of compilers ?

[cpp]

#if defined(__INTEL_COMPILER)
# if (__INTEL_COMPILER < 1300)
# pragma message("__INTEL_COMPILER is" __INTEL_COMPILER)
# error "<cilkpub/pedigrees.h> may not work using Intel compilers before version 13.0"
# endif
#endif

[/cpp]

Jim_S_Intel · ‎04-16-2013

The DPRNG code relies on pedigrees, which is a feature that is implemented only in recent versions of the Intel compiler that support Cilk Plus. Do you know which version of icc / icl are you using? (What is the version string printed out at the command line?)

The code that is currently posted only works with 13.0 compilers or later, because it uses an updated interface for accessing pedigrees.
There is a way to access pedigrees in 12.1 compilers, and in principle it should be possible to port the code to work with that version, but I haven't had a chance to try that yet.

Are you using Cilk Plus with a v11 compiler? I was under the impression that Cilk keywords were not even implemented yet until at least the 12.0 compiler. The DPRNG code is designed to work only in the context of Cilk Plus.

Jim

Ivan_M_ · ‎04-17-2013

ok I verified, I have 12.1.3.300 build 20120130