Hello,
I have a problem using Cilk Plus. Could you help me, please?
The G class contains the matrix (stored as a flat array) GenoMatrix[nIndividuals * nMarkers], where nIndividuals = 5000 and nMarkers = 50000.
There is also the int * Criteria::solution[nIndividuals] variable of the Criteria class.
I am trying to use Cilk Plus to accelerate the code, but I have a data race problem with the shared arrays (solution, GenoMatrix) and I don't know how to solve it correctly.
Thank you for your help!
[cpp]
int Criteria::UncoveredMarkers(int alleleType)
{
    cilk::reducer_opadd<int> presented(0);
    cilk_for (int j = 0; j < G->nMarkers; ++j)
    {
        for (int i = 0; i < G->nIndividuals; ++i)
            if ( (solution[i] == alleleType) && (G->getMarker(i,j) == alleleType) )
            {
                ++presented;
                break;
            }
    }
    return G->nMarkers - presented.get_value();
}
[/cpp]
Please find the VS project in the attachments; the hotspot is the int UncoveredMarkers(int) function (criteria.cpp).
I've tried Cilk and OpenMP; neither gives valid results:
- the computations are not valid
- the computing time is longer than that of the sequential version of the code
I guess there is a data race and a large overhead, but I don't understand how to solve them.
In the current code I removed all Cilk and OpenMP options in order to have a numerically valid starting point.
In the "calibration.cpp" file, for test purposes, one can generate a random data matrix:
[cpp]
Genotype* G = new Genotype(n_Individuals, n_Markers, NA_ratio);
//G->Save("Genotype.txt");
//Genotype* G = new Genotype(genotypeFile, n_Individuals, n_Markers);
[/cpp]
[cpp]
int Criteria::UncoveredMarkers(int alleleType)
{
    int presented = 0;
    for (int j = 0; j < G->nMarkers; ++j)
        for (int i = 0; i < G->nIndividuals; ++i)
            if ( (solution[i] == alleleType) && (G->getMarker(i,j) == alleleType) )
            {
                ++presented;
                break;
            }
    return G->nMarkers - presented;
}
[/cpp]
Can you show your member function getMarker(i,j)?
The way your loop is written, you want to ensure getMarker(i,j) and getMarker(i+1,j) are in adjacent memory locations for good cache locality.
Jim Dempsey
[cpp]
int *GenoMatrix;

void Genotype::generateGenotype()
{
    int size = nIndividuals * nMarkers;
    GenoMatrix = new int[size];
    Random rng;
    //cilk_for (int i=0; i<size; ++i)
    for (int i = 0; i < size; ++i)
    {
        if (rng.randFloat01() > NA_ratio)
        {
            float r = rng.randFloat01();
            if (r < 0.33f)
                GenoMatrix[i] = AA;
            else if (r >= 0.66f)
                GenoMatrix[i] = BB;
            else
                GenoMatrix[i] = AB;
        }
        else
        {
            GenoMatrix[i] = NA;
        }
        //printf("%d\n", GenoMatrix[i]);
    }
    return;
}

Genotype::Genotype(int nIndividuals, int nMarkers, float NA_ratio)
{
    // set sizes
    Genotype::nIndividuals = nIndividuals;
    Genotype::nMarkers = nMarkers;
    Genotype::NA_ratio = NA_ratio;
    // init alleles
    AA = 0;
    AB = 1;
    BB = 2;
    NA = 999;
    // generate matrix
    generateGenotype();
}

int Genotype::getMarker(int const index, int const position)
{
    return GenoMatrix[nMarkers*index + position];
}
[/cpp]
For test purposes the genotype matrix is generated randomly, but it can also be read from a text file.
[cpp]
// outer loop
cilk_for (int j=0; j<G->nMarkers; ++j)
{
    // inner loop
    for (int i=0; i<G->nIndividuals; ++i)   // inner lcv i is the first argument of getMarker
        if ( (solution[i] == alleleType) && (G->getMarker(i,j) == alleleType) )
            ...

// inner lcv = "index", outer lcv = "position"
int Genotype::getMarker(int const index, int const position)
{
    // the inner lcv takes a large stride (nMarkers) here
    return GenoMatrix[nMarkers*index + position];
}
[/cpp]
At issue is that, as the inner loop iterates, it takes a large stride through GenoMatrix.
Consider swapping the index method by using:
[cpp]
return GenoMatrix[nIndividuals * position + index];
[/cpp]
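For concreteness, here is a minimal sketch of the two affected routines after such a swap. It assumes that every routine writing GenoMatrix (putMarker, the file reader, etc.) is changed to the same scheme, and that the Cilk headers (<cilk/cilk.h>, <cilk/reducer_opadd.h>) are included; only the layout changes, the logic stays the same:
[cpp]
// Sketch only: individuals become adjacent in memory, so the inner loop is unit-stride.
int Genotype::getMarker(int const index, int const position)
{
    return GenoMatrix[nIndividuals * position + index];   // individuals adjacent
}

int Criteria::UncoveredMarkers(int alleleType)
{
    cilk::reducer_opadd<int> presented(0);
    cilk_for (int j = 0; j < G->nMarkers; ++j)            // markers: one large jump per j
    {
        for (int i = 0; i < G->nIndividuals; ++i)         // individuals: stride 1
            if ( (solution[i] == alleleType) && (G->getMarker(i,j) == alleleType) )
            {
                ++presented;
                break;
            }
    }
    return G->nMarkers - presented.get_value();
}
[/cpp]
With this layout, consecutive values of i touch consecutive elements GenoMatrix[nIndividuals*j + i], so each cache line fetched in the inner loop is fully used.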
Jim Dempsey
Jim is right that you may see a performance benefit from swapping your index method.
I tried it quickly though, and I did not notice much of an impact in terms of getting speedup from parallelization. With regards to parallelization, I think the bigger issue in this case is that each call to the UncoveredMarkers function does not do enough work to be parallelized effectively.
When I tried running with the following parameters: "Genotype.txt 30 250 0.85 0.75 100 100", I noticed that "nMarkers = 250" and "nIndividuals=30". There seemed to be little work being done in each call to UncoveredMarkers. When I timed each call in the serial code, each call was taking < 20 us to finish. In this case, I would expect the overheads of distributing such fine-grained work are going to prevent you from seeing any gains for parallelizing within this call.
With Cilk Plus, you generally want to parallelize the "outermost" functions in the code, i.e., the ones closest to main, because those are the ones that usually correspond to the largest tasks, and parallelizing them adds the least overhead. Here are a few other suggestions:
1. I noticed that in algorithm.cpp, in the Run() method, you have a for loop over each of the NP individuals. I couldn't quite tell whether there was a dependency between iterations, but if you can parallelize that loop, that would probably have the biggest effect. (I'm assuming that there is a dependency between generations, and there is no way to restructure your problem to parallelize that loop.) You might need to create a separate "trial" array for each individual, and perhaps the "fnc" variable could be a max reducer? (A rough sketch follows after these suggestions.)
2. Another thing to watch out for if you parallelize that loop, is that the random-number generator (RNG) should not be shared between iterations. Sharing the RNG would be (a) nondeterministic, because there is a race on accessing the RNG state, and the random numbers you get might depend on how the execution is interleaved, and (b) it may hurt performance because of the sharing.
In your case, it might not be too bad to create an RNG for each individual at the beginning of the program.
But as a side note, I might as well mention that I have actually written a recent article about deterministic random-number generation in Cilk Plus. :) If you are using a recent version of icc, we have posted some library code online that you can use as well.
http://software.intel.com/en-us/articles/a-dprng-for-cilk-plus
The library code posted online is intended to make it easier to take a normal RNG, which might be declared as a global variable and accessed in many places in a program, and replace that RNG with a deterministic parallel RNG. This approach avoids the sharing problems without requiring the user to figure out how to duplicate the RNGs manually to avoid the sharing.
3. You might also try parallelizing the fObj method:
[cpp]
int xAA = cilk_spawn UncoveredMarkers(G->AA);
int xAB = cilk_spawn UncoveredMarkers(G->AB);
int xBB = cilk_spawn UncoveredMarkers(G->BB);
cilk_sync;
return xAA + xAB + xBB + ksi*(nAA + nAB + nBB);
[/cpp]
It might help a little bit, if you are running on a small number of workers (e.g., less than 4), but parallelizing the loop over individuals should be much more beneficial.
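As a rough illustration of suggestion 1, here is a minimal sketch. makeTrial() and evaluate() are made-up stand-ins, not functions from your project; the two points it shows are that each iteration owns its "trial" storage and that the best objective value is combined through a reducer instead of a shared variable (if you minimize rather than maximize, reducer_min or reducer_min_index works the same way):
[cpp]
// Sketch only: private per-iteration trial storage plus a reducer for the best value.
#include <vector>
#include <cilk/cilk.h>
#include <cilk/reducer_max.h>   // header name may vary between Cilk Plus releases

static void makeTrial(int i, std::vector<int>& trial)   // placeholder trial builder
{
    for (int k = 0; k < (int)trial.size(); ++k)
        trial[k] = (i + k) % 3;
}

static int evaluate(const std::vector<int>& trial)      // placeholder objective
{
    int s = 0;
    for (int k = 0; k < (int)trial.size(); ++k)
        s += trial[k];
    return s;
}

void runOneGeneration(int NP, int nMarkers)
{
    cilk::reducer_max<int> bestFnc;          // replaces a shared "best so far" variable

    cilk_for (int i = 0; i < NP; ++i)
    {
        std::vector<int> trial(nMarkers);    // private to this iteration: no race
        makeTrial(i, trial);
        int fnc = evaluate(trial);
        bestFnc.calc_max(fnc);               // race-free reduction
    }

    int best = bestFnc.get_value();          // read the result only after the loop
    (void)best;                              // ...would feed the next generation...
}
[/cpp]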
Dear Jim S., thank you for your detailed explanations and advice. I'll try all of this.
Here I'll also post Jim D.'s explanations about the outer/inner loops:
Your GenoMatrix is a single block of memory with size = nIndividuals * nMarkers
You must make a design decision as to which (marker or individual) is to lie in adjacent physical locations
(marker, individual), (marker+1, individual), (marker+2, individual), ...
.OR.
(marker, individual), (marker, individual+1), (marker, individual+2), ...
Whichever you choose to place adjacent in memory, that index should be the inner loop control variable.
Change your indexing for getMarker(int const index, int const position) and putMarker(int const index, int const position) to both use the same indexing scheme [nMarkers*index + position] with inner loop varying position .OR. [nIndividuals * position + index] with inner loop varying index. This may necessitate swapping loop orders.
In your original code snippet, the inner and outer loop control variables were reversed; one of the two loops is non-optimal.
The second follow-up with getMarker showed that you used the outer loop control variable for the adjacent steps, and consequently the inner loop control variable was making large strides. This is not cache-line friendly.
Second point: never include code containing critical sections within your benchmarking (unless that code is required for your application). In your sample code you were using a random number generator. These typically have a critical section, and thus are not representative of parallel code without a critical section.
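To illustrate that last point, here is a minimal sketch of giving each parallel iteration its own generator state. rand_r() and the seeding scheme are assumptions for illustration only (rand_r is POSIX, so with the Microsoft toolchain a small self-contained per-iteration generator plays the same role); the idea is that nothing is shared, so there is no critical section and no lock:
[cpp]
// Sketch only: each cilk_for iteration owns its RNG state ("seed" is private).
#include <stdlib.h>
#include <cilk/cilk.h>

void processIndividuals(int NP)
{
    cilk_for (int i = 0; i < NP; ++i)
    {
        unsigned int seed = 1234u + (unsigned int)i;       // private state for individual i
        for (int k = 0; k < 10; ++k)
        {
            float r = rand_r(&seed) / (float)RAND_MAX;     // repeated draws, no shared state
            (void)r;                                       // ...use r for this individual...
        }
    }
}
[/cpp]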
Jim Dempsey
Let me summarize what I did.
I've changed the indexing in getMarker and putMarker (see Jim Dempsey's advice). It helps to accelerate the code.
I tested cilk_spawn in fObj; it increases the speed, but it will not scale on a server with more than 4 cores (my case).
Now I'm trying to parallelize the second for loop in the Run() function of algorithm.cpp.
To get a quick solution for the random numbers, I used an array of NP Random objects.
I created two arrays, current (pop) and new (newpop), so that new solutions are stored into newpop, and I swap pop and newpop at the end of each generation g.
I added a cilk::reducer_min_index reducer (b) in order to save the best solution.
I tried a tbb::mutex for the line
[cpp]
if (fnc <= fpop[best]) best = i; // update best
[/cpp]
but it gave a compilation error.
The final result is not deterministic, and I cannot understand why!
I attached the algorithm.cpp file.
I have some other questions for Jim S., not about this algorithm but more general ones. Jim, I read your article, as well as its scientific version with the PPT presentation; it is very interesting, and I will try your DPRNG a little later. I'm very curious what kind of scheduling algorithm is used for dynamic multithreading? And would the Cilk technology be applicable to the Xeon Phi co-processor? Thank you!
I took a quick scan through your code (but didn't try running or compiling anything). I noticed a few things that might explain some of the behavior you are seeing.
1. The use of the reducer_min_index reducer "b" is not quite right. There are calls to b.get_value() and b.set_value() within the cilk_for loop, which is a race. You should only call the "get_value / get_index / set_value / set_index" methods on the reducer outside the parallel region, after all the updates to the reducer are complete. Instead, inside the parallel region (the cilk_for), you should call "calc_min" to update the value.
The "get_value" method is really intended to be used only to get the value out when you are done. When the get_value() method is called inside the cilk_for, you may not get the right answer.
I have attached a simple code example that uses reducer_min_index, which may help; a minimal version is also sketched after these points.
2. When you created the multiple RNG objects, did you also remember to modify your random.h, to use "rand_r" instead of just "rand"? If you are still using rand, then you still have multiple RNG objects wrapping the same underlying rand(), instead of creating separate random number generators.
3. I'm not quite sure what error you are getting using tbb::mutex, but once you have the reducer working correctly, you should not need locks.
4. If you try out the DPRNG code, I'd be interested to know how it works, if you run into any problems, or have suggestions for improvements! :)
5. The different variants of Cilk (including Cilk Plus) are instances of dynamic multithreading languages / platforms. Many of these platforms use a randomized work-stealing scheduler for load balancing. There are a few entries in our FAQ about how the Cilk Plus runtime does task scheduling.
http://www.cilkplus.org/faq/20
6. Cilk Plus does work on the Xeon Phi co-processor. We also talk a little bit more about it in our FAQ.
http://www.cilkplus.org/faq/26
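Along the lines of point 1, here is a minimal sketch of reducer_min_index usage (the fitness values are made up): calc_min() is the only reducer call made inside the cilk_for, and get_index() / get_value() are read only after the loop has finished.
[cpp]
// Minimal reducer_min_index sketch: update with calc_min() inside the parallel
// loop, read get_index()/get_value() only after the loop is complete.
#include <cstdio>
#include <cilk/cilk.h>
#include <cilk/reducer_min.h>   // header name may vary between Cilk Plus releases

int main()
{
    const int n = 1000;
    double fpop[n];
    for (int i = 0; i < n; ++i)
        fpop[i] = (i * 37) % 101 + 0.5;                  // made-up fitness values

    cilk::reducer_min_index<int, double> best;
    cilk_for (int i = 0; i < n; ++i)
        best.calc_min(i, fpop[i]);                       // the only reducer call in the loop

    std::printf("best individual %d, fitness %f\n",
                best.get_index(), best.get_value());     // safe: all updates are done
    return 0;
}
[/cpp]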
Cheers,
Jim S.
Jim Sukha (Intel) wrote:
If you try out the DPRNG code, I'd be interested to know how it works, if you run into any problems, or have suggestions for improvements! :)
Jim, when you compared your DPRNG with the MT algorithm, which implementation of MT did you use?
I ask because there are so many implementations: SFMT, SSE2_MT, ... and many more.
When we ran our experiments, we were comparing against the implementation in the GNU scientific library. (gsl_rng_mt19937)
http://www.gnu.org/software/gsl/manual/html_node/Random-number-generator-algorithms.html
It has been a while, but I believe we chose that one mostly out of convenience, rather than for any significant technical or performance reasons. Cheers,
Jim
Is it possible to run the DPRNG with v11 of the compiler?
[cpp]
#if defined(__INTEL_COMPILER)
# if (__INTEL_COMPILER < 1300)
# pragma message("__INTEL_COMPILER is" __INTEL_COMPILER)
# error "<cilkpub/pedigrees.h> may not work using Intel compilers before version 13.0"
# endif
#endif
[/cpp]
The DPRNG code relies on pedigrees, which is a feature implemented only in recent versions of the Intel compiler that support Cilk Plus. Do you know which version of icc / icl you are using? (What is the version string printed at the command line?)
The code that is currently posted only works with 13.0 compilers or later, because it uses an updated interface for accessing pedigrees.
There is a way to access pedigrees in 12.1 compilers, and in principle it should be possible to port the code to work with that version, but I haven't had a chance to try that yet.
Are you using Cilk Plus with a v11 compiler? I was under the impression that Cilk keywords were not even implemented yet until at least the 12.0 compiler. The DPRNG code is designed to work only in the context of Cilk Plus.
Jim
OK, I verified: I have 12.1.3.300, build 20120130.
