Hello,
I am new to Xeon Phi and Cilk, and I can't find a way to run an inclusive scan (like the one in Thrust, for example) in C using Cilk.
Does something like this exist? (I am not asking just about a sum/reduce.)
Or does someone have code for this?
Thanks!
I haven't seen so much as a Cilk(tm) Plus extension of partial_sum(). It shows only about a 10% performance gain over scalar C code in my MIC tests. Granted, if running such algorithms on MIC is important to you, additional optimization is sorely needed.
The gcc cilkplus variety has narrower limits on reducer syntax than Intel Cilk(tm) Plus. The notation may not become popular if it isn't supported consistently across these various flavors.
An initial brute-force method might be to try accumulate() inside a cilk_for(), or, for better performance, an omp parallel for outer loop with a simd reduction inner loop, bearing in mind the KMP_BLOCKTIME delay when transitioning from an OpenMP parallel region to the Cilk(tm) Plus parallel library.
My benchmark is 25 years old; I don't think this "inclusive scan" terminology was invented nearly that long ago. "Scan" always used to mean something more like what it did in Cobol, for us old fogies.
Hello,
thanks for the suggestions.
Can you please tell me one thing: how can I store the results of a reduce operation in an array?
If I have something like this:
#pragma offload target(mic) inout( A : length(N) )
{ __sec_reduce_add( A[0:N] ); }
I want to store every result in an array, and I can't figure out how.
Thanks!
If what you have is like what you've shown, then the time for the transfer in and out exceeds the time of the operation.
An alternative is to keep the pertinent data resident on the MIC. After initializing the data on the MIC, your offload for reductions would pass the data neither in nor out, but would simply perform the reduction and return the result(s). At some point you may want to extract the entirety of the data, but you may not need to do that at every integration step.
Jim Dempsey
Hello, and thanks for helping.
Because I am not experienced, can you please provide a piece of code?
OK, for the in/out I will change to:
#pragma offload target(mic) in( A : length(N) )
but how am I going to keep track of the results? Doing something like:
#pragma offload target(mic) in( A : length(N) ) out( ouRes : length(N) )
{ ouRes[0:N] = __sec_reduce_add( A[0:N] ); }
won't work as intended.
The problem is that __sec_reduce_add performs the whole summation and returns only the final result.
How can I do what I want?
Thank you.
Taking an example from my posts:
if (m > 201)
    cilk_for (int i__ = 1; i__ <= m; ++i__)
        a[i__] += __sec_reduce_add(b[i__:m]*c__
else
    .....
The threshold of 201 for invoking cilk_for is somewhat arbitrary, and probably too low for MIC. This example is already one which is rejected by gcc -fcilkplus.
As Jim mentioned, much higher thresholds are required before you want to move data between host and MIC in offload or Cilk modes.
If you think the syntax for parallelizing partial_sum() is too complicated, I won't disagree, but it might be as simple as (not tested, not optimized):
cilk_for (int i = 0; i < N; ++i)
    b[i] = __sec_reduce_add(a[0:i]);
In OpenMP, you would try some schedule variations such as guided, auto, or dynamic(2) to see what gives acceptable performance.
Hello,
Using:
cilk_for (int i = 0; i < N; ++i)
    b[i] = __sec_reduce_add(a[0:i]);
results in very poor performance, whether or not I use
#pragma offload target(mic)
In OpenMP we are using:
#pragma omp parallel for reduction(+:res)
for (int i = 0; i < N; i++) { res += ioA[i]; }
So I am not sure how/if we can use an array instead, like:
res[i] += ioA[i];
Thank you.
Is your serial code like the following?
In my test on MIC B0, for N == 1000, the embarrassingly simplified cilk_for takes 10x as long as the single-thread partial_sum. I suppose optimizing this is an academic subject of master's theses and the like, and I don't care to reinvent wheels which appear to have been re-rolled in the published summaries of the CUDA versions.
Source for the g++ OpenMP partial_sum may be found in a g++ installation, e.g.
/usr/local/gcc5.0/include/c++/5.0.0/parallel/partial_sum.h
which should be activated according to
https://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode_using.html
(although I don't see anything useful happening).
I don't find an equivalent in the icc installation.
Hello,
My serial code is:
for (int i = 0; i < N; i++) res[i] += ioA[i];
If I use:
cilk_for (int i = 0; i < N; ++i) res[i] = __sec_reduce_add(a[0:i]);
it is very, very slow. If, of course, I use just:
float res = __sec_reduce_add(a[0:N]);
(without the loop, and note I am using a[0:N]) it is very fast.
Regarding the OpenMP implementation, as I asked in a previous post, do you know anything? How can an array be used to keep the running sums?
George,
Your serial code is wrong. Use:
prior = 0;
for (i = 0; i < N; ++i)
    prior = res[i] = prior + ioA[i];
Parallelizing this is problematic because each result depends on the prior result. It is not impossible, but it is rather difficult, and it introduces some redundant additions.
You will also need cutoffs:
if (N < Nserial) doSerial(...);
else if (N < Nvector) doSerialVector(...);
else if (N < NopenMPvector) doOpenMPvector(...);
else doOffloadVector(...);
Jim Dempsey
Hello,
Actually, I meant this:
#include <stdio.h>
#include <stdlib.h>

int main( int argc, const char* argv[] )
{
    int N = 10;
    float * ioA = (float*) malloc( N * sizeof(*ioA) );
    float * Res = (float*) malloc( N * sizeof(*Res) );
    for ( int i = 0; i < N; i++ ) { ioA[ i ] = i; Res[ i ] = 0; }
    float sum = 0;
    for ( int i = 0; i < N; i++ ) { sum += ioA[ i ]; Res[ i ] = sum; }
    for ( int i = 0; i < N; i++ ) printf( "\n Res[%d] = %f\n", i, Res[ i ] );
    free( ioA );
    free( Res );
    return 0;
}
So there is no easy way to do it using OpenMP or Cilk...
OK, thank you.
George,
I re-thought the inclusive scan situation and came up with a rather good solution for vectorizing and parallelizing this "elusive algorithm". It yields roughly an 89x performance improvement on large problems (larger than 2x the sum of the LLCs) on a Xeon Phi 5110P, and over 200x on smaller problem sizes.
Jim Dempsey
Excellent!
Thank you very much!
George,
Could you report back on the performance impact this has on an actual application (yours) that uses an inclusive scan? My test program measures the raw performance difference of just this function (while varying the amount of data); it provides no means to assess how it affects an application in total.
Jim Dempsey
Hello,
Unfortunately, I no longer have access to a Xeon Phi system. If I do in the near future, I will let you know!
Thank you very much for your help and concern.