Hello,
I am new to Xeon Phi and Cilk, and I can't find a way to run an inclusive scan (like the one in Thrust, for example) in C using Cilk.
Does something like this exist? (I am not asking just about a sum/reduce.)
Or does someone have code for this?
Thanks!
I haven't seen so much as a Cilk(tm) Plus extension of partial_sum(). It shows only about a 10% performance gain over scalar C code in my MIC tests. Granted, if running such algorithms on MIC is important to you, additional optimization is sorely needed.
The gcc cilkplus variety has narrower limits on reducer syntax than Intel Cilk(tm) Plus. The notation may not become popular if it isn't supported consistently across these various flavors.
An initial brute-force method might be to try accumulate() inside a cilk_for(), or, for better performance, an omp parallel for outer loop with a simd reduction inner loop, bearing in mind the KMP_BLOCKTIME delay when transitioning from an OpenMP parallel region to the Cilk(tm) Plus parallel library.
My benchmark is 25 years old; I don't think this "inclusive scan" terminology was invented nearly that long ago. "Scan" always used to mean something more like what it did in Cobol, for us old fogies.
Hello,
thanks for the suggestions.
Can you please tell me one thing: how can I store the results of a reduce operation in an array?
If I have something like this:
#pragma offload target(mic) inout( A : length(N) )
{ __sec_reduce_add( A[0:N] ); }
I want to store every result in an array, and I can't figure out how.
Thanks!
If what you have is like what you've shown, then the time for the transfer in and out exceeds the time of the operation.
An alternative is to keep the pertinent data resident on the MIC. After initializing the data on the MIC, your offload for reductions would pass the data neither in nor out, but would simply perform the reduction and return the result(s). At some point you may want to extract the entirety of the data, but you may not need to do that at every integration step.
Jim Dempsey
Hello, and thanks for helping.
Because I am not experienced, can you please provide a piece of code?
OK, for the in/out I will change to:
#pragma offload target(mic) in( A : length(N) )
but how am I going to keep track of the results? Doing something like:
#pragma offload target(mic) in( A : length(N) ) out( ouRes : length(N) )
{ ouRes[0:N] = __sec_reduce_add( A[0:N] ); }
won't work as intended.
The problem is that __sec_reduce_add performs the whole summation and returns only the final result.
How can I do what I want?
Thank you.
Taking an example from my posts:
if (m > 201)
    cilk_for (int i__ = 1; i__ <= m; ++i__)
        a[i__] += __sec_reduce_add(b[i__:m]*c__
else
    .....
The threshold of 201 for invoking cilk_for is somewhat arbitrary, and probably too low for MIC. This example is already one which is rejected by gcc -fcilkplus.
As Jim mentioned, much higher thresholds are required before you want to move data between host and MIC in offload or Cilk modes.
If you think the syntax for parallelizing partial_sum() is too complicated, I won't disagree, but it might be as simple as (not tested, not optimized):
cilk_for (int i = 0; i < N; ++i)
    b[i] = __sec_reduce_add(a[0:i]);
In OpenMP, you would try some schedule variations such as guided, auto, or dynamic(2) to see what gives acceptable performance.
Hello,
Using:
cilk_for (int i = 0; i < N; ++i)
    b[i] = __sec_reduce_add(a[0:i]);
results in very poor performance, whether or not I use
#pragma offload target(mic)
In OpenMP we are using:
#pragma omp parallel for reduction(+:res)
for (int i = 0; i < N; i++) { res += ioA[i]; }
So I am not sure how/if we can use an array instead, like:
res[i] += ioA[i];
Thank you.
Is your serial code like the following?
In my test on MIC B0, for N == 1000, the embarrassingly simplified cilk_for takes 10x as long as the single-thread partial_sum. I suppose optimizing this is an academic subject of master's theses and the like, and I don't care to reinvent wheels which appear to have been re-rolled in the published summaries of the CUDA versions.
Source for the g++ OpenMP partial_sum may be found in a g++ installation, e.g.
/usr/local/gcc5.0/include/c++/5.0.0/parallel/partial_sum.h
which should be activated according to
https://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode_using.html
(although I don't see anything useful happening).
I don't find an equivalent in the icc installation.
Hello,
My serial code is:
for (int i = 0; i < N; i++) res[i] += ioA[i];
If I use:
cilk_for (int i = 0; i < N; ++i) res[i] = __sec_reduce_add(a[0:i]);
it is very, very slow. If, of course, I use just:
float res = __sec_reduce_add(a[0:N]);
(without the loop, and note I am using a[0:N]) it is very fast.
Regarding the OpenMP implementation, as I asked in a previous post, do you know anything? How can an array be used to keep the running sums?
George,
Your serial code is wrong. Use:
prior = 0;
for (i = 0; i < N; ++i)
    prior = res[i] = prior + ioA[i];
Parallelizing this is problematic because each result depends on the prior result. It is not impossible, but it is rather difficult, and it introduces some redundant additions.
You will also need cutoffs:
if (N < Nserial) doSerial(...);
else if (N < Nvector) doSerialVector(...);
else if (N < NopenMPvector) doOpenMPvector(...);
else doOffloadVector(...);
Jim Dempsey
Hello,
Actually, I meant this:
#include <stdio.h>
#include <stdlib.h>

int main( int argc, const char* argv[] )
{
    int N = 10;
    float * ioA = (float*) malloc( N * sizeof(*ioA) );
    float * Res = (float*) malloc( N * sizeof(*Res) );
    for ( int i = 0; i < N; i++ ) { ioA[ i ] = i; Res[ i ] = 0; }
    float sum = 0;
    for ( int i = 0; i < N; i++ ) { sum += ioA[ i ]; Res[ i ] = sum; }
    for ( int i = 0; i < N; i++ ) printf( "\n Res[%d] = %f\n", i, Res[ i ] );
    free( ioA );
    free( Res );
    return 0;
}
So there is no easy way to do it using OpenMP or Cilk...
OK, thank you.
George,
I re-thought the inclusive scan situation and came up with a rather good solution for vectorizing and parallelizing this "elusive algorithm". It yields roughly an 89x performance improvement on large problems (larger than 2x the sum of the LLCs) on a Xeon Phi 5110P, and over 200x on smaller problem sizes.
Jim Dempsey
Excellent!
Thank you very much!
George,
Could you report back on the performance impact this has on an actual application (yours) that uses an inclusive scan? My test program measures the raw performance difference of just this function (while varying the amount of data); it provides no means to assess how it affects an application in total.
Jim Dempsey
Hello,
Unfortunately, I no longer have access to a Xeon Phi system. If I do in the near future, I will let you know!
Thank you very much for your help and concern.