Software Archive
Read-only legacy content

adding offload pragma, performance drops

George
Beginner

Hello,

I am running an OpenMP code which looks like this:


 

#pragma omp parallel for default( none ) shared( X, Y, V, H, W, N ) private( i, x, y, Kx, Ky, initD, T )
for ( y = 0; y < H; y++ )
{
    for ( x = 0; x < W; x++ )
    {
        initD = aValue;
        for ( i = 0; i < N; i++ )
        {
            ...Kx, Ky...
            ...X, Y...
        }
        V[ x + y * Width ] = T;
    }
}

 

Now I want to run it on the MIC card, so I just add the line:

#pragma offload target(mic) in( X : length(W) ) in( Y : length(H) ) out( V : length(W * H) )

and the performance drops dramatically!

What should I pay attention to? What changes do I have to make?

 

Thank you!

 

jimdempseyatthecove
Honored Contributor III

How long does the loop run on the Host?

What are the Lengths?

Is your for( i = ... ) loop code vectorizable? If not, can you make it vectorize?

Can you fuse the y and x loops? IOW partitioning on H alone may not yield a balanced workload with 244 threads (or whatever number you are using on the MIC).
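For example, a structural sketch of fusing the two loops with OpenMP's collapse clause (this assumes the loop nest from the first post; the inner i-loop body is left elided exactly as in that fragment, and collapse requires the y and x loops to stay perfectly nested):

#pragma omp parallel for collapse(2)   // partition all W*H (x,y) iterations, not just the H rows
for ( int y = 0; y < H; y++ )
{
    for ( int x = 0; x < W; x++ )
    {
        float initD = aValue;
        float T = 0.0f;                // declared here, so implicitly private per iteration
        for ( int i = 0; i < N; i++ )
        {
            ...Kx, Ky...               // inner body as in the original fragment
            ...X, Y...
        }
        V[ x + y * Width ] = T;
    }
}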

Jim Dempsey

George
Beginner

Hello,

On the host it takes around 7-8 msec and on the MIC around 120-130 msec.

The sizes are H = 256, W = 256, N = 80.

I am using 4 threads for now (that gives better timing).

The for-i loop looks like this:

 

for ( i = 0; i < N; i++ )
{
    Kx = X[ i ] - x;
    Ky = Y[ i ] - y;
    D = Kx * Kx + Ky * Ky;

    if ( D < initD )
    {
        initD = D;
        T = V[ i ];
    }
}

I have allocated space for the arrays like:

float *X = (float*) _mm_malloc( NN * sizeof(*X), 32 );

Hmm, I saw the report and it seems that it is not vectorized...

 

loop was not vectorized: vector dependence prevents vectorization

 vector dependence: assumed ANTI dependence

It is between the lines:

T = V[ i ];

and

V[ x + y * Width ] = T;

 

and also between:

D = Kx * Kx + Ky * Ky;

and

if ( D < initD )

 

How can I overcome this?

I even used the 'restrict' keyword for the arrays, and also put #pragma ivdep above the loops.

Thank you!

TimP
Honored Contributor III

While 32-byte alignment is good for current host CPUs, MIC alignment must be to 64-byte boundaries to be of any use.
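As a minimal sketch of what that change might look like, adapting the _mm_malloc call shown above (NN is whatever length is used there; __assume_aligned is the Intel compiler's alignment hint):

float *X = (float *) _mm_malloc( NN * sizeof(*X), 64 );   // 64-byte boundary for MIC
if ( X == NULL ) { /* handle the allocation failure */ }

// The aligned allocation by itself is not enough; the compiler also has to be
// told about it at the loop that uses X:
__assume_aligned( X, 64 );

// ... use X in the loops ...

_mm_free( X );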

If you can't vectorize and use at least 32 threads effectively, running on MIC is not likely to be effective.  I find that offload mode doesn't use as many threads effectively on MIC as native mode.

George
Beginner

Ok, I changed everything from float to double and alignment to 64.

Still, I get no vectorization...

But now it shows me:

outer loop was not auto-vectorized: consider using SIMD directive

in line :

for ( y = 0; y < H; y++ )

but if I use #pragma simd above it, I receive: offload error: process on the device 0 was terminated by signal 8 (SIGFPE)

and:

vector dependence: assumed ANTI dependence between the lines

D = Kx * Kx + Ky * Ky;

and

if ( D < initD )

 

Charles_C_Intel1
Employee

Greetings:

I wish to question why you are trying to offload this code in the first place when you say it runs in "7-8 msec and on mic around 120-130 msec". If this is the only offload your code does, then before it can run on the Xeon Phi you need to create a process, start up 240 threads, allocate memory across all those threads, and so on. That takes time; you might want to measure how long it takes to offload a null statement. If this is a proxy for a real algorithm that runs for a much longer time, then it may make sense to try to offload it. Your code fragment also shows no parallelism, and where serial performance is concerned, Xeon Phi is considerably slower than Xeon.

My *personal* rules of thumb are:

  • If it takes less than 2 seconds on the host, it doesn't make sense to offload it to Xeon Phi.
  • If the code isn't well-optimized on Xeon Phi, it likely doesn't make sense to offload it to Xeon Phi.
  • Be careful to separate the timing of code execution on Xeon Phi from the time it takes to do the data transfer and offload setup, but know what all these separate times are. They can help you decide whether it makes sense to offload to Xeon Phi or not: if you have the code fully threaded and vectorized, you need to make sure the performance benefit isn't overwhelmed by talking to the card. When benchmarking something like a DGEMM, for example, you need to separate the data transfer into offload statements distinct from the compute kernel, and warm up OpenMP ahead of the DGEMM, so that you can accurately measure the true compute time (a sketch of this separation follows this list).
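A rough sketch of that separation, assuming the arrays and sizes from the earlier posts (the device number, transfer lengths, and clause choices are illustrative only; V is transferred inout here because the kernel shown earlier also reads V[ i ]):

// 1) Warm up the MIC-side OpenMP thread pool once, outside any timed region.
#pragma offload target(mic:0)
{
    #pragma omp parallel
    { /* empty region: forces thread-pool creation */ }
}

// 2) Transfer the inputs separately and keep them resident on the card.
#pragma offload_transfer target(mic:0) \
        in( X : length(N) alloc_if(1) free_if(0) ) \
        in( Y : length(N) alloc_if(1) free_if(0) )

// 3) Time only the compute offload, reusing the resident inputs.
double t0 = omp_get_wtime();
#pragma offload target(mic:0) \
        nocopy( X : length(N) alloc_if(0) free_if(0) ) \
        nocopy( Y : length(N) alloc_if(0) free_if(0) ) \
        inout( V : length(W * H) )
{
    // ... the OpenMP kernel from the first post ...
}
printf( "compute + result transfer: %f s\n", omp_get_wtime() - t0 );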

Charles

jimdempseyatthecove
Honored Contributor III
for(int timeRep = 0; timeRep < 5; ++ timeRep)
{
  double t0 = omp_get_wtime();
  #pragma offload target ...
  {
  ...
  }
  printf("%d = %f\n", timeRep, omp_get_wtime() - t0);
}

Your first offload will incur the overhead of:

a) copying the .exe to the MIC
b) starting the MIC OpenMP thread pool
c) incurring the memory mapping for the first touch of the MIC process virtual memory

The second and later calls will not.

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

>>V[ x + y * Width ] = T;

The compiler may have insufficient information to know for a fact that V[ x + y * Width ] does not overlap amongst the threads and/or vector lanes.

You may need to rewrite the statements such that any ambiguities are resolved and/or exposed as being safe or not safe for vectorization and threading.
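For illustration only (a sketch, not your code as posted): one way to make that explicit is to write the result into a separate, restrict-qualified output buffer, so the store can no longer alias the V[ i ] reads that the assumed ANTI dependence refers to. The names out and bestD are hypothetical; this by itself does not guarantee the i loop vectorizes, it only removes the aliasing ambiguity between the store and V:

float * restrict out = (float *) _mm_malloc( W * H * sizeof(*out), 64 );

#pragma omp parallel for
for ( int y = 0; y < H; y++ )
{
    for ( int x = 0; x < W; x++ )
    {
        float bestD = aValue;
        float T = 0.0f;
        for ( int i = 0; i < N; i++ )
        {
            float Kx = X[ i ] - x;
            float Ky = Y[ i ] - y;
            float D  = Kx * Kx + Ky * Ky;
            if ( D < bestD ) { bestD = D; T = V[ i ]; }
        }
        out[ x + y * Width ] = T;   // store goes to a different array than the V[ i ] reads
    }
}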

Jim Dempsey

George
Beginner

Charles Congdon

Running the same code using CUDA, it takes 0.25 msec.

I want to try it on Xeon Phi.

On the host it takes 7-8 msec, so I want to know if it can perform better on Xeon Phi.

Now, in CUDA, I am doing something like:

if ( x < blockDim.x * gridDim.x && y < blockDim.y * gridDim.y )
{
    for ( int i = 0; i < N; i++ )
    {
        KX = X[ i ] - x;
        KY = Y[ i ] - y;
        ....

where x and y are thread indices.

So I tried the approach I described in my posts above.

Thank you for the suggestions and tips.

I would appreciate some help with the timing measurement, because I am not sure how to do it right.

 

Jim Dempsey

So, am I not measuring this correctly?

I have all of the above in a function and I am timing the function.

 

Thank you!

 

 

George
Beginner

You may need to rewrite the statements such that any ambiguities are resolved and/or exposed as being safe or not safe for vectorization and threading.

I have declared V as restrict and x,y,T as private.

If you could give me some more specific directions or examples of how to accomplish what you said, I would appreciate it.

Thanks!

jimdempseyatthecove
Honored Contributor III

George,

Did you insert the timing loop?

Or time at least two offloads?

What is the time of the second offload?

The first offload of a host process has significant overhead as listed in #7 above.

Now, if you want to fool yourself into thinking it is running faster you can set the environment variable:

OFFLOAD_INIT=on_start

What this will do is copy the offload version of your process (the .exe that runs on MIC) to the MIC(s) _prior_ to entering main().

You won't run any faster, but your first offload timing will report faster.


You will still have the first-time offload cost of creating the OpenMP thread pool. If you want to easily remove this overhead from your timing:

int main(...)
{
  // Light-weight code to initialize the MIC and start the OpenMP thread pool on the MIC
  #pragma offload target (mic)
  {
    #pragma omp parallel
    omp_get_wtime(); // some fast non-null innocuous statement
  }
  // ... now complete the remainder of your test program _after_ the initialization overhead

Jim Dempsey

George
Beginner

Ok, thank you for the comments. I will check them next week and apply them.

Regarding the vectorization problem, do you have any ideas?

Could you check the CUDA code I am trying to port to Xeon Phi (note that x and y are thread indices, but i is not) and what I have done so far? (I have written all of this info in my posts above.)

Thank you very much!

 

jimdempseyatthecove
Honored Contributor III

I'd have to see the inner loop. Sometimes you have to help the compiler out:

#pragma omp parallel for default( none ) shared( X, Y, V, H, W, N ) private( Kx, Ky, T )
for (int y = 0; y < H; y++ )       // scope y to the for, not to the region
{
  int yStride = y * Width;         // help the compiler to know this is (inner) loop invariant
  for (int x = 0; x < W; x++ )     // scope x to this loop, not to the region
  {
    float initD = aValue;          // re-initialized for every x; its value never leaves this scope
    for (int i = 0; i < N; i++ )   // scope i to this loop, not to the region
    {
      .....Kx,Ky...
      ...X,Y..
    }
    V[ x + yStride ] = T;          // use the loop-invariant yStride
  }
}

You may also want to consider declaring Kx, Ky and T internal to the scope where they are used.

When the compiler can determine that a variable goes out of scope within the loop (it evaporates, so to speak), it knows that the final value generated inside the loop will not be used outside the loop. This helps it when assessing opportunities for optimizing the loop.
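As a short sketch of that advice, using the inner i loop posted earlier (only the declarations move; the logic is unchanged, and yStride is the loop-invariant from the code above):

float initD = aValue;   // reset for every (x,y); its final value never leaves this block
float T = 0.0f;
for ( int i = 0; i < N; i++ )
{
    float Kx = X[ i ] - x;            // Kx, Ky and D exist only inside this iteration,
    float Ky = Y[ i ] - y;            // so nothing computed here is needed after the loop
    float D  = Kx * Kx + Ky * Ky;
    if ( D < initD )
    {
        initD = D;
        T = V[ i ];
    }
}
V[ x + yStride ] = T;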

Jim Dempsey

George
Beginner

Hello and thanks for the help!

I saw that you have Kx, Ky and T as privates, instead of i, x, y, initD, Kx, Ky, T.

So, in contrast to the CUDA version, which uses threads for x and y, we are using threads for the computation inside the i loop.

So, generally, is it better to keep the scope as 'local' as possible?

I will try it next week, thank you.

 

TimP
Honored Contributor III
Omp parallel should be better on the outer loop as you show it. Declaration in the outer loop scope is cleaner, but has the same effect as private.
George
Beginner

Jim Dempsey

I didn't notice any difference using this version. It is still not vectorized...

I used the light-weight code to initialize the MIC; that part is OK.

Thank you!

 

Omp parallel should be better on the outer loop as you show it. Declaration in the outer loop scope is cleaner, but has the same effect as private.

Ok, thank you. I didn't notice any difference though.

 
