Software Archive
Read-only legacy content

adding offload pragma, performance drops

George
Beginner

Hello,

I am running an OpenMP code which looks like this:


 

#pragma omp parallel for default( none ) shared( X, Y, V, H, W, N ) private( i, x, y, Kx, Ky, initD, T )
for ( y = 0; y < H; y++ )
{
    for ( x = 0; x < W; x++ )
    {
        initD = aValue;
        for ( i = 0; i < N; i++ )
        {
            ...Kx, Ky...
            ...X, Y...
        }
        V[ x + y * Width ] = T;
    }
}

 

Now I want to run it on the MIC card, so I just add the line:

#pragma offload target(mic) in( X : length(W) ) in( Y : length(H) ) out( V : length(W * H) )

and the performance drops dramatically!

What should I pay attention to? What changes do I have to make?

 

Thank you!

 

jimdempseyatthecove
Honored Contributor III

How long does the loop run on the Host?

What are the Lengths?

Is your for( i = ... ) loop code vectorizable? If not, can you make it vectorize?

Can you fuse the y and x loops? IOW partitioning on H alone may not yield a balanced workload with 244 threads (or whatever number you are using on the MIC).
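For example, a structural sketch of fusing the two loops with OpenMP's collapse clause (this assumes the loop nest from the first post; the inner i-loop body is left elided exactly as in that fragment, and collapse requires the y and x loops to stay perfectly nested):

#pragma omp parallel for collapse(2)   // partition all W*H (x,y) iterations, not just the H rows
for ( int y = 0; y < H; y++ )
{
    for ( int x = 0; x < W; x++ )
    {
        float initD = aValue;
        float T = 0.0f;                // declared here, so implicitly private per iteration
        for ( int i = 0; i < N; i++ )
        {
            ...Kx, Ky...               // inner body as in the original fragment
            ...X, Y...
        }
        V[ x + y * Width ] = T;
    }
}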

Jim Dempsey

George
Beginner

Hello,

On the host it takes around 7-8 msec and on the MIC around 120-130 msec.

The sizes are H = 256, W = 256, N = 80.

I am using 4 threads for now (that gives better timing).

The for-i loop looks like this:

 

for ( i = 0; i < N; i++ )
{
    Kx = X[ i ] - x;
    Ky = Y[ i ] - y;
    D = Kx * Kx + Ky * Ky;

    if ( D < initD )
    {
        initD = D;
        T = V[ i ];
    }
}

I have allocated space for the arrays like:

float *X = (float*) _mm_malloc( NN * sizeof(*X), 32 );

Hmm, I saw the report and it seems that it is not vectorized...

 

loop was not vectorized: vector dependence prevents vectorization

 vector dependence: assumed ANTI dependence

It is between the lines:

T = V[ i ];

and

V[ x + y * Width ] = T;

 

and also between:

D = Kx * Kx + Ky * Ky;

and

if ( D < initD )

 

How can I overcome this?

I even used the 'restrict' keyword for the arrays, and also put #pragma ivdep above the loops.

Thank you!

TimP
Honored Contributor III

While 32-byte alignment is good for current host CPUs, MIC alignment must be to 64-byte boundaries to be of any use.
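As a minimal sketch of what that change might look like, adapting the _mm_malloc call shown above (NN is whatever length is used there; __assume_aligned is the Intel compiler's alignment hint):

float *X = (float *) _mm_malloc( NN * sizeof(*X), 64 );   // 64-byte boundary for MIC
if ( X == NULL ) { /* handle the allocation failure */ }

// The aligned allocation by itself is not enough; the compiler also has to be
// told about it at the loop that uses X:
__assume_aligned( X, 64 );

// ... use X in the loops ...

_mm_free( X );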

If you can't vectorize and use at least 32 threads effectively, running on MIC is not likely to be effective.  I find that offload mode doesn't use as many threads effectively on MIC as native mode.

George
Beginner

Ok, I changed everything from float to double and alignment to 64.

Still, I get no vectorization...

But now it shows me:

outer loop was not auto-vectorized: consider using SIMD directive

in line :

for ( y = 0; y < H; y++ )

but if I use #pragma simd above it, I receive: offload error: process on the device 0 was terminated by signal 8 (SIGFPE)

and:

vector dependence: assumed ANTI dependence between the lines

D = Kx * Kx + Ky * Ky;

and

if ( D < initD )

 

Charles_C_Intel1
Employee

Greetings:

I wish to question why you are trying to offload this code in the first place when you say it runs in "7-8 msec and on mic around 120-130 msec". If this is the only offload your code does, then before it can run on the Xeon Phi you need to create a process, start up 240 threads, allocate memory across all those threads, and so on. That takes time; you might want to measure how long it takes to offload a null statement. If this is a proxy for a real algorithm that runs for a much longer time, then it may make sense to try to offload it. Your code fragment also shows no parallelism, and where serial performance is concerned, Xeon Phi is considerably slower than Xeon.

My *personal* rules of thumb are:

  • If it takes less than 2 seconds on the host, it doesn't make sense to offload it to Xeon Phi.
  • If the code isn't well-optimized on Xeon Phi, it likely doesn't make sense to offload it to Xeon Phi.
  • Be careful to separate the timing of code execution on Xeon Phi from the time it takes to do the data transfer and offload setup, but know what all these separate times are. They can help you decide whether it makes sense to offload to Xeon Phi or not: if you have the code fully threaded and vectorized, you need to make sure the performance benefit isn't overwhelmed by talking to the card. When benchmarking something like a DGEMM, for example, you need to separate the data transfer into offload statements distinct from the compute kernel, and warm up OpenMP ahead of the DGEMM, so that you can accurately measure the true compute time (a sketch of this separation follows this list).
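A rough sketch of that separation, assuming the arrays and sizes from the earlier posts (the device number, transfer lengths, and clause choices are illustrative only; V is transferred inout here because the kernel shown earlier also reads V[ i ]):

// 1) Warm up the MIC-side OpenMP thread pool once, outside any timed region.
#pragma offload target(mic:0)
{
    #pragma omp parallel
    { /* empty region: forces thread-pool creation */ }
}

// 2) Transfer the inputs separately and keep them resident on the card.
#pragma offload_transfer target(mic:0) \
        in( X : length(N) alloc_if(1) free_if(0) ) \
        in( Y : length(N) alloc_if(1) free_if(0) )

// 3) Time only the compute offload, reusing the resident inputs.
double t0 = omp_get_wtime();
#pragma offload target(mic:0) \
        nocopy( X : length(N) alloc_if(0) free_if(0) ) \
        nocopy( Y : length(N) alloc_if(0) free_if(0) ) \
        inout( V : length(W * H) )
{
    // ... the OpenMP kernel from the first post ...
}
printf( "compute + result transfer: %f s\n", omp_get_wtime() - t0 );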

Charles

jimdempseyatthecove
Honored Contributor III
for(int timeRep = 0; timeRep < 5; ++ timeRep)
{
  double t0 = omp_get_wtime();
  #pragma offload target ...
  {
  ...
  }
  printf("%d = %f\n", timeRep, omp_get_wtime() - t0);
}

Your first offload will incur the overhead of:

a) copying the .exe to the MIC
b) starting the MIC OpenMP thread pool
c) incurring the memory mapping for the first touch of the MIC process virtual memory

The second and later calls will not.

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

>>V[ x + y * Width ] = T;

The compiler may have insufficient information to know for a fact that V[ x + y * Width ] does not overlap amongst the threads and/or vector lanes.

You may need to rewrite the statements such that any ambiguities are resolved and/or exposed as being safe or not safe for vectorization and threading.
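For illustration only (a sketch, not your code as posted): one way to make that explicit is to write the result into a separate, restrict-qualified output buffer, so the store can no longer alias the V[ i ] reads that the assumed ANTI dependence refers to. The names out and bestD are hypothetical; this by itself does not guarantee the i loop vectorizes, it only removes the aliasing ambiguity between the store and V:

float * restrict out = (float *) _mm_malloc( W * H * sizeof(*out), 64 );

#pragma omp parallel for
for ( int y = 0; y < H; y++ )
{
    for ( int x = 0; x < W; x++ )
    {
        float bestD = aValue;
        float T = 0.0f;
        for ( int i = 0; i < N; i++ )
        {
            float Kx = X[ i ] - x;
            float Ky = Y[ i ] - y;
            float D  = Kx * Kx + Ky * Ky;
            if ( D < bestD ) { bestD = D; T = V[ i ]; }
        }
        out[ x + y * Width ] = T;   // store goes to a different array than the V[ i ] reads
    }
}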

Jim Dempsey

George
Beginner

Charles Congdon

Running the same code using CUDA, it takes 0.25 msec.

I want to try it on Xeon Phi.

On the host it takes 7-8 msec, so I want to know if it can perform better on Xeon Phi.

Now, in CUDA, I am doing something like:

if ( x < blockDim.x * gridDim.x && y < blockDim.y * gridDim.y )
{
    for ( int i = 0; i < N; i++ )
    {
        KX = X[ i ] - x;
        KY = Y[ i ] - y;
        ....

where x and y are thread indices.

So I tried the approach I described in my posts above.

Thank you for the suggestions and tips.

I would appreciate some help with the timing measurement, because I am not sure how to do it right.

 

Jim Dempsey

So, am I not measuring this correctly?

I have all of the above in a function and I am timing the function.

 

Thank you!

 

 

George
Beginner

You may need to rewrite the statements such that any ambiguities are resolved and/or exposed as being safe or not safe for vectorization and threading.

I have declared V as restrict and x,y,T as private.

If you could give me some more specific directions or examples of how to accomplish what you said, I would appreciate it.

Thanks!

jimdempseyatthecove
Honored Contributor III

George,

Did you insert the timing loop?

Or time at least two offloads?

What is the time of the second offload?

The first offload of a host process has significant overhead as listed in #7 above.

Now, if you want to fool yourself into thinking it is running faster you can set the environment variable:

OFFLOAD_INIT=on_start

What this will do is copy the offload version of your process (the .exe that runs on MIC) to the MIC(s) _prior_ to entering main().

You won't run any faster, but your first offload timing will report faster.


You will still have the first-time offload cost of creating the OpenMP thread pool. If you want to easily remove this overhead from your timing:

int main(...)
{
  // Light-weight code to initialize the MIC and start the OpenMP thread pool on the MIC
  #pragma offload target (mic)
  {
    #pragma omp parallel
    omp_get_wtime(); // some fast non-null innocuous statement
  }
  // ... now complete the remainder of your test program _after_ the initialization overhead

Jim Dempsey

George
Beginner

Ok, thank you for the comments. I will check them next week and apply them.

Regarding the vectorization problem, do you have any ideas?

Could you check the CUDA code I am trying to port to Xeon Phi (note that x and y are thread indices, but i is not) and what I have done so far? (I have written all of this info in my posts above.)

Thank you very much!

 

jimdempseyatthecove
Honored Contributor III

I'd have to see the inner loop. Sometimes you have to help the compiler out:

#pragma omp parallel for default( none ) shared( X, Y, V, H, W, N ) private( Kx, Ky, T )
for (int y = 0; y < H; y++ )       // scope y to the for, not to the region
{
  int yStride = y * Width;         // help the compiler to know this is (inner) loop invariant
  for (int x = 0; x < W; x++ )     // scope x to this loop, not to the region
  {
    float initD = aValue;          // re-initialized for every x; its value never leaves this scope
    for (int i = 0; i < N; i++ )   // scope i to this loop, not to the region
    {
      .....Kx,Ky...
      ...X,Y..
    }
    V[ x + yStride ] = T;          // use the loop-invariant yStride
  }
}

You may also want to consider declaring Kx, Ky and T internal to the scope where they are used.

When the compiler can determine that a variable goes out of scope within the loop (it evaporates, so to speak), it knows that the final value generated inside the loop will not be used outside the loop. This helps it when assessing opportunities for optimizing the loop.
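As a short sketch of that advice, using the inner i loop posted earlier (only the declarations move; the logic is unchanged, and yStride is the loop-invariant from the code above):

float initD = aValue;   // reset for every (x,y); its final value never leaves this block
float T = 0.0f;
for ( int i = 0; i < N; i++ )
{
    float Kx = X[ i ] - x;            // Kx, Ky and D exist only inside this iteration,
    float Ky = Y[ i ] - y;            // so nothing computed here is needed after the loop
    float D  = Kx * Kx + Ky * Ky;
    if ( D < initD )
    {
        initD = D;
        T = V[ i ];
    }
}
V[ x + yStride ] = T;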

Jim Dempsey

George
Beginner

Hello and thanks for the help!

I saw that you have Kx, Ky and T as privates, instead of i, x, y, initD, Kx, Ky, T.

So, in contrast to the CUDA version, which uses threads for x and y, we are using threads for the computation inside the i loop.

So, generally, is it better to keep the scope as 'local' as possible?

I will try it next week, thank you.

 

TimP
Honored Contributor III
Omp parallel should be better on the outer loop as you show it. Declaration in the outer loop scope is cleaner, but has the same effect as private.
George
Beginner

Jim Dempsey

I didn't notice any difference using this version. It is still not vectorized...

I used the light-weight code to initialize the MIC; that part is OK.

Thank you!

 

Omp parallel should be better on the outer loop as you show it. Declaration in the outer loop scope is cleaner, but has the same effect as private.

Ok, thank you. I didn't notice any difference though.

 
