
openmp compiler optimization

Allam_F_
Beginner

Dear All,

I have the following code:

 for(k = 0; k < MAX_ITER; k++){
        #pragma omp parallel for private(i,j,sum)
                for(i=0; i<N; i++){
                        sum = 0.0;
                        sum = sum - A[i*N+i] * X[i];
                        for(j=0; j<N; j++)
                                sum += A[i*N+j] * X[j];
                        new_x[i] = (B[i] - sum)/A[i*N+i];
                }
        #pragma omp parallel for private(i)
                for(i=0; i < N; i++)
                        X[i] = new_x[i];
 }

I optimized it to remove the overhead of the pragma and the implicit barriers, producing the following code:

#pragma omp parallel shared(A,B,X,N,T,kk) private(k,i,ii,j,sum,Xnew_sub)
{
        Xnew_sub = (double*)malloc(kk*sizeof(double));
        int tid = omp_get_thread_num();
        ii = tid*kk;
        for(k=0; k<MAX_ITER; k++)
        {
                for(i=0; i<kk; i++){
                        sum = 0.0;
                        sum = sum - (A[(i+ii)*N+i+ii] * X[i+ii]);

                        for(j=0; j<N; j++){
                                sum += A[(i+ii)*N+j] * X[j];
                        }

                        Xnew_sub[i] = (B[i+ii] - sum)/A[(i+ii)*N+i+ii];
                }
        #pragma omp barrier

                for(i=0; i<kk; i++){
                        X[i+ii] = Xnew_sub[i];
                }
        } //end of the iteration MAX_ITER
} //end of the parallel region

I ran it on the MIC, but there is no improvement; in fact, the second code has a longer execution time. I am using Intel Compiler version 13 Update 1 with the default optimization options. How can I find out what optimizations the compiler applied implicitly to the code?

 

James_C_Intel2
Employee

Your "optimsed" code looks broken (and overly complicated). You are allocating Xnew_sub in each thread and there's a race condition on the assignment, and you have also made "sum" shared (another race). You are then implementing #pragma omp for explicitly, which is unnecessary.

I think what you are striving for is something like this:

#pragma omp parallel shared(A,B,X,new_x)
{
    int k;

    for(k = 0; k < MAX_ITER; k++)
    {
        int i;

#pragma omp for
        for (i=0; i<N; i++)
        {
            int j;
            double sum = -A[i*N+i] * X[i];
            for(j=0; j<N; j++)
                sum += A[i*N+j] * X[j];
            new_x[i] = (B[i] - sum)/A[i*N+i];
        }
#pragma omp for
        for (i=0; i<N; i++)
            X[i] = new_x[i];
    }
}

As you can see, this is much more like your original code. I don't believe that you can eliminate either of the two implicit barriers at the end of the for loops (because you can't update X until all of the previous loop is complete, and, similarly, you can't start reading X in the next k iteration until it has been fully updated by the previous one). If you believe you can, then you still don't need to do all the scheduling work yourself; just add a "nowait" clause to the omp for.
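
For reference, a minimal sketch of the "nowait" clause, shown on two genuinely independent loops rather than on the Jacobi code above (where, as noted, the barriers cannot safely be dropped). The function and array names are only illustrative:

// Hypothetical example: the two loops touch disjoint arrays and neither
// reads the other's output, so the implied barrier at the end of the
// first loop can safely be removed with "nowait".
void scale_two_arrays(double *a, double *b, int n)
{
#pragma omp parallel
    {
        int i;
#pragma omp for nowait     // no barrier here: b[] does not depend on a[]
        for (i = 0; i < n; i++)
            a[i] *= 2.0;
#pragma omp for            // the parallel region still ends with a barrier
        for (i = 0; i < n; i++)
            b[i] += 1.0;
    }
}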

jimdempseyatthecove
Honored Contributor III

Or this:

double* X = NULL;
double* new_x = NULL;
...
// allocate X, new_x cache aligned
...
#pragma omp parallel shared(A,B,X,new_x)
{
    int k;

    for(k = 0; k < MAX_ITER; k++)
    {
        int i;

#pragma omp for
        for (i=0; i<N; i++)
        {
            int j;
            double sum = -A[i*N+i] * X[i];
            for(j=0; j<N; j++)
                sum += A[i*N+j] * X[j];
            new_x[i] = (B[i] - sum)/A[i*N+i];
        }

#pragma omp master
        {
            // just swap pointers
            double* temp = X;
            X = new_x;
            new_x = temp;
        }
    }
}

Jim Dempsey

Allam_F_
Beginner

Thank you, Mr. Jim. But what is the key optimization in your implementation? My question is: why doesn't my implementation scale better than the naive implementation (the first code)? I need to understand what the problem with my code is.

Allam_F_
Beginner

For Mr. James:

My implementation is correct in both codes; I tested it by comparing against the sequential implementation. But from my understanding of OpenMP, when I eliminate the overhead of the pragmas and the barriers I should get a better speedup, and the code should scale well, provided there are no problems in the code that hurt performance. I need to analyze my code in this situation.

To rule out the possibility that the compiler did an optimization, I compiled with the -O0 option and ran the code again on the Xeon Phi. Unfortunately, the same problem arises: the first code is still better in terms of execution time.

James_C_Intel2
Employee

Before you measure performance you need correct code; otherwise the performance measurements are meaningless. Your "optimized" version is incorrect and has the potential to produce wrong answers (the shared sum is certainly a big issue), so its performance is irrelevant to any real discussion. At least one of its problems (allocating the large array once per thread) is also costly, so it could be the performance issue you're seeing.

Whether or not replacing a fork/join with a barrier improves performance is not clear. Certainly, replacing a fork/join with two or more barriers is likely to be a bad trade-off.

I am not recommending any specific tuning or optimization, merely trying to get you to correct code that you can measure.

Allam_F_
Beginner

Thank you, Mr. James. But if you look at the code, sum is declared private in the pragma! In addition, in the first code the #pragma omp parallel for sits inside the iteration loop, so it is re-entered on every iteration. Also, the first code has an implicit barrier at the end of each for; this is removed in the second implementation. So the performance changes I am talking about are: first, putting the iteration loop inside a single #pragma omp parallel region; second, removing the second barrier and keeping only one barrier. I think this is enough to make a difference between the two implementations.

The point I would like your view on is "declaring a large variable per thread". I did this so that each thread writes into its private array until that data is copied back to the shared variable; this is what lets me remove the second barrier from the code. I hope to get your comments about this.

Patrick_S_
New Contributor I

I think that you should use #pragma omp single for swapping the pointers. The pragma "omp master" has no implied barrier, so it may happen that all threads (except the master) still see the old pointer (for some iterations).
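
For reference, a minimal sketch of that variant under the same assumptions as Jim's code (X and new_x are shared pointers to the two solution buffers). The implied barrier at the end of the single construct is what guarantees every thread sees the swapped pointers before the next iteration reads X:

#pragma omp parallel shared(A, B, X, new_x)
{
    int k;
    for (k = 0; k < MAX_ITER; k++)
    {
        int i;
#pragma omp for
        for (i = 0; i < N; i++)
        {
            int j;
            double sum = -A[i*N+i] * X[i];
            for (j = 0; j < N; j++)
                sum += A[i*N+j] * X[j];
            new_x[i] = (B[i] - sum)/A[i*N+i];
        }

#pragma omp single
        {
            // one thread swaps; the implied barrier at the end of
            // "single" publishes the new pointers to all threads
            // before the next k iteration starts reading X
            double* temp = X;
            X = new_x;
            new_x = temp;
        }
    }
}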

jimdempseyatthecove
Honored Contributor III

sum is private, and it needs to be private. Each thread in the #pragma omp for gets a different (unique, and together all-encompassing) sub-range of 0:N-1.

At the closing } of the #pragma omp for there is an implied barrier.

You are correct about using single as opposed to master; however, #pragma omp single will introduce an extra barrier.

Consider this:

double* X = NULL;
double* new_x = NULL;
...
// allocate X, new_x cache aligned
...
#pragma omp parallel shared(A,B,X,new_x)
{
    double* X_ = X;
    double* new_x_ = new_x;
    int k;

    for(k = 0; k < MAX_ITER; k++)
    {
        int i;

#pragma omp for
        for (i=0; i<N; i++)
        {
            int j;
            double sum = -A[i*N+i] * X_[i];
            for(j=0; j<N; j++)
                sum += A[i*N+j] * X_[j];
            new_x_[i] = (B[i] - sum)/A[i*N+i];
        }

        // just swap the thread-local pointer copies
        double* temp = X_;
        X_ = new_x_;
        new_x_ = temp;
    }
} // end parallel
if(MAX_ITER & 1)
{
    double* temp = X;
    X = new_x;
    new_x = temp;
}


Jim Dempsey

Allam_F_
Beginner

Thanks, Jim Dempsey. I tried your code; it works well, but we still don't have faster code than the first one. I think this is because the synchronization overhead is on the order of microseconds, so we will not notice any improvement in our implementation. However, your suggestion has one nice thing, which is the pointer swap; that should increase data reuse, but I don't notice it. How can we ensure that the data will be read from the cache and not from memory?

jimdempseyatthecove
Honored Contributor III

If your code for new_x_ = ... is not using streaming stores, then the data written into new_x_ will remain in cache, provided that each thread's (plus its HT sibling threads') span of N does not consume more than the cache of interest. Of interest are A, B, X_, and new_x_ (4 arrays).

L1 = 32KB / (sizeof(float)*4) = 2048
at 2 threads/core 1024, 3 threads/core 682, 4 threads/core 512

You will have to leave some margin for slack.
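
As a rough illustration of that capacity arithmetic (a sketch only; it uses the 4-byte element size from the figures above, while the arrays in this thread are double, which would halve the counts):

#include <stdio.h>

// Elements per thread that fit in a 32KB L1 when four arrays
// (a row of A, B, X_, new_x_) compete for it.
int main(void)
{
    const int l1_bytes   = 32 * 1024;  // L1 data cache per core
    const int num_arrays = 4;          // A, B, X_, new_x_
    const int elem_bytes = 4;          // matches the sizeof(float) figure above
    int threads_per_core;

    for (threads_per_core = 1; threads_per_core <= 4; threads_per_core++) {
        int elems = l1_bytes / (elem_bytes * num_arrays) / threads_per_core;
        printf("%d thread(s)/core: ~%d elements per thread\n",
               threads_per_core, elems);
    }
    return 0;
}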

Note, if A and B will not be used shortly thereafter, then consider use of the _clevict (cache line eviction intrinsic) on A and/or B as the case may be, thus permitting more of new_x_ to remain in cache.

The timing loop above (MAX_ITER) will not be present in your application. I suggest you NOT concentrate on maximizing the performance of the above loop for other than academic purposes. For practical use (in the eventual application), you must consider how this loop participates in the problem as a whole when combined with the activities of the other threads within the core, as well as the other threads within the processor.

As you have noticed, time entering and exiting a parallel region must be considered.

It might be of interest for you to follow:

https://software.intel.com/en-us/blogs/2014/02/26/the-chronicles-of-phi-part-1-the-hyper-thread-phalanx
https://software.intel.com/en-us/blogs/2014/03/05/the-chronicles-of-phi-part-2-hyper-thread-phalanx-tiled-ht1
https://software.intel.com/en-us/blogs/2014/03/12/the-chronicles-of-phi-part-3-hyper-thread-phalanx-tiled-ht1-continued
https://software.intel.com/en-us/blogs/2014/03/24/the-chronicles-of-phi-part-4-hyper-thread-phalanx-tiled-ht2
https://software.intel.com/en-us/blogs/2014/02/22/the-chronicles-of-phi-part-5-plesiochronous-phasing-barrier-tiled-ht3

These are blogs I posted recently in the IDZ blogs section. It would be beneficial to read all 5 parts in order; the first one is not so much instruction as background information.

The series illustrates methods of knocking down barrier latencies.

Please note, while the techniques were written for Xeon Phi, they also work well on the host processor(s).

Jim Dempsey
