Dear All,
I have the following code:
for(k = 0; k < MAX_ITER; k++){
#pragma omp parallel for private(i,j,sum)
    for(i = 0; i < N; i++){
        sum = 0.0;
        sum = sum - A[i*N+i] * X[i];
        for(j = 0; j < N; j++)
            sum += A[i*N+j] * X[j];
        new_x[i] = (B[i] - sum) / A[i*N+i];
    }
#pragma omp parallel for private(i)
    for(i = 0; i < N; i++)
        X[i] = new_x[i];
}
I optimized it, to remove the overhead of the pragmas and the implicit barriers, into the following code:
#pragma omp parallel shared(A,B,X,N,T,kk) private(k,i,ii,j,sum,Xnew_sub)
{
    Xnew_sub = (double*)malloc(kk * sizeof(double));
    int tid = omp_get_thread_num();
    ii = tid * kk;
    for(k = 0; k < MAX_ITER; k++)
    {
        for(i = 0; i < kk; i++){
            sum = 0.0;
            sum = sum - (A[(i+ii)*N+i+ii] * X[i+ii]);
            for(j = 0; j < N; j++){
                sum += A[(i+ii)*N+j] * X[j];
            }
            Xnew_sub[i] = (B[i+ii] - sum) / A[(i+ii)*N+i+ii];
        }
#pragma omp barrier
        for(i = 0; i < kk; i++){
            X[i+ii] = Xnew_sub[i];
        }
    } // end of the iteration MAX_ITER
}
I ran it on MIC, but the problem is there is no improvement; on the contrary, the second code takes more execution time. I use the Intel Compiler version 13 Update 1 with the default optimization options. How can I find out what optimizations the compiler applied implicitly to the code?
Your "optimsed" code looks broken (and overly complicated). You are allocating Xnew_sub in each thread and there's a race condition on the assignment, and you have also made "sum" shared (another race). You are then implementing #pragma omp for explicitly, which is unnecessary.
I think what you are striving for is something like this:
#pragma omp parallel shared(A,B,X,new_x)
{
    int k;
    for(k = 0; k < MAX_ITER; k++)
    {
        int i;
#pragma omp for
        for(i = 0; i < N; i++)
        {
            int j;
            double sum = -A[i*N+i] * X[i];
            for(j = 0; j < N; j++)
                sum += A[i*N+j] * X[j];
            new_x[i] = (B[i] - sum) / A[i*N+i];
        }
#pragma omp for
        for(i = 0; i < N; i++)
            X[i] = new_x[i];
    }
}
As you can see, this is much more like your original code. I don't believe that you can eliminate either of the two implicit barriers at the end of the for loops (because you can't update X until all of the previous loop is complete, and, similarly, you can't start reading X in the next k iteration until it has been fully updated by the previous one). If you believe you can, then you still don't need to do all the scheduling work yourself; just add a "nowait" clause to the omp for.
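For illustration, the nowait clause mentioned above would go on the loop like this (a sketch only; as argued above, dropping this barrier is not actually safe for the Jacobi update):
#pragma omp for nowait   // suppresses the implicit barrier at the end of this loop
for(i = 0; i < N; i++)
    X[i] = new_x[i];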
Or this:
double* X = NULL;
double* new_x = NULL;
... // allocate X, new_x cache aligned ...
#pragma omp parallel shared(A,B,X,new_x)
{
    int k;
    for(k = 0; k < MAX_ITER; k++)
    {
        int i;
#pragma omp for
        for(i = 0; i < N; i++)
        {
            int j;
            double sum = -A[i*N+i] * X[i];
            for(j = 0; j < N; j++)
                sum += A[i*N+j] * X[j];
            new_x[i] = (B[i] - sum) / A[i*N+i];
        }
#pragma omp master
        { // just swap pointers
            double* temp = X;
            X = new_x;
            new_x = temp;
        }
    }
}
Jim Dempsey
Thank you, Mr. Jim. But what is the key optimization in your implementation? My question is why my implementation doesn't scale better than the naive implementation (the first code). I need to understand what the problem with my code is.
For Mr. James:
My implementation is correct in both codes; I tested it by comparing against the sequential implementation. But from my understanding of OpenMP, when I eliminate the overhead of the pragmas and the barrier I should get a better speedup, and the code should scale well as long as there are no problems in the code that affect performance. I need to analyze my code in this situation.
To rule out the possibility that the compiler applied an optimization, I compiled with the -O0 option and ran the code again on Xeon Phi. Unfortunately the same problem arises: the first code is still better in terms of execution time.
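(For reference, a native MIC build with optimization disabled would be invoked roughly like this; the exact command line and source file name are my assumption:)
icc -mmic -openmp -O0 jacobi.c -o jacobi.mic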
Before you measure performance you need to have correct code, otherwise the performance measurements are meaningless. Your "optimized" version is incorrect and has the potential to produce wrong answers (the shared sum is certainly a big issue), so its performance is irrelevant to any real discussion. At least one of its problems (allocating the large array once per thread) is also costly, so could represent the performance issue you're seeing.
Whether or not replacing a fork/join with a barrier improves performance is not clear. Certainly, replacing a fork/join with two or more barriers is likely to be a bad trade-off.
I am not recommending any specific tuning or optimization, merely trying to get you to a correct code that you can measure.
Thank you, Mr. James. But if you look at the code, sum is declared private in the pragma! In addition, in the first code the #pragma omp parallel for sits inside the iteration loop, so it is re-entered on every iteration, and each of its for loops ends with an implicit barrier. Both of these are removed in the second implementation. So the performance changes I am talking about are: first, put the iteration loop inside a single #pragma omp parallel region; second, remove the second barrier and keep only one. I think this is enough to make a difference between the two implementations.
The point I would like to hear from you about is "declaring a large variable per thread": I did this so that each thread writes into its private buffer until that data is written back to the shared variable. That is what allows removing the second barrier from the code. I hope to get your comments about this.
jimdempseyatthecove wrote:
Or this:
double* X = NULL;
double* new_x = NULL;
... // allocate X, new_x cache aligned ...
#pragma omp parallel shared(A,B,X,new_x)
{
    int k;
    for(k = 0; k < MAX_ITER; k++)
    {
        int i;
#pragma omp for
        for(i = 0; i < N; i++)
        {
            int j;
            double sum = -A[i*N+i] * X[i];
            for(j = 0; j < N; j++)
                sum += A[i*N+j] * X[j];
            new_x[i] = (B[i] - sum) / A[i*N+i];
        }
#pragma omp master
        { // just swap pointers
            double* temp = X;
            X = new_x;
            new_x = temp;
        }
    }
}
Jim Dempsey
I think that you should use #pragma omp single for swapping the pointers. The pragma "omp master" has no implied barrier, so otherwise it may happen that all threads (except the master) still have the old pointer (for some iterations).
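For clarity, the swap with single would look like this (a sketch; unlike master, single has an implied barrier at its close):
#pragma omp single
{   // one thread swaps; the implied barrier at the end of single
    // makes the swapped pointers visible to every thread
    double* temp = X;
    X = new_x;
    new_x = temp;
}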
sum is private, and it needs to be private. Each thread in the #pragma omp for gets a different (unique, and together all-encompassing) sub-range of 0:N-1.
At the close } of the #pragma omp for is an implied barrier.
You are correct about using Single as opposed to Master.
Using #pragma omp single will introduce an extra barrier.
Consider this:
double* X = NULL;
double* new_x = NULL;
... // allocate X, new_x cache aligned ...
#pragma omp parallel shared(A,B,X,new_x)
{
    double* X_ = X;         // thread-private copies of the shared pointers
    double* new_x_ = new_x;
    int k;
    for(k = 0; k < MAX_ITER; k++)
    {
        int i;
#pragma omp for
        for(i = 0; i < N; i++)
        {
            int j;
            double sum = -A[i*N+i] * X_[i];
            for(j = 0; j < N; j++)
                sum += A[i*N+j] * X_[j];
            new_x_[i] = (B[i] - sum) / A[i*N+i];
        }
        // just swap pointers (each thread swaps its own private copies;
        // the implied barrier at the close of the omp for keeps them in step)
        double* temp = X_;
        X_ = new_x_;
        new_x_ = temp;
    }
} // end parallel
if(MAX_ITER & 1)
{   // after an odd number of iterations the shared pointers point the
    // "wrong" way round, so swap them once more
    double* temp = X;
    X = new_x;
    new_x = temp;
}
Jim Dempsey
Thanks, Jim Dempsey. I tried your code; it works well, but it is still not faster than the first one. I think that is because the synchronization overhead is on the order of microseconds, so we do not notice any improvement in our implementation. But your suggestion has one nice aspect, which is the swapping of the variables: it increases data reuse, although I did not notice the effect. How can we ensure that the data will be read from the cache and not from memory?
If your code for new_x_ = ... is not using streaming stores, then the data written into new_x_ will remain in cache, provided that each thread's (plus its HT sibling threads') span of N does not consume more than the cache of interest. Of interest are A, B, X_, new_x_ (4 arrays).
L1 = 32KB / (sizeof(float) * 4 arrays) = 2048 elements; at 2 threads/core 1024, at 3 threads/core 682, at 4 threads/core 512.
You will have to leave some margin for slack.
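(As a quick check, here is that arithmetic in C; note that since the arrays in the code are double rather than float, the budgets halve:)
#include <stdio.h>

int main(void)
{
    const size_t l1_bytes = 32 * 1024;   // L1 data cache per core
    const int    n_arrays = 4;           // A row, B, X_, new_x_
    int tpc;
    for(tpc = 1; tpc <= 4; tpc++)        // threads per core
        printf("%d thread(s)/core: %zu floats, %zu doubles\n", tpc,
               l1_bytes / (sizeof(float) * n_arrays) / tpc,
               l1_bytes / (sizeof(double) * n_arrays) / tpc);
    return 0;
}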
Note, if A and B will not be used shortly thereafter, then consider use of the _clevict (cache line eviction intrinsic) on A and/or B as the case may be, thus permitting more of new_x_ to remain in cache.
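A minimal sketch of that eviction on KNC, assuming the intrinsic is _mm_clevict with the _MM_HINT_T0/_MM_HINT_T1 levels from the MIC compiler headers (verify the exact names against your compiler's documentation):
#include <immintrin.h>

// Hint the hardware to evict one row of A from L1 and L2 after it has
// been consumed, leaving more room for new_x_ (sketch only).
static void evict_row(const double* row, int n)
{
    int j;
    for(j = 0; j < n; j += 8)               // 8 doubles per 64-byte cache line
    {
        _mm_clevict(&row[j], _MM_HINT_T0);  // evict from L1
        _mm_clevict(&row[j], _MM_HINT_T1);  // evict from L2
    }
}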
The timing loop above (MAX_ITER) will not be present in your application. I suggest you NOT concentrate on maximizing the performance of the above loop other than for academic purposes. For practical use (as in the eventual application), you must consider how this loop participates in the problem as a whole, when combined with the activities of the other threads within the core as well as the other threads within the processor.
As you have noticed, time entering and exiting a parallel region must be considered.
It might be of interest for you to follow:
https://software.intel.com/en-us/blogs/2014/02/26/the-chronicles-of-phi-part-1-the-hyper-thread-phalanx
https://software.intel.com/en-us/blogs/2014/03/05/the-chronicles-of-phi-part-2-hyper-thread-phalanx-tiled-ht1
https://software.intel.com/en-us/blogs/2014/03/12/the-chronicles-of-phi-part-3-hyper-thread-phalanx-tiled-ht1-continued
https://software.intel.com/en-us/blogs/2014/03/24/the-chronicles-of-phi-part-4-hyper-thread-phalanx-tiled-ht2
https://software.intel.com/en-us/blogs/2014/02/22/the-chronicles-of-phi-part-5-plesiochronous-phasing-barrier-tiled-ht3
These are blogs I posted recently in the IDZ blogs section. It would be beneficial to read all 5 parts in order; the first one is not so much instruction as background information.
The series illustrates methods of knocking down barrier latencies.
Please note, while the techniques were written for Xeon Phi, they also work well on the host processor(s).
Jim Dempsey