Hey everyone,
I'm looking for help with loop vectorization. I'm trying to vectorize and optimize a loop, but I don't understand what I'm doing wrong. The compiler cannot vectorize it because of assumed FLOW and ANTI dependences. I thought I could remove them by changing the code (see "My Attempt"), but this is not working. Can someone explain why?
I cannot post the whole code because it is too large. Note that "i, j" are indices and "np, npa, cX, K, t, exprdt" are constants. Basically, I do some computation on an input vector S; the output is the vector P (which is also an input, since its values are updated).
The original code:
    #pragma omp parallel for num_threads(nThreads) schedule(auto)
    for( int k(i*npa); k<(i+1)*npa; ++k )
    {
        double CT = 0., tmp;
        double s_ = s[j*np+k];
        double st_ = s_ * t;
        CT += c1 + s_ *( c2 + s_ * c3 );
        if(p>3) CT += t * c4 + st_ * ( c5 + s_ * c6 );
        CT *= exprdt;
        tmp = K > s_ ? K-s_: 0. ;
        if( tmp <= 1.0e-8 || tmp <= CT )
            P[j*np+k] = exprdt * P[(j+1)*np+k];
        else
            P[j*np+k] = tmp;
    }
My Attempt:
    #pragma omp parallel num_threads(nThreads)
    {
        int iD = omp_get_thread_num();
        int gd = npa / nThreads;
        int dg = (iD==nThreads-1? npa%nThreads:0);
        double *Aptr, *Bptr, *Cptr;
        Aptr = (double*)malloc((gd+dg)*sizeof(double));
        Bptr = (double*)malloc((gd+dg)*sizeof(double));
        Cptr = (double*)malloc((gd+dg)*sizeof(double));
        memcpy( Cptr, s+j*np + i*npa+iD*gd, (gd+dg)*sizeof(double));
        memcpy( Bptr, P + j *np+np + i * npa + iD*gd , (gd+dg) *sizeof(double));
        for( int l =0; l<gd+dg; ++l ) // 671
        {
            double CT = 0., tmp;
            double s_ = Cptr[l]; // 675
            double st_ = s_ * t;
            CT += c1 + s_ *( c2 + s_ * c3 );
            if(p>3) CT += t * c4 + st_ * ( c5 + s_ * c6 );
            CT *= exprdt;
            tmp = K > s_ ? K-s_: 0. ;
            if( tmp <= 1.0e-8 || tmp <= CT )
                Aptr[l] = exprdt * Bptr[l]; // 690
            else
                Aptr[l] = tmp; // 692
        }
        memcpy( P+j*np + i*npa + iD*gd, Aptr, (dg+gd) *sizeof(double));
        free(Aptr);
        free(Bptr);
        free(Cptr);
    }
And here's the compiler report about vectorization:
(671): (col. 5) remark: loop was not vectorized: existence of vector dependence
(690): (col. 7) remark: vector dependence: assumed FLOW dependence between line 690 and line 675
(675): (col. 17) remark: vector dependence: assumed ANTI dependence between line 675 and line 690
(690): (col. 7) remark: vector dependence: assumed ANTI dependence between line 690 and line 692
(692): (col. 7) remark: vector dependence: assumed FLOW dependence between line 692 and line 690
(690): (col. 7) remark: vector dependence: assumed ANTI dependence between line 690 and line 690
(690): (col. 7) remark: vector dependence: assumed FLOW dependence between line 690 and line 690
(690): (col. 7) remark: vector dependence: assumed FLOW dependence between line 690 and line 690
(690): (col. 7) remark: vector dependence: assumed ANTI dependence between line 690 and line 690
(692): (col. 7) remark: vector dependence: assumed FLOW dependence between line 692 and line 675
(675): (col. 17) remark: vector dependence: assumed ANTI dependence between line 675 and line 692
(692): (col. 7) remark: vector dependence: assumed FLOW dependence between line 692 and line 690
(690): (col. 7) remark: vector dependence: assumed ANTI dependence between line 690 and line 692
(675): (col. 17) remark: vector dependence: assumed ANTI dependence between line 675 and line 692
(692): (col. 7) remark: vector dependence: assumed FLOW dependence between line 692 and line 675
(675): (col. 17) remark: vector dependence: assumed ANTI dependence between line 675 and line 690
(690): (col. 7) remark: vector dependence: assumed FLOW dependence between line 690 and line 675
If using C++ prior to icpc 16.0, max(K-s_,0.) should help.
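As a rough sketch of that substitution, assuming std::max from <algorithm> is what is meant: the helper name payoff below is hypothetical and not part of the posted code. For finite inputs it returns the same value as the ternary K > s_ ? K-s_ : 0.

```cpp
#include <algorithm>

// Hypothetical helper (not in the original post): computes the same
// value as `K > s_ ? K - s_ : 0.` for finite inputs, written with
// std::max instead of a ternary.
inline double payoff(double K, double s_)
{
    return std::max(K - s_, 0.);
}
```

In the loop this would read tmp = std::max(K - s_, 0.); whether it changes the vectorization report depends on the compiler version.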
It looks like you would need either * __restrict Aptr or pragma ivdep or simd. Agreed that your local definitions should not require that, but the compiler diagnostics indicate that aliasing is assumed.
Even then it's a stretch.
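A minimal sketch of the __restrict suggestion, under some assumptions: the function name update_block and its parameter list are hypothetical, the coefficients c1..c6 are passed as an array because their declarations were not posted, and all types are guessed from the snippet. __restrict is a compiler extension (accepted by icpc, g++, clang++ and MSVC), not standard C++.

```cpp
#include <cstddef>

// Hypothetical kernel mirroring the posted inner loop. Each pointer is
// marked __restrict, which promises the compiler that the arrays do not
// overlap, so it no longer has to assume FLOW/ANTI dependences.
void update_block(double* __restrict out,         // role of Aptr
                  const double* __restrict next,  // role of Bptr
                  const double* __restrict s,     // role of Cptr
                  std::size_t n,
                  const double* __restrict c,     // c[0..5] stand in for c1..c6
                  double t, double K, double exprdt, int p)
{
    // Only takes effect when OpenMP (or OpenMP SIMD) support is enabled;
    // otherwise the pragma is ignored.
    #pragma omp simd
    for (std::size_t l = 0; l < n; ++l)
    {
        const double s_ = s[l];
        double CT = c[0] + s_ * (c[1] + s_ * c[2]);
        if (p > 3)
            CT += t * c[3] + s_ * t * (c[4] + s_ * c[5]);
        CT *= exprdt;
        const double tmp = K > s_ ? K - s_ : 0.;
        // Same branch as the original if/else, written as one store.
        out[l] = (tmp <= 1.0e-8 || tmp <= CT) ? exprdt * next[l] : tmp;
    }
}
```

Each thread could call this on its own slice of P and s. The caller must then make sure the slices really do not overlap, because the compiler takes the __restrict promise on trust.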
Yes, with pragma simd it vectorizes, but I wanted to understand why I had the same problem with the local definitions. Anyway, I'm gonna look at the compiler version and try with the max.
Thank you for your response and help.
Guillaume,
Hi,
Tim's response is correct: the compiler assumes aliasing, and using the restrict qualifier should help it vectorize the code.
_Kittur