Software Archive
Read-only legacy content

Offload error: signal 11 (SIGSEGV) for simple code

Matthew_S_1
Beginner
1,033 Views

This is my first post on the forum, so please be gentle. I've been experimenting with a Phi card for some parallel CFD solvers. Preliminary results look fine; however, I've run into an interesting problem when using #pragma ivdep on a simple vector calculation. Here is a stripped-down version of the code that produces the error:

#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>     
#include <offload.h>
#include <omp.h>

int main() {

    double *p0, *p1;
    double *u0, *u1;
    size_t alignment = 64;
    int error;
    int NC = 1000;
    int i;

    // Allocate memory
    error = posix_memalign((void**)&p0, alignment, NC*sizeof(double));
    error = posix_memalign((void**)&p1, alignment, NC*sizeof(double));
    error = posix_memalign((void**)&u0, alignment, NC*sizeof(double));
    error = posix_memalign((void**)&u1, alignment, NC*sizeof(double));
    
    // Initialize data
    for (i = 0; i < NC; i++) {
        p0[i] = 1.0;  u0[i] = p0[i];
        p1[i] = 2.0;  u1[i] = p0[i]*p1[i];
    }

    // Offload to Phi
    #pragma offload target(mic:0) inout(p0,p1,u0,u1:length(NC))
    {
        omp_set_num_threads(10);  
        #pragma omp parallel shared(u0, u1, p0, p1)
        {
            #pragma omp for
            #pragma vector aligned
            #pragma ivdep
            for (i = 0; i < NC; i++) {
                p0[i] = u0[i];
                p1[i] = u1[i]/u0[i];
            }
        } // End of OMP parallel section
    } // End of Phi offload

    // Free memory
    free(p0); free(p1);
    free(u0); free(u1);
    return 0;

}

The build instruction (taken from the makefile) is: icc -opt-report-phase=offload -offload-attribute-target=mic -openmp -O3 crash.cpp -o test.run  

When this is run, I see a message reporting an offload error - process terminated by signal 11 (SIGSEGV). If I remove "#pragma ivdep", the code runs without any trouble. I've also written the same kernel using intrinsic functions with OpenMP, and that version runs fine too - so I know this section of code can be vectorized without any drama, but I'm having trouble doing it without resorting to intrinsic functions.
I can provide the output of the offload report if anyone is interested. I suspect the Phi is trying to access data on the host, which is causing the seg fault. If anyone has any ideas on this, it would be fantastic. If I'm posting in the wrong spot, please let me know.

Cheers!

0 Kudos
16 Replies
Kevin_D_Intel
Employee
1,033 Views

This appears to be a case of unaligned access by the threads due to the distribution of the iterations. I'm leaning on the real experts (you know who you are :-) ) in this forum on this very topic to weigh in, help explain this in better detail, and share ideas/thoughts about how best to address it beyond what I showed below.

There is a very helpful article titled Data Alignment to Assist Vectorization where I found the suggestion below. Refer to the section on Vector Alignment and Parallelization. Your program is a perfect fit for the seg-fault that can occur when asserting #pragma vector aligned without ensuring alignment for each thread.

The code “as is” runs successfully with offload when the number of threads is set to 1. It also runs after adding code, as shown below, to restrict the iteration count of the parallel loop to a multiple that keeps each thread's chunk aligned, as per the example in the article.

Here's the compile/execution with the added code below:

$ icc -V
Intel(R) C Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.0.090 Build 20140723
Copyright (C) 1985-2014 Intel Corporation.  All rights reserved.

$ icc -openmp -O3 u532092.c
$ ./a.out
num_threads = 10, NC = 1000, N1 = 960, num-iters in remainder serial loop = 40, parallel-pct = 96.000000

Here's the modified code:

#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>
#include <offload.h>
#include <omp.h>

int main() {
    double *p0, *p1;
    double *u0, *u1;

    size_t alignment = 64;
    int error;
    int NC = 1000;
    int i;
    int N1,num_threads;

    // Allocate memory
    error = posix_memalign((void**)&p0, alignment, NC*sizeof(double));
    error = posix_memalign((void**)&p1, alignment, NC*sizeof(double));
    error = posix_memalign((void**)&u0, alignment, NC*sizeof(double));
    error = posix_memalign((void**)&u1, alignment, NC*sizeof(double));

    // Initialize data
    for (i = 0; i < NC; i++) {
        p0[i] = 1.0;  u0[i] = p0[i];
        p1[i] = 2.0;  u1[i] = p0[i]*p1[i];
    }

    // Offload to Phi
    #pragma offload target(mic:0) inout(p0,p1,u0,u1:length(NC))
    {
        omp_set_num_threads(10);

        #pragma omp parallel
        {
          #pragma omp master
          {
              num_threads = omp_get_num_threads();
              // Assuming omp static scheduling, carefully limit the loop-size to N1 instead of NC
              N1 = ((NC / num_threads)/8) * num_threads * 8;
              printf("num_threads = %d, NC = %d, N1 = %d, num-iters in remainder serial loop = %d, parallel-pct = %f\n", num_threads, NC, N1, NC-N1, (float)N1*100.0/NC);
          }
        }

        #pragma omp parallel shared(u0, u1, p0, p1)
        {
            #pragma omp for
            #pragma vector aligned
            #pragma ivdep

            for (i = 0; i < N1; i++) {
                p0[i] = u0[i];
                p1[i] = u1[i]/u0[i];
            }
        } // End of OMP parallel section

        // Serial loop to process the last NC-N1 iterations
        for (i = N1; i < NC; i++) {
             p0[i] = u0[i];
             p1[i] = u1[i]/u0[i];
        }
    } // End of Phi offload

    // Free memory
    free(p0); free(p1);
    free(u0); free(u1);
    return 0;
}

0 Kudos
Kevin_D_Intel
Employee
1,033 Views

Here are two other related discussions in the forum:

Xeon Phi Segmentation Fault Simple Offload (https://software.intel.com/en-us/forums/topic/509037)
Effect of using array alignment (https://software.intel.com/en-us/forums/topic/507547)

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,033 Views

If you have OpenMP 4.0 or later use

#pragma omp for simd

Jim Dempsey

0 Kudos
Kevin_D_Intel
Employee
1,034 Views

@Jim - Is this the correct transformation for his offload code using omp simd?  (Updated 9/30/14: I corrected the example below to use both vectorization and parallelization.)

    // Offload to Phi
    #pragma offload target(mic:0) inout(p0,p1,u0,u1:length(NC))
    {
        #pragma omp parallel for simd
            for (i = 0; i < NC; i++) {
                p0[i] = u0[i];
                p1[i] = u1[i]/u0[i];
            }
    } // End of Phi offload

In addition to Jim's solution, another solution shared with me, cleaner and simpler than the horrible hand-scheduling method I showed earlier, "is to use the OpenMP chunk_size to ensure that however the loop is parallelized the smallest set of iterations executed by a single thread maintains alignment."

The offload code for this solution is:

    // Offload to Phi
    #pragma offload target(mic:0) inout(p0,p1,u0,u1:length(NC))
    {
        #pragma omp parallel for schedule(static,alignment/sizeof(double)), shared(u0, u1, p0, p1)
        #pragma vector aligned
        #pragma ivdep
            for (i = 0; i < NC; i++) {
                p0[i] = u0[i];
                p1[i] = u1[i]/u0[i];
            }
    } // End of Phi offload

As further explained, "by explicitly stating schedule(static, 8) you ensure that the parallel chunks will maintain the eight doubles alignment (and all except the last chunk in each thread will be multiple of eight long)."

This and the earlier hand scheduling method "reduces the available parallelism by a factor of eight, but you get vectorization so it’s likely worth it. (Though at NC==1000, you certainly don’t have enough parallelism for KNC, since 1000/8 == 125 and therefore you won’t get good scaling)."

0 Kudos
Matthew_S_1
Beginner
1,034 Views

Thanks Kevin and Jim for the feedback. I can confirm the problem - Kevin was right, it was (essentially) because the number of array elements being computed by each OpenMP thread wasn't a multiple of 8. To demonstrate: when I altered my original code to set NC = 2560 and employ 160 threads on the Phi device - equating to 16 elements per thread - there is no segmentation fault. I'm assuming the compiler is using __m512d vectors (holding 8 doubles) to perform the operations (e.g. _mm512_div_pd) and requires an integer number of these - otherwise, threads will start overlapping / interfering with each other. In my previous dealings with intrinsic functions used together with OpenMP on the Phi device, I always checked this manually. I wasn't sure if the compiler managed the details of this when directives were used, and now I know.
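
For reference, here's a minimal sketch of the kind of intrinsics kernel I mean (illustrative only: it assumes the arrays are 64-byte aligned and NC is a multiple of 8, since one __m512d holds 8 doubles, and the function name is just a placeholder):

    #include <immintrin.h>

    // Each iteration handles one full __m512d (8 doubles), so every thread's
    // first access stays 64-byte aligned as long as the arrays themselves are
    // 64-byte aligned and NC % 8 == 0.
    void update(double *p0, double *p1, double *u0, double *u1, int NC)
    {
        int i;
        #pragma omp parallel for
        for (i = 0; i < NC; i += 8) {
            __m512d v_u0 = _mm512_load_pd(&u0[i]);               // aligned load
            __m512d v_u1 = _mm512_load_pd(&u1[i]);
            _mm512_store_pd(&p0[i], v_u0);                       // p0[i] = u0[i]
            _mm512_store_pd(&p1[i], _mm512_div_pd(v_u1, v_u0));  // p1[i] = u1[i]/u0[i]
        }
    }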

As for the last comment from Jim - I'll play around with the #pragma omp for simd to check the timing etc. Thanks again for the feedback.

0 Kudos
Kevin_D_Intel
Employee
1,032 Views

I wanted to share this additional follow-up clarification from Development. They wrote:

The issue we are discussing is connected to only data-placement (whether or not first-access of each array is aligned inside each thread whenever the said loop is encountered by a thread’s execution). The problem is not whether threads will overlap and work on the same data, BUT whether or not the data is aligned inside each thread when execution enters the loop.

The operations performed inside the loop (divides, sqrt, multiplies, fmas etc.) are not relevant for this discussion - only the alignment of the first array access inside the loop for each thread (Note that the loop will go from my_tid_lower_bound to my_tid_upper_bound for each thread instead of from 0 to N after openmp parallelization).

The user can add an ASSERT(…) statement for the alignment of the arrays using each thread's lower bound inside the loop - without the refactoring of the code, the assertions will fail (due to OMP semantics for static scheduling if NC is not a multiple of 8).

Note that the refactoring is required/needed only when the user wants to add “#pragma vector aligned” for the loop (that gets parallelized and then vectorized at the same level). If that pragma is removed, then there is no stability problem even when NC is not a multiple of 8 - compiler handles all those details (and no refactoring to N1 is needed).
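
For illustration, here is a rough sketch of the kind of assertion they describe (names are placeholders; the lower-bound computation simply mimics default static scheduling by hand and is intended for diagnosis only):

    #include <assert.h>
    #include <stdint.h>
    #include <omp.h>

    // Check that each thread's first access of p0 inside the parallel loop is
    // 64-byte aligned. With default static scheduling the chunk is roughly
    // ceil(NC/num_threads); if that chunk is not a multiple of 8 doubles, the
    // assertion fails for some threads.
    void check_thread_alignment(const double *p0, int NC)
    {
        #pragma omp parallel
        {
            int nt    = omp_get_num_threads();
            int tid   = omp_get_thread_num();
            int chunk = (NC + nt - 1) / nt;   // approximate static chunk size
            int lb    = tid * chunk;          // this thread's lower bound
            if (lb < NC)
                assert(((uintptr_t)&p0[lb] % 64) == 0);
        }
    }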

0 Kudos
Ioan_Hadade
Beginner
1,034 Views

Hi guys,

I've been searching for a while now and this issue reflects perfectly the one I am having. I had originally guessed that the cause was the iteration-space division of OpenMP. I also thought that #pragma omp parallel simd would fix this issue since it would assure that the first access of each thread would be on an aligned address, but that doesn't seem to work for me. I still get seg faults, and debugging the code I found that the iteration-space division does indeed fall on unaligned addresses.

I just want to confirm whether #pragma omp simd guarantees that the work division takes alignment into account or not. It doesn't seem to on my compiler version (icc (ICC) 15.0.0 20140723). Yes, the code does work if I leave out #pragma vector aligned, but the point is that OpenMP generates unaligned accesses for certain thread counts, which hinders performance.

Should I use the previous example of controlling the chunk size manually depending on the number of threads allocated? I was hesitant to do that, as it would require a hefty change in our rather large solver.

Thank you guys.

0 Kudos
Ioan_Hadade
Beginner
1,034 Views

Also, as a further comment to the above: if I try to calculate a chunk size for the OMP scheduler and pass it as a clause, the compiler refuses to vectorize the loop, complaining that it is not in canonical form. The code in question is written in C++ and the loop trip count is not known at compile time, as it is read from a turbomachinery blade geometry file.

If anyone has any idea regarding this I would be very thankful.

Ioan

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,034 Views

>> #pragma omp parallel simd would fix this issue since it would assure that the first access of each thread would be on an aligned address

The pragma means: partition the iteration space such that each thread begins at a vector offset (or cache-line offset) from the beginning of the array(s) and extends to, and includes, an end-of-vector offset (except for a partial last partition). The process _assumes_ (requires) that the beginning of the array(s) is aligned.

Try adding align clause:

#pragma offload target(mic:0) inout(p0,p1,u0,u1:length(NC), align(64))
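
For example, a minimal sketch that combines this align clause with the omp parallel for simd version shown earlier (an illustration only, assuming the host arrays are 64-byte aligned as with the original posix_memalign calls):

    // align(64) asks the offload runtime to allocate the coprocessor-side
    // buffers on a 64-byte boundary; omp parallel for simd then both
    // parallelizes and vectorizes the loop.
    #pragma offload target(mic:0) inout(p0,p1,u0,u1:length(NC) align(64))
    {
        #pragma omp parallel for simd
        for (i = 0; i < NC; i++) {
            p0[i] = u0[i];
            p1[i] = u1[i]/u0[i];
        }
    }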

Jim Dempsey

0 Kudos
Ioan_Hadade
Beginner
1,034 Views

Hi Jim,

Thank you for your reply. The application is run natively rather than through an offload clause. Furthermore, I have an intrinsics version, parallelised through an omp parallel for with static scheduling, which works without a hitch with aligned loads and stores, as I make sure to pad to a multiple of 8 doubles on the longest-running dimension. The beginnings of the arrays are aligned, and I also tell the compiler that through the relevant assume clauses.

The code looks something like this:

         #pragma vector aligned
         #pragma omp parallel for simd private(iq,t,h,p,ro,dt,dp,dk,re,dro,dre,e,ke,ss,v1,v2,dv1,dv2,indx0,indx1,indx2,indx3) \ 
        firstprivate(cv,gam,rg,id0,id1,id2,id3) reduction(+:res0,res1,res2,res3) shared(srhs,sq,slhs,saux,nq) schedule(static) default(none)
        for(iq=0;iq<nq;iq++) 
        {
           // increment each section index by iteration step
            indx0 = id0 + iq;
            indx1 = id1 + iq;
            indx2 = id2 + iq;
            indx3 = id3 + iq;

            v1 = sq[indx0];
            v2 = sq[indx1];
            t  = sq[indx2];
            p  = sq[indx3];
            ro = saux[indx0];
            h  = saux[indx3];

            re = ro*h - p;
            e  = re/ro;

            // invdg
            // manual unrolling 
            srhs[indx0] /= slhs[indx0];
            srhs[indx1] /= slhs[indx0];
            srhs[indx2] /= slhs[indx0];
            srhs[indx3] /= slhs[indx0];
            // compute residuals
            res0 += fabs( srhs[indx0]*slhs[indx0]);
            res1 += fabs( srhs[indx1]*slhs[indx0]);
            res2 += fabs( srhs[indx2]*slhs[indx0]);
            res3 += fabs( srhs[indx3]*slhs[indx0]);

           dro= srhs[indx0];
           dre= srhs[indx3];
           dv1 = (srhs[indx1] - v1 * dro) / ro;
           dv2 = (srhs[indx2] - v2 * dro) / ro;
           // v1 and v2 here should be the old ones.
           dk = ((v1 * dv1) + (v2 * dv2)) * ro;

            //dk*= ro;
            dt= ( dre- e*dro- dk )/( ro*cv );
            dp= p*( dro/ro+ dt/t );

            sq[indx0] += dv1;
            sq[indx1] += dv2; 
            sq[indx2] += dt;
            sq[indx3] += dp;            

        }

The id0, id1, id2, id3 indices are computed before the loop, since we use linear storage for the vectors, and this makes sure the starting addresses are at the right offset and aligned on the 64-byte boundary. When I specifically tell the compiler that I want static scheduling with a chunk size that is a multiple of the SIMD length, it says it cannot vectorize the loop because it is not in canonical form. I guess it cannot work this out at compile time (I am using -O3 to give it the maximum available threshold). One of the issues might be that the size of the domain, nq, is read from a file at runtime and is also a data member of a C++ class. These seem to create some issues in OpenMP, as I have encountered some anomalies a while back.
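
The shape of what I am attempting is roughly the following sketch (hypothetical local names n and chunk; the real loop body is the one shown above):

        // Copy the trip count out of the class member into a local, and
        // precompute the chunk size (8 doubles = one 64-byte cache line),
        // so the loop itself stays in OpenMP canonical form.
        int n     = nq;
        int chunk = 64 / sizeof(double);

        #pragma omp parallel for simd schedule(static, chunk) \
            shared(srhs, sq, slhs, saux) firstprivate(id0, id1, id2, id3)
        for (int iq = 0; iq < n; iq++) {
            srhs[id0+iq] /= slhs[id0+iq];
            // ... rest of the loop body as above ...
        }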

I cannot use the simd aligned(srhs:64,saux:64,slhs:64,sq:64) clause, however, as the compiler complains that these variables are already in the parallel for shared clause. If I remove them from shared and drop default(none) so that they become implicitly shared, then add the aligned clause, the problem still doesn't disappear. I am trying to get my head around whether this is a bug in the implementation or whether I am doing something wrong.

Furthermore, at the beginning of the loop, I also specify the following:

        // extra alignment assumptions
        __assume_aligned(srhs, ALIGN);
        __assume_aligned(saux, ALIGN);
        __assume_aligned(slhs, ALIGN);
        __assume_aligned(sq, ALIGN);
 
        __assume(id0%64==0);
        __assume(id1%64==0); 
        __assume(id2%64==0);
        __assume(id3%64==0);

Also, memory is allocated through _mm_malloc with the right alignment and as earlier mentioned, it is aligned as the intrinsics version works without a hitch. I would really want to get the parallel autovectorization version to run on Phi since leaving intrinsics for such a simple kernel would pose problems in the future in terms of maintainability.

Thank you again for your consideration. 

Ioan


0 Kudos
jimdempseyatthecove
Honored Contributor III
1,034 Views

Ioan,

Two suggestions (apply in the order listed; a short sketch applying both follows):

a) use the restrict keyword on the array pointers (this assures no alias).
b) Remove the index helpers indx? and explicitly use the expression id?+iq (simplifies optimization)
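
A minimal sketch applying both suggestions to the divide part of your loop above (the function wrapper and types are assumptions; restrict needs C, or the Intel C++ compiler with -restrict or __restrict):

    // restrict-qualified pointers assert no aliasing, and the id?+iq
    // expressions are used directly instead of the indx? helpers.
    void invdg(double * restrict srhs, const double * restrict slhs,
               int id0, int id1, int id2, int id3, int nq)
    {
        int iq;
        #pragma omp parallel for simd schedule(static)
        for (iq = 0; iq < nq; iq++) {
            srhs[id0+iq] /= slhs[id0+iq];
            srhs[id1+iq] /= slhs[id0+iq];
            srhs[id2+iq] /= slhs[id0+iq];
            srhs[id3+iq] /= slhs[id0+iq];
        }
    }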

Jim Dempsey

0 Kudos
Ioan_Hadade
Beginner
1,034 Views

Thank you Jim.

The method already has restricted arguments with the -restrict compilation flag.

I also had a version of the code with explicit index computation with the same issue.

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,034 Views

Also:

 __assume(id0%64==0);
...

Though this may be correct for your application (code unseen), it is not the requirement.

__assume((id0 * sizeof(sq[0]))%64==0);
... is the requirement.

Jim Dempsey

0 Kudos
Ioan_Hadade
Beginner
1,033 Views

Thanks Jim.

I'll try those clauses and leave out #pragma vector aligned, hoping that the compiler will generate aligned loads by itself.

0 Kudos
Ioan_Hadade
Beginner
1,033 Views

Hi Jim,

Thank you for your efforts. I have followed the path you guided me down and played a little with the assume clauses. I realised that it is not a good idea to have a separate index counter in the main loop, so instead I used id0-id3 themselves as firstprivate and updated the indices that way. Furthermore, after adding these clauses:

        npadded = nq+PAD;

        // fld->dvar( 0,nq, q,aux, rhs,daux );
        id0 = 0 * npadded;
        id1 = 1 * npadded;
        id2 = 2 * npadded;
        id3 = 3 * npadded;

        // extra alignment assumptions
        __assume_aligned(srhs, ALIGN);
        __assume_aligned(saux, ALIGN);
        __assume_aligned(slhs, ALIGN);
        __assume_aligned(sq, ALIGN);
 
        __assume(id0%8==0);
        __assume(id1%8==0); 
        __assume(id2%8==0);
        __assume(id3%8==0);

        __assume((id0*sizeof(sq[0]))%64==0);
        __assume((id1*sizeof(sq[0]))%64==0);
        __assume((id2*sizeof(sq[0]))%64==0);
        __assume((id3*sizeof(sq[0]))%64==0);


        __assume((id0*sizeof(saux[0]))%64==0);
        __assume((id1*sizeof(saux[0]))%64==0);
        __assume((id2*sizeof(saux[0]))%64==0);
        __assume((id3*sizeof(saux[0]))%64==0);


        __assume((id0*sizeof(srhs[0]))%64==0);
        __assume((id1*sizeof(srhs[0]))%64==0);
        __assume((id2*sizeof(srhs[0]))%64==0);
        __assume((id3*sizeof(srhs[0]))%64==0);


        __assume((id0*sizeof(slhs[0]))%64==0);

        __assume(npadded%8==0);
        __assume(npadded%ALIGN==0);

the compiler managed to generate a peel loop and then the main loop, all with beautifully aligned loads and stores, without the need for #pragma vector aligned. Another subtle change I made was to make sure that the padded size is a local variable of the method and not accessed directly from the main object's data member. I guess what I learned is that one has to help out the compiler as much as possible in these cases.

LOOP BEGIN at domain.cpp(1611,10)
   remark #15388: vectorization support: reference sq has aligned access   [ domain.cpp(1622,13) ]
   remark #15388: vectorization support: reference sq has aligned access   [ domain.cpp(1623,13) ]
   remark #15388: vectorization support: reference sq has aligned access   [ domain.cpp(1624,13) ]
   remark #15388: vectorization support: reference sq has aligned access   [ domain.cpp(1625,13) ]

One small note, however: using #pragma vector aligned still does not work (it seg faults), but I can live with that, as the version with an unaligned peel loop and an aligned main body performs almost the same as the intrinsics version.

Thanks again Jim for your help. 

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,033 Views

Ioan,

>>Another subtle change I made was to make sure that the padded size is a local variable of the method and not directly accessed from the main object's data member.

Interesting discovery. I will have to keep this in mind. Good detective work.

If you have a small reproducer see if you can submit it for review and comment or fix.

Jim

0 Kudos
Reply