Software Archive
Read-only legacy content

MKL versus OpenMP

Matt_S_
Beginner

It appears that MKL is not optimized for the MIC and is much slower there than on the CPU. Computing C = A*A' (A is oblong, with many more columns than rows, stored as an nrows*ncols vector) with cblas_dsyrk gives a serial runtime of around 1.1065459251 seconds on my CPU. My own version,

   #pragma omp parallel private(j,k)
   #pragma omp for nowait
   for (i=0; i<nrows; i++)
   {
      int idxr = i*nrows;  /* row offset into the nrows x nrows result C */
      int idxc = i*ncols;  /* row offset into the nrows x ncols input A */
      for (j=0; j<=i; j++) /* lower triangle only; C is symmetric */
      {
         int jdxc = j*ncols;
         #pragma ivdep
         for (k=0; k<ncols; k++)
         {
            C[idxr + j] += A[jdxc + k] * A[idxc + k];
         }
      }
   }

runs in around 6.8782911301 seconds on my CPU. On the MIC, I get a serial cblas_dsyrk runtime of around 8.2774429321 seconds and my version runs in around 34.6947548389 seconds.

Fastest runtime for my version on the MIC is 1.1892209053 seconds on 228 threads. Fastest runtime for my version on the CPU is 1.6409392357 seconds on 24 threads.

Fastest runtime for cblas_dsyrk on the MIC is 1.5085399151 seconds on 228 threads. Fastest runtime for cblas_dsyrk on the CPU is 0.0766689777 seconds on 96 threads.

I have 2 questions:

  1. Is there any way to make MKL run as fast on the MIC as it does on the CPU?
  2. As an alternative, is there any way to make my version run as fast on the MIC as the MKL equivalent does on the CPU? (i.e., are there any modifications I can make to speed it up?)

Thanks in advance!

TimP
Honored Contributor III

You didn't say which version of MKL or which specific sizes you are referring to. Past versions required min(rows, cols) > 32 to enter the region of optimization where MKL is as fast as ifort- or icc-generated code. The latest (beta) version paid attention to improvements in that size range. Small matrices can't use all the MIC threads effectively, but I don't know whether reducing the number of threads could match the host.

The code you show might need on the order of 2000 rows and columns to approach the full performance of 3 threads per core, and I have some doubt about whether it would unroll and jam efficiently (accumulate at least 4 values of j in parallel in each thread). #pragma unroll_and_jam might help, but I suppose people optimize these things with intrinsics.

I assume you are referring to MIC native execution, as automatic offload will not offload unless min(rows, cols) is large.
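For reference, a minimal sketch of switching automatic offload on from code, assuming an MKL 11.x install with Xeon Phi support (setting the MKL_MIC_ENABLE=1 environment variable is equivalent):

   #include <mkl.h>

   int main(void)
   {
      mkl_mic_enable();   /* enable Automatic Offload for subsequent MKL calls */
      /* ... cblas_dsyrk() here may offload, but only for large enough sizes ... */
      mkl_mic_disable();  /* revert to host-only execution */
      return 0;
   }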

Matt_S_
Beginner

My version of ICC is 14.0.3 and MKL is 11.1.3. My sizes range between nrows = 300-400 and ncols = 150000-600000. I am indeed currently running natively.

TimP
Honored Contributor III

Your compiler and MKL versions ought to be good, although I have a case somewhat resembling your example where I must use inner_product to optimize. 15.0 improves on this by optimizing the inner loop under #pragma omp simd reduction(+: ) (the pragma may not be needed).

At a glance, your values of nrows indicate an average j loop count < 200 (which would be reduced by any unroll and jam) which may complicate attempts at work balancing in the code sample you show.  Due to the variable length, the compiler may not heed pragma unroll_and_jam (which is active only at -O3) and it might not be productive, which means that the compiler should "riffle" the inner loop (batching the sums into multiple simd parallel accumulators).

The easiest way to experiment with improving work balance might be to add schedule(runtime) and experiment with OMP_SCHEDULE=static, dynamic, auto, and guided, with various chunk sizes. It would be feasible, although a bit ugly, to calculate explicitly in your program how to distribute the work more evenly across threads using the default static schedule, but I wouldn't bet on that coming out ahead.
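For example, a minimal sketch (untested) of the schedule(runtime) variant, applied to the loop from the original post:

   #pragma omp parallel private(j,k)
   #pragma omp for schedule(runtime) nowait  /* schedule picked at run time */
   for (i=0; i<nrows; i++)
   {
      /* triangular body as before */
   }

The schedule is then selected per run through the environment, e.g. OMP_SCHEDULE="guided,8" or OMP_SCHEDULE="dynamic,4".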
 

Matt_S_
Beginner

Tim, do you have any specifics to show for "optimizing inner loop under #pragma omp simd reduction(+: ) (pragma may not be needed)"? Also, what is "inner_product"? Is this an intrinsic call?

Thanks!

TimP
Honored Contributor III

You would look for the inner loop to report vectorization under -opt-report. The C++ STL inner_product has been optimizing most reliably up to now; it works with plain pointers as well as with .begin()/.end(). I wonder whether omp simd reduction will make it into any style guidelines.
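As an illustration of the reduction form described above, a sketch (untested) against the inner k loop from the original post, with csum as a hypothetical local accumulator:

   /* Explicit reduction lets the compiler vectorize the sum without ivdep */
   double csum = 0.0;
   #pragma omp simd reduction(+:csum)
   for (k=0; k<ncols; k++)
      csum += A[jdxc + k] * A[idxc + k];
   C[idxr + j] += csum;
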
Matt_S_
Beginner

TEST.c(206): (col. 10) remark: LOOP WAS VECTORIZED
TEST.c(201): (col. 7) remark: loop was not vectorized: not inner loop
TEST.c(197): (col. 4) remark: loop was not vectorized: not inner loop

Above is the output of -vec-report=3; here are the corresponding loops:

   #pragma omp parallel private(j,k)
   #pragma omp for nowait 
   for (i=0; i<nrows; i++) /* 197 */
   {
      int idxr = i*nrows;
      int idxc = i*ncols; 
      for (j=0; j<=i; j++) /* 201 */
      {
         int jdxc = j*ncols;
         #pragma vector aligned
         #pragma ivdep
         for (k=0; k<ncols; k++) /* 206 */
         {
            C[idxr + j] += A[jdxc + k] * A[idxc + k];
         }
      }
   }

 

TimP
Honored Contributor III

So you got a satisfactory optimization report.   You have asserted that all the arrays are 64-byte aligned each time the inner loop is entered (although the alignment of C[] may be ignored).

De-coupling the inner loop from C[] would eliminate the need for ivdep and the concerns about the alignment of C[]:

   double csum = 0;
   for (k=0; k<ncols; k++)
      csum += A[jdxc + k] * A[idxc + k];
   C[idxr + j] += csum;

On many host targets, you may not get caught using vector aligned as a synonym for vector unaligned, but it's more likely to fail on MIC, while vector unaligned probably has more restricted utility on MIC.

jimdempseyatthecove
Honored Contributor III

You really should use for(int i=...), for(int j=...), for(int k=...).

It prevents the oversight of a missing private clause (you didn't miss one), and it avoids the copy-out to the enclosing scope of the parallel region.

The performance issue, as I see it, relates to the oblong-ness of the problem. You effectively have a triangle-shaped workload, where each thread receives an equal-length slice along the a or b leg, and where the amount of work each thread gets is the distance from its slice position, perpendicular to the hypotenuse h.

What you need to do is change your outer loop to even out the workload, something like the sketched (untested) code below:

   int work = nrows * ncols / 2; // Area of triangle == work
   // test for and adjust for potential bit loss by /2 above
   if((nrows * ncols) & 1)
      work += min(nrows, ncols);
   #pragma omp parallel
   #pragma omp for nowait
   for(int mywork=0; mywork<work; ++mywork) // evenly partition work
   {
      int idxr = mywork / ncols; // row index
      if(idxr == nrows) break;
      int idxc = mywork % ncols; // column index
      for (int j=0; j<=idxr; j++)
      {
         int jdxc = j*ncols;
         #pragma vector aligned
         #pragma ivdep
         for (int k=0; k<ncols; k++)
         {
            C[idxr + j] += A[jdxc + k] * A[idxc + k];
         }
      }
   }

Jim Dempsey

Matt_S_
Beginner

Combining what both Tim and Jim have said, I've come up with this:

   int bsize = ncols/16;
   for (kk = 0; kk < ncols; kk += bsize)
   {
      for (i = 0; i < nrows; i++)
      {
         #pragma omp parallel private(k, sum)
         #pragma omp for nowait
         for (j = 0; j <= i; j++)
         {
            sum = C[i*nrows + j];
            #pragma simd reduction(+:sum)
            for (k = kk; k < kk + bsize; k++)
            {
               sum += A[i*ncols + k] * A[j*ncols + k];
            }
            C[i*nrows + j] = sum;
         }
      }
   } 

It turns out that this program is fastest when the for pragma lies on the j loop, for whatever reason. Jim, your untested version had a segfault somewhere, but I believe the above snippet achieves what you were trying to do (cutting the problem into nrows x bsize chunks to better fit the cache). After playing around, a chunk size of bsize = ncols/16 ran fastest. This loop is now running in around 900 ms, but that still isn't as good as MKL on the CPU by a long shot. The only thing I can think of without getting too hairy is doing some prefetching. Any thoughts?

Thank you all for the input!

TimP
Honored Contributor III

I suppose moving the omp for to the inner loop may be an adequate solution for the problem of triangular work distribution incurred by putting it on the outer loop.

As A[j*ncols + k] is accessed repeatedly, a strategy to minimize its cache evictions should be more useful than prefetching. Perhaps your cache blocking has accomplished that. The other operand ought to be prefetched adequately by default, except that you could play with the first- and second-level prefetch distances and possibly the initial prefetch value (more of them prefetched prior to entering the inner loop).
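For illustration, a sketch (untested) of the kind of prefetch-distance tuning described above, using the Intel compiler's prefetch pragma on the inner loop from the blocked version; the distances here are placeholders to experiment with, not recommendations (the -opt-prefetch-distance=n1,n2 compiler option adjusts the defaults globally):

   #pragma prefetch A:1:16   /* prefetch A into L2, 16 iterations ahead */
   #pragma prefetch A:0:4    /* and into L1, 4 iterations ahead */
   #pragma simd reduction(+:sum)
   for (k = kk; k < kk + bsize; k++)
   {
      sum += A[i*ncols + k] * A[j*ncols + k];
   }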
 

James_C_Intel2
Employee

That code seems slightly perverse, since there is an overhead to #pragma omp parallel which you are paying many times. It would likely be faster simply to execute the outer loops redundantly in each thread, since there is no real work in them.

Something like this (typed in here, not compiled, not tested):

int bsize = ncols/16;
#pragma omp parallel 
for (kk = 0; kk < ncols; kk += bsize)
{
   int i;
   for (i = 0; i < nrows; i++)
   {
      int j;
      #pragma omp for nowait
      for (j = 0; j <= i; j++)
      {
         int k;
         double sum = C[i*nrows + j];
         #pragma simd reduction(+:sum)
         for (k = kk; k < kk + bsize; k++)
         {
            sum += A[i*ncols + k] * A[j*ncols + k];
         }
         C[i*nrows + j] = sum;
      }
   }
}

 

jimdempseyatthecove
Honored Contributor III

Matt,

You are right about the seg fault. After I submitted that post I left the office, and only then determined that the submitted code was incorrect.

The problem with the submitted code is that the idxr, idxc extraction from mywork produces indices over a square-shaped address space, which isn't right. A better function needs to be written that converts the linear index mywork into the two triangular indices idxr and idxc. Other than that, the premise of the suggestion is sound.
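Something along these lines (again untested, offered only as a sketch) should do that conversion, mapping a linear work index t in [0, n*(n+1)/2) onto triangular indices (i, j) with 0 <= j <= i < n, so each (i, j) dot product becomes one unit of evenly distributable work:

   #include <math.h>

   static void tri_index(long t, int *i, int *j)
   {
      /* i is the largest integer with i*(i+1)/2 <= t */
      long r = (long)((sqrt(8.0*(double)t + 1.0) - 1.0) / 2.0);
      while (r*(r + 1)/2 > t) --r;           /* guard against fp rounding */
      while ((r + 1)*(r + 2)/2 <= t) ++r;
      *i = (int)r;
      *j = (int)(t - r*(r + 1)/2);
   }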

Jim Dempsey

Matt_S_
Beginner

James, here is an updated version per your suggestion, which runs much faster! It's down to 0.3573529720 seconds on the MIC with 57 threads (1/core), versus 0.2501699924 seconds on the CPU with 24 threads (1/core).

   int bsize = ncols/25;
   #pragma omp parallel
   for (int kk = 0; kk < ncols; kk += bsize)
   {
      for (int i = 0; i < nrows; i++)
      {
         #pragma omp for nowait
         for (int j = 0; j <= i; j++)
         {
            double sum = C[i*nrows + j];
            #pragma simd reduction(+:sum)
            for (int k = kk; k < kk + bsize; k++)
            {
               sum += A[i*ncols + k] * A[j*ncols + k];
            }
            C[i*nrows + j] = sum;
         }
      }
   }

Jim, does this version perform the same intended split-up as your suggestion? If not, how does it differ?

Thanks again for the help, hope everyone had a good 4th of July.

Matt_S_
Beginner

Also, does anyone happen to have the official FLOP count of cblas_dsyrk? I cannot find it anywhere online.

Thanks again!
