Software Archive
Read-only legacy content

How to do dot product in KNC efficiently?

YW
Beginner

In SSE 4.1 we can use _mm_dp_ps. How about in KNC?

Thanks!

TimP
Honored Contributor III

I've submitted the suggested report about being blocked from attempting to reply.

YW
Beginner

Tim Prince wrote:

I've submitted the suggested report about being blocked from attempting to reply.

Looks like the thread can be replied to now. I appreciate any advice. Thanks!

TimP
Honored Contributor III

I suppose your efficiency rating is based on the minimum number of instructions required, once the data are packed into a register.  Most people rate efficiency in terms of application performance, and are content to let a compiler find a better way, for example the implementation of inner_product() or dot_product, where the final sum reduction is performed in binary-tree fashion.

If you prefer the KNC horizontal add, there is some description at

https://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-vector-microarchitecture
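If you do go the intrinsics route, a minimal sketch of a dot product with the KNC intrinsics might look like the following (assuming 64-byte-aligned inputs and a length that is a multiple of 16; the function name is illustrative):

#include <immintrin.h>

float dot_knc(const float *a, const float *b, int n)
{
  __m512 acc = _mm512_setzero_ps();
  for (int i = 0; i < n; i += 16) {
    __m512 va = _mm512_load_ps(a + i);   /* aligned 16-float load */
    __m512 vb = _mm512_load_ps(b + i);
    acc = _mm512_fmadd_ps(va, vb, acc);  /* acc += va * vb, elementwise */
  }
  return _mm512_reduce_add_ps(acc);      /* horizontal sum of the 16 lanes */
}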

 

YW
Beginner

Tim Prince wrote:

I suppose your efficiency rating is based on the minimum number of instructions required, once the data are packed into a register.  Most people rate efficiency in terms of application performance, and are content to let a compiler find a better way, for example the implementation of inner_product() or dot_product, where the final sum reduction is performed in binary-tree fashion.

If you prefer the KNC horizontal add, there is some description at

https://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-vec...

 

Tim, thanks for your reply. I would use FLOPS to measure efficiency here. I wrote simple dot-product functions in C to run on the MIC but only got about 8 GFLOPS using all cores for single-precision FP (0.4% of peak performance), i.e., relying on the compiler alone is not promising here. And MKL doesn't help in this case either, probably because my vector length is too short... Therefore, I think using intrinsics may be the way to improve the performance.

I will read through the material you suggested to see if I can leverage the VPUs better.

jimdempseyatthecove
Honored Contributor III

If you post your code, we can enlighten you as to any problems you may have with it.

Jim Dempsey

YW
Beginner

jimdempseyatthecove wrote:

If you post your code, we can enlighten you as to any problems you may have with it.

Jim Dempsey

void vectorMatMultiply(void* data1, int mat_size, void* data2, int vec_size, void* output_data, int output_size)
{
  memset(output_data, 0, output_size);
  float* mat=(float*)data1;
  float* vec=(float*)data2;
  float* output=(float*)output_data;
  int col = vec_size/sizeof(float);
  int row = mat_size/sizeof(float)/col;
  for (int i=0; i<row; i++)
  {
    for (int j=0; j<col; j++)
    {
      output[i] += mat[i*col+j]*vec[j];
    }
  }
  return;
}
  int count=0;
  while (count<iter)
  {
    float *mat=total_mat+count*ROW*COL;
    #pragma omp parallel for private(j)
    for (i=0; i<nVec; i++)
    {
      (*vectorMatMultiply)((void*)mat, sizeof(float)*COL*ROW, (void*)(mat+i*COL), sizeof(float)*COL, (void*)(corrs+i*ROW+count*nVec*ROW), sizeof(float)*ROW);
    }
    count++;
  }

Thanks!

Jeongnim_K_Intel1

I would suggest letting the compiler vectorize it, in combination with a few simple things:

* align the data: if COL%16 != 0, use a padded COL_MAX that is a multiple of 16

* add restrict & const qualifiers

* use float* instead of void*: void -> float casting is seldom needed

* insert #pragma omp simd on the j loop

With the 15.0 compilers, you should be able to get a pretty well-optimized mat-vec of this type. You can have a look at the assembly generated by the compiler and learn which intrinsics are used, once you take care of these basics.
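
A minimal sketch of these points (the signature and names below are illustrative, not your exact interface):

void vectorMatMultiply(const float* restrict mat, const float* restrict vec,
                       float* restrict output, int row, int col)
{
  /* restrict promises the compiler that mat, vec and output do not overlap;
     const marks the inputs as read-only. */
  for (int i = 0; i < row; i++)
  {
    float sum = 0.0f;
#pragma omp simd reduction(+:sum)
    for (int j = 0; j < col; j++)      /* col padded up to a multiple of 16 */
      sum += mat[i*col + j] * vec[j];
    output[i] = sum;
  }
}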

If mat-vec is your target and COL and ROW are large, using MKL sgemv for float will give you good performance automatically.
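
For reference, a hedged sketch of that call (row-major y = A*x; the wrapper name and arguments here are illustrative):

#include <mkl.h>

void vectorMatMultiply_mkl(const float *mat, const float *vec,
                           float *output, int row, int col)
{
  /* y = alpha*A*x + beta*y, with A stored row-major as row x col */
  cblas_sgemv(CblasRowMajor, CblasNoTrans, row, col,
              1.0f,                /* alpha   */
              mat, col,            /* A, lda  */
              vec, 1,              /* x, incx */
              0.0f,                /* beta    */
              output, 1);          /* y, incy */
}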

YW
Beginner

JEONGNIM K. (Intel) wrote:

I would suggest letting the compiler vectorize it, in combination with a few simple things:

* align the data: if COL%16 != 0, use a padded COL_MAX that is a multiple of 16

* add restrict & const qualifiers

* use float* instead of void*: void -> float casting is seldom needed

* insert #pragma omp simd on the j loop

With the 15.0 compilers, you should be able to get a pretty well-optimized mat-vec of this type. You can have a look at the assembly generated by the compiler and learn which intrinsics are used, once you take care of these basics.

If mat-vec is your target and COL and ROW are large, using MKL sgemv for float will give you good performance automatically.

Thanks, I will give it a try! But what do you mean by "add restrict & const"?

Also, do I have to use icc 15? My current version is icc 14.0.0.

TimP
Honored Contributor III

With MKL, inner_product(), or a loop organized for #pragma omp simd reduction(+: ) (or even a Cilk reducer), there may be no need for __restrict qualifiers here, but the idea is well taken.

As Jeongnim said, the MKL substitution should produce reasonable performance with much less pain, given that vector-parallel compilation is needed for reasonable performance on MIC.  The current MKL introduced additional features to achieve good performance for "small" matrices (I suppose dimensions of 20 to 50 or so), but the suggestions would be applicable to 14.0 as well.

YW
Beginner

Tim Prince wrote:

With MKL, inner_product(), or a loop organized for #pragma omp simd reduction(+: ) (or even a Cilk reducer), there may be no need for __restrict qualifiers here, but the idea is well taken.

As Jeongnim said, the MKL substitution should produce reasonable performance with much less pain, given that vector-parallel compilation is needed for reasonable performance on MIC.  The current MKL introduced additional features to achieve good performance for "small" matrices (I suppose dimensions of 20 to 50 or so), but the suggestions would be applicable to 14.0 as well.

cblas_sgemm of MKL doesn't speed up in my case probably because my COL here is only about 12.

I still don't understand the "restrict" part. Could you elaborate?

James_C_Intel2
Employee

I still don't understand the "restrict" part. Could you elaborate?

Wikipedia has a reasonable explanation.
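
In short (a toy illustration, not your code): without restrict the compiler must assume the pointers might alias and be conservative; with it, you promise they refer to non-overlapping storage, which lets it vectorize freely.

void saxpy(int n, float a, const float* restrict x, float* restrict y)
{
  /* restrict: x and y never overlap, so y can be updated with full
     vectorization and no runtime overlap checks. */
  for (int i = 0; i < n; i++)
    y[i] += a * x[i];
}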

YW
Beginner

Hi guys,

I rewrote the code following your suggestions, as below. Note that COL=16, so the data is aligned. I still get quite bad performance (although slightly better), at about 11 GFLOPS. Also, MKL's sgemv performs three times worse in this case. Any further suggestions? Do the outer loops over iter and nVec matter a lot?

void vectorMatMultiply(float* mat, int mat_size, float* vec, int vec_size, float* output, int output_size)
{
  memset((float*)output, 0, output_size);
  int col = vec_size/sizeof(float);
  int row = mat_size/sizeof(float)/col;
  for (int i=0; i<row; i++)
  {
    #pragma omp simd reduction (+:output)
    for (int j=0; j<col; j++)
    {
      output[i] += mat[i*col+j]*vec[j];
    }
  }
  return;
}
int count=0;
  while (count<iter)
  {
    float *mat=total_mat+count*ROW*COL;
    #pragma omp parallel for private(j)
    for (i=0; i<nVec; i++)
    {
      (*vectorMatMultiply)(mat, sizeof(float)*COL*ROW, mat+i*COL, sizeof(float)*COL, corrs+i*ROW+count*nVec*ROW, sizeof(float)*ROW);
    }
    count++;
  }

Thanks in advance!

TimP
Honored Contributor III

You should write the omp simd reduction with an explicit scalar sum variable, or use inner_product:

 for (int i=0; i<row; i++)
  {
    float sum = 0;
#pragma omp simd reduction (+:sum)
    for (int j=0; j<col; j++)
      sum += mat[i*col+j]*vec[j];
    output[i] = sum;
  }

and, since you are using a relatively unreliable version of icc, at least check the vectorization report.

If you don't set -openmp or -openmp-simd (or don't place the pragma correctly), you should get a warning about an unimplemented pragma.  With current icc, the spelling has changed to -qopenmp/-qopenmp-simd.

You can assert the alignment of vec[] with the omp simd aligned clause (if it's not actually aligned, it will fail at run time).  Alignment of mat[] is more difficult.  With such a short loop, you won't get a large fraction of potential peak performance.  You could place a #pragma loop_count(16) for good luck.
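
For example (a sketch, assuming vec[] was allocated with 64-byte alignment, e.g. via _mm_malloc):

    float sum = 0;
#pragma omp simd reduction(+:sum) aligned(vec:64)
    for (int j = 0; j < col; j++)
      sum += mat[i*col + j] * vec[j];
    output[i] = sum;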

 

YW
Beginner

Tim Prince wrote:

You should write the omp simd reduction with an explicit scalar sum variable, or use inner_product:

 for (int i=0; i<row; i++)
  {
    float sum = 0;
#pragma omp simd reduction (+:sum)
    for (int j=0; j<col; j++)
      sum += mat[i*col+j]*vec[j];
    output[i] = sum;
  }

and, since you are using a relatively unreliable version of icc, at least check the vectorization report.

If you don't set -openmp or -openmp-simd (or don't place the pragma correctly), you should get a warning about an unimplemented pragma.  With current icc, the spelling has changed to -qopenmp/-qopenmp-simd.

 

Adding the scalar sum doesn't improve the performance. The following is the vectorization report I got for the corresponding lines. I used -fopenmp in the Makefile.

vec_mat.c(30): (col. 5) remark: vectorization support: unroll factor set to 8
vec_mat.c(30): (col. 5) remark: OpenMP SIMD LOOP WAS VECTORIZED
vec_mat.c(32): (col. 7) remark: vectorization support: reference mat has unaligned access
vec_mat.c(32): (col. 7) remark: vectorization support: reference vec has unaligned access
vec_mat.c(32): (col. 7) remark: vectorization support: unaligned access used inside loop body
vec_mat.c(30): (col. 5) remark: PEEL LOOP WAS VECTORIZED
vec_mat.c(32): (col. 7) remark: vectorization support: reference mat has unaligned access
vec_mat.c(32): (col. 7) remark: vectorization support: reference vec has aligned access
vec_mat.c(32): (col. 7) remark: vectorization support: reference mat has unaligned access
vec_mat.c(32): (col. 7) remark: vectorization support: reference vec has aligned access
vec_mat.c(32): (col. 7) remark: vectorization support: unaligned access used inside loop body
vec_mat.c(30): (col. 5) remark: REMAINDER LOOP WAS VECTORIZED

 

YW
Beginner

Also, I found from micsmc that my program generates non-negligible system CPU usage, as shown in the attached image. Could you please help me figure out what it is and how to reduce it?

Thanks!

 

McCalpinJohn
Honored Contributor III

You mentioned that the value of "col" was originally 12 and has been increased to 16.  What about the value for "row" ?

Comment 1: There is no way that this code will ever run well with a vector length of 12-16 because the SIMD vector width of the machine is 16.  It only takes one cycle to do the 12 (or 16) multiplications for the inner loop, but it will take at least 4 shift operations plus 4 vector add operations to compute the sum of the 12 (or 16) product values for each row.  This gets you down under 10% of peak performance even if everything is perfectly aligned, hand-coded, and all the data is in the L1 cache.   Performance will go down if the data is not in the L1 cache, or if the compiler cannot guarantee that everything is aligned, or if extra instructions are needed to implement the shift functionality in the horizontal summation.

Comment 2: I don't know how big these problems are, but you should be aware that OpenMP synchronization operations have relatively high overheads.  I measured the overhead of an OpenMP "parallel for" region on 240 threads (60 cores * 4 threads/core) at over 20,000 cycles.  Unless you have several hundred thousand cycles of real work for each OpenMP thread, the results will be biased by OpenMP synchronization overhead.

Comment 3:  For a long dot product (i.e., for two vectors each larger than the aggregate L2 cache), performance will be limited by sustainable memory bandwidth.   Typical values for main memory read bandwidth are in the 160 GB/s range.  SDOT requires 2 elements to be loaded for each FP multiply-add, so 160 GB/s corresponds to ~40 GFLOPS in single precision and ~20 GFLOPS in double precision.

Performance could be higher for L2-contained data.  Each core can sustain a load bandwidth of approximately 8 cache lines every 24 cycles, or ~5.3 floats/cycle, which corresponds to ~5.8 GFLOPS/core at 1.1 GHz.  Scaling to 60 cores gives ~350 GFLOPS (single precision).  For these smaller sizes you need to be careful with OpenMP synchronization overheads, since the per-thread computations finish very quickly: e.g., 0.5 MiB per core at ~24 GB/s per core takes about 20 microseconds, which is about the same order of magnitude as the OpenMP parallel-for overhead.

YW
Beginner

John D. McCalpin wrote:

You mentioned that the value of "col" was originally 12 and has been increased to 16.  What about the value for "row" ?

Comment 1: There is no way that this code will ever run well with a vector length of 12-16 because the SIMD vector width of the machine is 16.  It only takes one cycle to do the 12 (or 16) multiplications for the inner loop, but it will take at least 4 shift operations plus 4 vector add operations to compute the sum of the 12 (or 16) product values for each row.  This gets you down under 10% of peak performance even if everything is perfectly aligned, hand-coded, and all the data is in the L1 cache.   Performance will go down if the data is not in the L1 cache, or if the compiler cannot guarantee that everything is aligned, or if extra instructions are needed to implement the shift functionality in the horizontal summation.

Comment 2: I don't know how big these problems are, but you should be aware that OpenMP synchronization operations have relatively high overheads.  I measured the overhead of an OpenMP "parallel for" region on 240 threads (60 cores * 4 threads/core) at over 20,000 cycles.  Unless you have several hundred thousand cycles of real work for each OpenMP thread, the results will be biased by OpenMP synchronization overhead.

Comment 3:  For a long dot product (i.e., for two vectors each larger than the aggregate L2 cache), performance will be limited by sustainable memory bandwidth.   Typical values for main memory read bandwidth are in the 160 GB/s range.  SDOT requires 2 elements to be loaded for each FP multiply-add, so 160 GB/s corresponds to ~40 GFLOPS in single precision and ~20 GFLOPS in double precision.

Performance could be higher for L2-contained data.  Each core can sustain a load bandwidth of approximately 8 cache lines every 24 cycles, or ~5.3 floats/cycle, which corresponds to ~5.8 GFLOPS/core at 1.1 GHz.  Scaling to 60 cores gives ~350 GFLOPS (single precision).  For these smaller sizes you need to be careful with OpenMP synchronization overheads, since the per-thread computations finish very quickly: e.g., 0.5 MiB per core at ~24 GB/s per core takes about 20 microseconds, which is about the same order of magnitude as the OpenMP parallel-for overhead.

Thanks for your comments, they are very helpful. The ROW in my case is at the 30,000-40,000 level.

I admit that since the vector length is very limited (~16), I cannot get very good performance. But even 10% of peak would be 200 GFLOPS (single precision); what I am getting now is 12 GFLOPS...

I guess the system usage of my running program comes from the OpenMP overhead; maybe I should assign more work to each OpenMP thread to avoid the high system usage rate?

YW
Beginner

I figured out that my program is actually memory bound.

int count=0;
  while (count<iter)
  {
    float *mat=total_mat+count*ROW*COL;
    #pragma omp parallel for
    for (i=0; i<nVec; i++)
    {
      vectorMatMultiply(mat, sizeof(float)*COL*ROW, mat+i*COL, sizeof(float)*COL, corrs+i*ROW+count*nVec*ROW, sizeof(float)*ROW);
    }
    count++;
  }

For the code above, if I don't change mat in every iteration (i.e., change line 4 to: float *mat=total_mat;), and don't change the output address in every iteration (i.e., change line 8's corrs+i*ROW+count*nVec*ROW to corrs+i*ROW), I can improve the performance by almost a factor of 4, and I think this is the best possible performance I can get given my small COL.

So the question is, for this memory bound program, is it possible to improve the performance? Will prefetch help?

Thanks!

McCalpinJohn
Honored Contributor III

With "row" in the range of 30,000 to 40,000 and "col" set to 16, then the "mat" array will occupy 1.92e6 to 2.56e6 Bytes, while the "output" array will occupy 0.12e6 to 0.16e6 Bytes.  It looks like the original code uses completely independent "mat", "output", and "vec" arrays, so these are the sizes per OpenMP thread.  With even one OpenMP thread per core these are significantly larger than the 0.52e6 Bytes of L2 cache per core, so the "mat" array will be read from memory while the "output" array will be read from memory and re-written to memory (albeit at a much slower rate).

Having all of the cores read a single "mat" array reduces the footprint to 2.56e6/60 = 42,667 Bytes/core, which is larger than L1 but smaller than L2.  The output array needs to be handled more carefully since it is being updated, but if you distribute the updates across the cores it requires 0.16e6/60 = ~2,667 Bytes per core, which still fits comfortably in the L2 cache.

As I noted in my "comment 3", the sustainable memory bandwidth of the Xeon Phi is typically in the 160 GB/s range, while the sustainable L2 bandwidth is ~23GB/s per core, or ~1400 GB/s for 60 cores.  This is a speedup of almost 9x.  So it is not surprising that you are seeing a speedup when reducing the array size in this way.

It is common for bottlenecks in the core to reduce the number of outstanding cache misses and therefore to reduce the sustained memory bandwidth.  If this is the case then tuning the software prefetches may help.  If the software prefetches are helpful, then using large pages is likely to help as well.
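
For instance, one way to experiment with software prefetch is icc's prefetch pragma; the hint and distance values below are illustrative starting points, not tuned recommendations:

  /* ask the compiler to prefetch rows of mat a few outer iterations ahead */
  #pragma prefetch mat:1:8
  for (int i = 0; i < row; i++)
  {
    float sum = 0.0f;
#pragma omp simd reduction(+:sum)
    for (int j = 0; j < col; j++)
      sum += mat[i*col + j] * vec[j];
    output[i] = sum;
  }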
