CGEMM performance strangeness on Haswell CPUs vs. Sandy Bridge

Henrik_S_1 · ‎02-20-2015

Hi All

I have investigated the performance of CGEMM on both my own Sandy Bridge CPU and my colleague's newer computer with a Haswell CPU. Performance is measured as the number of complex multiply-accumulate operations per second, denoted here as CMacs. I don't use any scaling and I don't add the previous matrix, so I only calculate C = A * B.

The setup:

The number of Rows in A = 2^16

Number of columns in A = 16

Number of columns in B = 256

GCMacs = A_r * A_c * B_c / time * 1e-9
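The metric above can be sketched in a few lines. The thread calls a helper named time2GCMac but never shows its body, so the version below is an assumed reconstruction from the formula; the constant and function names other than time2GCMac are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Problem dimensions taken from the setup above.
constexpr std::int64_t kRowsA = std::int64_t{1} << 16;  // rows of A (2^16)
constexpr std::int64_t kColsA = 16;                     // columns of A
constexpr std::int64_t kColsB = 256;                    // columns of B

// Number of complex multiply-accumulates in one C = A * B product.
constexpr std::int64_t cmacsPerProduct() {
    return kRowsA * kColsA * kColsB;
}

// Assumed reconstruction of the thread's time2GCMac helper:
// converts the time of one product (in seconds) into GCMacs.
inline double time2GCMac(double secondsPerProduct) {
    assert(secondsPerProduct > 0.0);
    return static_cast<double>(cmacsPerProduct()) / secondsPerProduct * 1e-9;
}
```

One product is 2^28 (about 0.268 billion) CMacs, so a rate of 3.27 GCMacs corresponds to roughly 82 ms per product. Counting a complex multiply-accumulate as 4 real multiplies and 4 real adds, 1 GCMac is 8 GFLOP.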

Results:

Running with 1, 2, 4, 6, and 8 threads on both machines gives me the following performance numbers on the Sandy Bridge i7-2670QM (averaged over 10 runs, with an initial untimed run before timing):

MKL execution using 1 threads: 3.27 GCMacs
MKL execution using 2 threads: 8.32 GCMacs
MKL execution using 4 threads: 10.7 GCMacs
MKL execution using 6 threads: 11.3 GCMacs
MKL execution using 8 threads: 11.4 GCMacs

Running the same code on the Haswell i7-4800MQ gives the following:

MKL execution using 1 threads: 4.98 GCMacs
MKL execution using 2 threads: 6.87 GCMacs
MKL execution using 4 threads: 7.47 GCMacs
MKL execution using 6 threads: 7.47 GCMacs
MKL execution using 8 threads: 7.31 GCMacs

Please note that the Sandy Bridge machine has 20 GB of memory, whereas the Haswell machine has only 8 GB. However, the problem size is far below either figure.
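To put a number on "far below", here is a minimal sketch that computes the storage needed for the three matrices, assuming dense storage with no padding and 8-byte std::complex<float> elements. The names matrixBytes, Footprint, and cgemmFootprint are hypothetical and only serve the illustration.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>

// Bytes occupied by a dense rows x cols matrix of std::complex<float>.
inline std::size_t matrixBytes(std::size_t rows, std::size_t cols) {
    assert(rows > 0 && cols > 0);
    return rows * cols * sizeof(std::complex<float>);
}

// Storage for A (m x k), B (k x n), and C (m x n).
struct Footprint {
    std::size_t a;
    std::size_t b;
    std::size_t c;
    std::size_t total() const { return a + b + c; }
};

inline Footprint cgemmFootprint(std::size_t m, std::size_t n, std::size_t k) {
    return { matrixBytes(m, k), matrixBytes(k, n), matrixBytes(m, n) };
}
```

For m = 2^16, n = 256, k = 16 this gives 8 MiB for A, 32 KiB for B, and 128 MiB for C, about 136 MiB in total. That is tiny compared with 8 GB of RAM, but C alone is far larger than a last-level cache of a few megabytes, so the output has to stream through main memory on every call.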

On a 6-core Xeon E5-2620 I get numbers that peak at just above 20 GCMacs.

Can anybody explain the numbers I see? (Source code appended below)

Thanks

Henrik Andresen

Process function:

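#include <complex>
#include <mkl.h>

// Computes pBeams = pData * pWeights for row-major matrices:
// pData is nBins x nChannels, pWeights is nChannels x nBeams, pBeams is nBins x nBeams.
void processMKL( std::complex<float> * pBeams,
     const std::complex<float> * pData,
     const std::complex<float> * pWeights,
     int nBins,
     int nBeams,
     int nChannels ){

 // alpha = 1 and beta = 0: no scaling and no accumulation into the previous result.
 auto scalar = std::complex<float>( 1.0f, 0.0f );
 auto beta = std::complex<float>( 0.0f, 0.0f );
 auto m = nBins;     // rows of A and C
 auto n = nBeams;    // columns of B and C
 auto k = nChannels; // columns of A, rows of B
 // Row-major leading dimensions: lda = k, ldb = n, ldc = n.
 cblas_cgemm( CblasRowMajor, CBLAS_TRANSPOSE::CblasNoTrans, CBLAS_TRANSPOSE::CblasNoTrans,
     m, n, k, &scalar, pData, k, pWeights, n, &beta, pBeams, n );
}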

Iteration Function:

 auto vThreads = std::vector<int>( {1, 2, 4, 6, 8, 12, 16, 20, 24} );
 for( int threads : vThreads ){
  mkl_set_num_threads( threads );
  processMKL( mBeams.data(), mData.data(), mSteer.data(), nFftBins, nBeams, nChannels );
  start_time = std::chrono::high_resolution_clock::now();
  for( auto ite = 0u; ite < nIterations * mklScale; ++ite ){
   processMKL( mBeams.data(), mData.data(), mSteer.data(), nFftBins, nBeams, nChannels );
  }
  stop_time = std::chrono::high_resolution_clock::now();
  auto timeMKL = double( std::chrono::duration_cast<std::chrono::microseconds>(stop_time - start_time).count()) * 1e-6;

  cout << "MKL execution using " << threads << " threads: " << setprecision(3) << time2GCMac( timeMKL/static_cast<double>(mklScale)) << " GCMac" << endl;
 }

Kazushige_G_Intel · ‎02-23-2015

Hi,

There are several variants of the Xeon E5-2620 (no suffix, v2, or v3). Can you clarify which one you are using?

Also, can you run the same benchmark with the following environment variables set on each processor?

export KMP_AFFINITY=verbose,compact,1,0,granularity=fine

export OMP_NUM_THREADS=4

Thanks,

Kazushige Goto

Henrik_S_1 · ‎02-23-2015

Hi

Thank you for your fast reply. The Xeon CPU I compare with is a v2, so it does not support FMA.

Edit: Updated performance values: For the Sandy Bridge CPU I get

OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7}
OMP: Info #156: KMP_AFFINITY: 8 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 1 packages x 4 cores/pkg x 2 threads/core (4 total cores)
OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 0 core 0 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 1 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 0 core 1 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 2 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 5 maps to package 0 core 2 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 6 maps to package 0 core 3 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 7 maps to package 0 core 3 thread 1
OMP: Info #242: KMP_AFFINITY: pid 4348 thread 0 bound to OS proc set {0}
OMP: Info #242: KMP_AFFINITY: pid 4348 thread 1 bound to OS proc set {2}
OMP: Info #242: KMP_AFFINITY: pid 4348 thread 2 bound to OS proc set {4}
OMP: Info #242: KMP_AFFINITY: pid 4348 thread 3 bound to OS proc set {6}
MKL execution using 1 threads: 3.2 GCMac
MKL execution using 2 threads: 8 GCMac
MKL execution using 4 threads: 10.8 GCMac
MKL execution using 6 threads: 10.8 GCMac
MKL execution using 8 threads: 11.6 GCMac

The Haswell CPU now performs better and gives the following output:

OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7}
OMP: Info #156: KMP_AFFINITY: 8 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 1 packages x 4 cores/pkg x 2 threads/core (4 total cores)
OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 0 core 0 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 1 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 0 core 1 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 2 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 5 maps to package 0 core 2 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 6 maps to package 0 core 3 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 7 maps to package 0 core 3 thread 1
OMP: Info #242: KMP_AFFINITY: pid 6304 thread 0 bound to OS proc set {0}
OMP: Info #242: KMP_AFFINITY: pid 6304 thread 1 bound to OS proc set {2}
OMP: Info #242: KMP_AFFINITY: pid 6304 thread 2 bound to OS proc set {4}
OMP: Info #242: KMP_AFFINITY: pid 6304 thread 3 bound to OS proc set {6}
MKL execution using 1 threads: 5.37 GCMac
MKL execution using 2 threads: 7.16 GCMac
MKL execution using 4 threads: 7.16 GCMac
MKL execution using 6 threads: 10.7 GCMac
MKL execution using 8 threads: 10.7 GCMac

Summary:

Single-threaded performance on the Haswell is significantly better, but multithreaded performance is actually poorer. Is that because the algorithm is memory bound?

Extra Question:

I created my own code for matrix multiplication, which (for the single-threaded case) reaches more than 85% of MKL's performance. However, when I try to use FMA on the Haswell processor, my performance drops to 50%. Any idea why that is? The change is shown below:

inline void subProc( const __m256 & a1, const __m256 & bReal, const __m256 & bImag, __m256 & c )
{
 __m256 v0 = _mm256_shuffle_ps( a1, a1, 177 );
 __m256 v2 = _mm256_mul_ps( a1, bReal );
 __m256 v1 = _mm256_mul_ps( bImag, v0 );
 v2 = _mm256_addsub_ps( v2, v1 );
 c = _mm256_add_ps( v2, c );
}

Changed to:

inline void subProc_FMA( const __m256 & a1, const __m256 & bReal, const __m256 & bImag, __m256 & c )
{
 c = _mm256_fmaddsub_ps( a1, bReal, c );
 __m256 v0 = _mm256_shuffle_ps( a1, a1, 177 );
 c = _mm256_fmadd_ps( bImag, v0, c );
}

Why does it give a penalty? I would understand no improvement, but not a significant reduction. Am I doing something wrong?

Thanks again

Henrik

Kazushige_G_Intel · ‎02-24-2015

Hello,

Thank you very much for your update.

Since K (the number of columns in A) is very small and performance on both processors is similar, this is most likely a memory-bound operation, as you pointed out. Also, the blocking size for Haswell is smaller than for Sandy Bridge, so there is a slight performance penalty when handling 256 (the number of columns in B). If you set it to 192, performance should be slightly better.

About the extra question: on Haswell, FMA instructions have a latency of 5 cycles, whereas an add has a latency of only 3 cycles. As a result, the critical path of the first code is two add operations (6 cycles), while the critical path of the second code is two FMA operations (10 cycles). Performance will therefore be only about 60%.
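The 60% estimate follows from simple latency arithmetic. The sketch below uses the latencies quoted in this reply (3 cycles for an add, 5 for an FMA) and assumes the loop is bound purely by the dependency chain through the accumulator c; the function names are hypothetical.

```cpp
#include <cassert>

// Cycles spent on a dependency chain of `ops` back-to-back instructions,
// each of which must wait `latencyCycles` for its predecessor's result.
inline int chainCycles(int ops, int latencyCycles) {
    assert(ops > 0 && latencyCycles > 0);
    return ops * latencyCycles;
}

// Throughput of a candidate kernel relative to a baseline, assuming both
// are limited only by their critical-path length (an idealized model).
inline double relativeThroughput(int baselineCycles, int candidateCycles) {
    assert(baselineCycles > 0 && candidateCycles > 0);
    return static_cast<double>(baselineCycles) / candidateCycles;
}
```

With two adds the chain is chainCycles(2, 3) = 6 cycles, with two FMAs it is chainCycles(2, 5) = 10 cycles, and relativeThroughput(6, 10) gives 0.6. Separately, and independent of latency, `_mm256_fmaddsub_ps(a1, bReal, c)` computes a1*bReal - c in the even lanes, so the FMA variant as posted does not appear to be arithmetically equivalent to the original; that is worth checking before comparing timings.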

Thanks,

Kazushige Goto

Henrik_S_1 · ‎02-24-2015

Hi

Thank you for your reply. It really helps understanding my figures.

So, to be sure I understand correctly: I can't expect to see a performance increase in CGEMM when moving from AVX to AVX2-capable processors, aside from any increases in clock rate, cache size, etc.?

Thanks again

Henrik

Kazushige_G_Intel · ‎02-25-2015

Since your application is limited by main memory bandwidth, increasing processor performance will not improve performance very much. You need to find a system with better main memory bandwidth, or you need to increase K (the number of columns in A). If K is twice as large, performance will be twice as high.
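The effect of K can be illustrated with a rough arithmetic-intensity estimate. This is only a sketch under a simplifying assumption that is not stated in the thread: each of A, B, and C crosses the memory bus exactly once per product, ignoring caching details and write-allocate traffic. The function name cmacsPerByte is hypothetical.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>

// Complex multiply-accumulates performed per byte of memory traffic for
// C (m x n) = A (m x k) * B (k x n), assuming every matrix element is
// moved to or from main memory exactly once.
inline double cmacsPerByte(std::size_t m, std::size_t n, std::size_t k) {
    assert(m > 0 && n > 0 && k > 0);
    const double elem = static_cast<double>(sizeof(std::complex<float>));
    const double dm = static_cast<double>(m);
    const double dn = static_cast<double>(n);
    const double dk = static_cast<double>(k);
    const double macs = dm * dn * dk;
    const double bytes = elem * (dm * dk + dk * dn + dm * dn);
    return macs / bytes;
}
```

For the dimensions in this thread, K = 16 gives about 1.88 CMacs per byte and K = 32 about 3.55, a ratio of roughly 1.89. Under this model, a bandwidth-limited run would therefore come close to doubling its GCMacs figure when K doubles, because the traffic is dominated by writing C, which does not grow with K.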

Thanks,

Kazushige Goto