Re: Re: ITPP performance with MKL is worse compared to standard BLAS and LAPACK

Hany · 12-07-2020

Hi,

I've written a PHY stack that encodes and decodes WLAN 802.11 frames using the ITPP library. ITPP provides a set of classes for linear algebra and signal processing, and it can use either the standard BLAS and LAPACK libraries or more optimized implementations of them, such as Intel MKL.

I am now evaluating the performance of my implementation and trying different optimization options to reduce execution time. Using the standard BLAS and LAPACK libraries on Linux with FFTW 3.8, my program can generate and decode 1,000 packets in 180 seconds. However, when I switch to the latest version of Intel MKL, the execution time increases to 210 seconds, which is not what I expected. These are my current settings:

1. OS: Linux Ubuntu 20.04.1 LTS running as a guest VM in VMware Workstation on a Windows 10 host.
2. Processor Model: Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz
3. CPU assigned to the VM: 3 CPUs
4. All the libraries are built as shared ones. ITPP is in release mode and my application in debug mode.

I tried another setup on a much more powerful machine, and I still get worse performance with MKL than with standard BLAS and LAPACK. These are the settings for that setup:
1. OS: Native Linux Ubuntu 20.04.1 LTS
2. Processor Model: Intel(R) Core(TM) i9-10900X CPU @ 3.70GHz
3. CPUs: 20
4. All the libraries are built as shared ones. ITPP is in release mode and my application in debug mode.

In this setup, I achieve 155 seconds with standard BLAS and LAPACK compared to 183.5 seconds with MKL.

What could be the reason for worse performance with MKL despite using an Intel processor? It seems that ITPP is not making optimal use of the MKL APIs. Does anyone have experience with this, or know how I can find the source of the issue?
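One way to narrow this down is to time each stage of the pipeline separately and compare the per-stage totals between the reference BLAS build and the MKL build. The following is only a minimal sketch using the C++ standard library; `StageTimer` and the stage names passed to it are hypothetical and are not part of ITPP or MKL. Since the application is built in debug mode, the absolute numbers may also look different in a release build.

```cpp
#include <chrono>
#include <map>
#include <string>

// Hypothetical helper: accumulates wall-clock time per named stage so that
// two builds (reference BLAS vs. MKL) can be compared stage by stage.
class StageTimer {
public:
  // Runs `work` once and adds its duration to the running total for `stage`.
  template <typename F>
  void run(const std::string& stage, F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    totals_[stage] += std::chrono::duration<double>(stop - start).count();
  }

  // Total seconds recorded for `stage` (0 if the stage was never run).
  double seconds(const std::string& stage) const {
    const auto it = totals_.find(stage);
    return it == totals_.end() ? 0.0 : it->second;
  }

  // All recorded stages and their accumulated seconds.
  const std::map<std::string, double>& totals() const { return totals_; }

private:
  std::map<std::string, double> totals_;
};
```

Wrapping each step (for example encoding, channel, decoding) in `timer.run("encode", [&] { ... });` and printing `timer.totals()` at the end of both runs should show which stage accounts for the extra time.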

More information:

I am using the latest OneAPI Base toolkit version (2021.1).
g++ version (g++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0).
Each packet consists of 1000 bytes of payload. After encoding the packet using an LDPC encoder, the number of samples becomes 528256.

Gennady_F_Intel · 12-07-2020

Could you share the list of MKL routines you call and the typical problem sizes?

Hany · 12-08-2020

The ITPP library uses the following set of BLAS routines (Levels 1, 2, and 3):

https://sourceforge.net/p/itpp/git/ci/master/tree/itpp/base/blas.h

In my code, I do the following set of operations:

1. Concatenating two or three vectors. This method uses zcopy_ from BLAS.

2. Concatenating two matrices. This method uses zcopy_ from BLAS.

3. Zero padding, which also uses zcopy_ from BLAS.

4. I extensively use the LDPC encoding/decoding library which performs a lot of vector-vector and matrix-vector operations. I assume an implicit call to BLAS routines here.

So MKL should bring some benefit, since it is called in several places in the code.

At the same time, I use other functions that seem not to use any BLAS routines such as reshaping a vector into a matrix.
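To illustrate why these operations leave little room for a faster BLAS to help, here is a minimal sketch of concatenation and zero padding written with the C++ standard library only. It is not ITPP's implementation; `ComplexVec`, `concat`, and `zero_pad` are hypothetical names. A unit-stride zcopy_ does the same kind of work: it moves complex doubles from one buffer to another without any arithmetic.

```cpp
#include <algorithm>
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using ComplexVec = std::vector<std::complex<double>>;

// Concatenates two complex vectors: two contiguous copies, no arithmetic.
ComplexVec concat(const ComplexVec& a, const ComplexVec& b) {
  ComplexVec out(a.size() + b.size());
  std::copy(a.begin(), a.end(), out.begin());
  std::copy(b.begin(), b.end(),
            out.begin() + static_cast<std::ptrdiff_t>(a.size()));
  return out;
}

// Zero-pads `a` up to `total` elements; requires total >= a.size().
ComplexVec zero_pad(const ComplexVec& a, std::size_t total) {
  assert(total >= a.size());
  ComplexVec out(total);  // value-initialized, so every element starts at 0
  std::copy(a.begin(), a.end(), out.begin());
  return out;
}
```

Both functions are dominated by memory reads and writes, so their run time depends mainly on memory bandwidth rather than on how well the copy routine is tuned.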

Regarding problem size, it is hard to give an exact number but I will give some rough estimation.

1. The program generates bit vectors (8000 bits = 1000 bytes) using a random number generator from ITPP.

2. It performs several operations, including adding the preamble and header, encoding and modulating, code spreading, and symbol rotation, which produce a vector of complex doubles with 528256 elements.

3. The frame is transmitted through an AWGN channel, which adds white Gaussian noise to each element (symbol).

4. The exact opposite operations are then performed to decode and extract the originally transmitted bits.

5. Finally, the transmitted and received data bits are compared to check whether the frame was decoded successfully.
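The channel and comparison steps can be sketched with the C++ standard library alone. This is only an illustration under stated assumptions, not the ITPP code used in the program: `add_awgn` and `bit_errors` are hypothetical names, the noise is assumed to be circularly symmetric with total variance `n0` (which must be positive), and bits are assumed to be stored as `int` values.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <random>
#include <vector>

// Adds complex Gaussian noise with total variance n0 (n0 / 2 per real and
// imaginary part) to every symbol. Requires n0 > 0.
std::vector<std::complex<double>> add_awgn(
    const std::vector<std::complex<double>>& symbols, double n0,
    std::mt19937_64& rng) {
  std::normal_distribution<double> gauss(0.0, std::sqrt(n0 / 2.0));
  std::vector<std::complex<double>> out;
  out.reserve(symbols.size());
  for (const auto& s : symbols) {
    const double re = gauss(rng);
    const double im = gauss(rng);
    out.emplace_back(s.real() + re, s.imag() + im);
  }
  return out;
}

// Counts the positions where transmitted and received bits differ.
// Both sequences must have the same length.
std::size_t bit_errors(const std::vector<int>& tx, const std::vector<int>& rx) {
  assert(tx.size() == rx.size());
  std::size_t errors = 0;
  for (std::size_t i = 0; i < tx.size(); ++i) {
    if (tx[i] != rx[i]) ++errors;
  }
  return errors;
}
```

A frame counts as successfully decoded when `bit_errors(tx_bits, rx_bits)` returns 0. Neither function performs BLAS-style matrix work, so these steps are unlikely to benefit from a different BLAS library.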

Gennady_F_Intel · 12-09-2020

If it is true that you achieved 155 sec with standard BLAS vs. 183.5 sec when linked against MKL, then let's compare the performance case by case.

I am not sure how well zcopy (BLAS Level 1) is optimized, as it is mostly a memory-bandwidth-bound operation, but please give us some BLAS Level 3 or LAPACK comparison examples.

Hany · 12-14-2020

I wrote a small program that multiplies two complex matrices of size 1000x1000 and repeats that 100 times. Matrix multiplication in ITPP uses BLAS-3 functions. Indeed, with Intel MKL I get a much lower time than with standard BLAS. Here is my code:

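#include <cstdint>
#include <iostream>

#include "itpp/base/mat.h"
#include "itpp/base/random.h"
#include "itpp/itstat.h"

using namespace itpp;

int
main (int argc, char *argv[])
{
  // Values as posted: 10000 x 10000 matrices and 2 repetitions (the larger run).
  // The first run described above used size = 1000 and num_rep = 100.
  uint32_t size = 10000;
  uint32_t num_rep = 2;

  // Two random complex matrices (complex Gaussian entries)
  cmat mat1 = randn_c (size, size);
  cmat mat2 = randn_c (size, size);

  // Declare tt as an instance of the timer class:
  Real_Timer tt;

  // Start and reset the timer:
  tt.tic();

  for (uint32_t i = 0; i < num_rep; i++)
    {
      std::cout << "Multiplying sample = " << i + 1 << std::endl;
      // Store the product in a named matrix instead of discarding the expression
      cmat prod = mat1 * mat2;
    }

  // Average elapsed wall-clock time per multiplication, in seconds
  double av_time = tt.get_time () / num_rep;

  std::cout << "average time = " << av_time << std::endl;

  return 0;
}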

Here are the average results in seconds to complete a single matrix multiplication:

Intel MKL: 0.336063
Standard BLAS: 2.66953

Also, with Intel MKL, all the CPUs assigned to my VM are utilized. For these reports, I use 3 CPU cores.

I repeated the experiment with complex matrices of size 10000x10000. MKL still outperforms standard BLAS by a wide margin.

Intel MKL: 305.808
Standard BLAS: 2693.19
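To put the timings above in perspective, they can be converted into an effective arithmetic rate and a speedup factor. The sketch below assumes the usual counting convention of 8 real floating-point operations per complex multiply-add, so roughly 8n^3 operations for an n x n product, and it assumes the first pair of timings belongs to the 1000x1000 case as described. The function names are hypothetical.

```cpp
// Effective GFLOP/s of an n x n complex matrix product that took `seconds`,
// counting 8 real operations per complex multiply-add (8 * n^3 in total).
double zgemm_gflops(double n, double seconds) {
  return 8.0 * n * n * n / seconds / 1e9;
}

// How many times faster the MKL run is than the reference BLAS run.
double speedup(double reference_seconds, double mkl_seconds) {
  return reference_seconds / mkl_seconds;
}
```

With the numbers reported here, `speedup(2.66953, 0.336063)` is about 7.9 and `speedup(2693.19, 305.808)` is about 8.8. Under the stated counting convention, that corresponds to roughly 24 vs. 3 GFLOP/s for the 1000x1000 case and roughly 26 vs. 3 GFLOP/s for the 10000x10000 case.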

So it seems that ITPP is using Intel MKL correctly in my case. I will write a new set of programs that run BLAS-1 and BLAS-2 operations and compare standard BLAS with Intel MKL, since my 802.11 generator/decoder program spends most of its time in BLAS-1 operations, as I mentioned before.

Gennady_F_Intel · 12-14-2020

BLAS Level 1 routines are not as heavily optimized as BLAS-2 or BLAS-3 routines, since they are mostly memory-bandwidth-limited operations. I didn't check, but I would expect the performance of MKL and reference BLAS to be about the same on the same system.
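A rough way to check the memory-bandwidth argument is to measure how fast a plain copy of a frame-sized vector runs. A vector of 528256 complex doubles occupies 528256 x 16 = 8,452,096 bytes (assuming 8-byte doubles), and copying it reads and writes that amount once each. The sketch below uses only the C++ standard library, with `std::copy` standing in for zcopy_; the function names are hypothetical and the measured figure will vary with the machine and build settings.

```cpp
#include <algorithm>
#include <chrono>
#include <complex>
#include <cstddef>
#include <vector>

// Bytes read plus bytes written when copying n complex doubles once.
std::size_t copy_traffic_bytes(std::size_t n) {
  return 2 * n * sizeof(std::complex<double>);
}

// Copies an n-element complex vector `reps` times and returns the effective
// memory traffic in GB/s. Requires n >= 1 and reps >= 1.
double measure_copy_gbps(std::size_t n, int reps) {
  std::vector<std::complex<double>> src(n, std::complex<double>(1.0, -1.0));
  std::vector<std::complex<double>> dst(n);
  volatile double sink = 0.0;  // keeps the copied data observably used

  const auto start = std::chrono::steady_clock::now();
  for (int r = 0; r < reps; ++r) {
    // Change the input each round so repeated copies are not redundant.
    src[0] = std::complex<double>(static_cast<double>(r), 0.0);
    std::copy(src.begin(), src.end(), dst.begin());
    sink = sink + dst[n - 1].real();
  }
  const auto stop = std::chrono::steady_clock::now();

  const double seconds = std::chrono::duration<double>(stop - start).count();
  return static_cast<double>(reps) *
         static_cast<double>(copy_traffic_bytes(n)) / seconds / 1e9;
}
```

If `measure_copy_gbps(528256, 1000)` already comes close to what the memory system can sustain, a differently tuned zcopy_ has little room to be faster, which would be consistent with the expectation stated above.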
