Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

Performance of cblas_ddot when incx > 1

DmitryB1
Beginner

We are using MKL in NumPy. We noticed that the performance of cblas_ddot (running on a single thread) depends **significantly** on the values of incx and incy. We were able to write simple C code that runs up to 2x faster than cblas_ddot when incx and incy > 1. We think there is a bug in the MKL code.

 

Example

We have two 3-dimensional arrays, x_dgt and y_dgt, each of shape (100, 70, 144). We measure the performance of a vectorized dot operation over each of the 3 axes:

  • t: 'dgt,dgt->dg'
  • g: 'dgt,dgt->dt'
  • d: 'dgt,dgt->gt'
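For reference, here is a minimal sketch of how the three contractions map to the increments passed to cblas_ddot. It assumes the arrays are C-contiguous (row-major), which the post does not state; the helper name axis_stride is hypothetical. Under that assumption, contracting over t uses a stride of 1, over g a stride of 144, and over d a stride of 70 * 144 = 10080 elements.

```c
#include <assert.h>
#include <stddef.h>

/* Element stride of `axis` in a C-contiguous (row-major) array.
   Assumption: the array is contiguous; a NumPy view may have other strides. */
static size_t axis_stride(const size_t *shape, int ndim, int axis)
{
    assert(axis >= 0 && axis < ndim);
    size_t stride = 1;
    /* The stride is the product of the extents of all later axes. */
    for (int k = axis + 1; k < ndim; ++k)
        stride *= shape[k];
    return stride;
}
```

The returned value would be used as both incx and incy when x_dgt and y_dgt share the same layout.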

To compute the dot product we use either cblas_ddot or a custom implementation of ddot that essentially unrolls the loop in blocks of 8 elements and assumes that the compiler's -O3 option will replace the unrolled loop with AVX instructions. The code is attached.
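A minimal sketch of what such an unrolled strided dot product could look like. This is not the attached code: the name strided_dot_unroll8 is hypothetical, and the sketch assumes positive increments only. It keeps 8 independent accumulators so the compiler is free to vectorize the block.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical strided dot product, unrolled in blocks of 8 elements.
   Assumption: incx and incy are positive element strides. */
static double strided_dot_unroll8(size_t n, const double *x, size_t incx,
                                  const double *y, size_t incy)
{
    assert(incx > 0 && incy > 0);
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    double s4 = 0.0, s5 = 0.0, s6 = 0.0, s7 = 0.0;
    size_t i = 0;

    /* Main loop: 8 elements per iteration, one accumulator per lane. */
    for (; i + 8 <= n; i += 8) {
        const double *px = x + i * incx;
        const double *py = y + i * incy;
        s0 += px[0]        * py[0];
        s1 += px[incx]     * py[incy];
        s2 += px[2 * incx] * py[2 * incy];
        s3 += px[3 * incx] * py[3 * incy];
        s4 += px[4 * incx] * py[4 * incy];
        s5 += px[5 * incx] * py[5 * incy];
        s6 += px[6 * incx] * py[6 * incy];
        s7 += px[7 * incx] * py[7 * incy];
    }

    double sum = ((s0 + s1) + (s2 + s3)) + ((s4 + s5) + (s6 + s7));

    /* Tail: the remaining n % 8 elements. */
    for (; i < n; ++i)
        sum += x[i * incx] * y[i * incy];
    return sum;
}
```

Because the partial sums are combined in a different order than a plain sequential loop, results may differ from cblas_ddot in the last bits.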

 

cblas_dot over t:  560.7 us
   my_dot over t:  674.0 us
cblas_dot over g: 1113.4 us
   my_dot over g:  562.4 us
cblas_dot over d: 1277.4 us
   my_dot over d:  747.0 us

 

As you can see, our simple code runs faster than cblas_ddot when incx, incy > 1.

 

Attached code

We use gcc to compile the code. Here is the command:

gcc mkl_dot.c -DMKL_ILP64 -m64 -I"/opt/miniconda3/include" -L/opt/miniconda3/lib -Wl,--no-as-needed -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl -O3 -o mkl_dot.o

 

ShanmukhS_Intel
Moderator

Hi,


Thank you for posting on Intel Communities.


Could you please share your environment details, such as software versions, so that we can look into your issue further?


Best Regards,

Shanmukh.SS


DmitryB1
Beginner

Sure, here they are:

  • CentOS 7
  • kernel 3.10.0-1062.18.1.el7.x86_64
  • gcc 7.3.1 20180303 (Red Hat 7.3.1-5)
  • MKL 2021.4.0
  • CPU: Intel(R) Xeon(R) Gold 6240R CPU @ 2.40GHz

Is there anything else you would like to know about my environment?

Best,
Dmitry.

 

DmitryB1
Beginner

We execute the code above on a single thread. To this end we set "export OMP_NUM_THREADS=1" and "export MKL_NUM_THREADS=1" in the terminal.

ShanmukhS_Intel
Moderator

Hi,

 

Please note that performance can vary with the use case, configuration, and other factors, such as:

 

1. Instruction set (MKL uses AVX2/AVX512, but reproducer uses SSE2)

2. FMA usage (MKL uses FMA, but the reproducer uses MUL + ADD), or the use of fused instructions (load + FP instructions)

3. Unroll type

4. Frequency

 

We will get back to you soon with an update regarding the progress.

 

Best Regards,

Shanmukh.SS

 

ShanmukhS_Intel
Moderator

Hi Dmitry,


Thanks for reporting this issue. We were able to reproduce it and have informed the development team.


Best Regards,

Shanmukh.SS


ShanmukhS_Intel
Moderator

Hi Dmitry,


Thank you for your patience. The issue you raised has been fixed in the 2023.0 version. Please download it and let us know if this resolves your issue.


Best Regards,

Shanmukh.SS


ShanmukhS_Intel
Moderator

Hi Dmitry,


We assume that your issue is resolved. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.


Have a great day!


Best Regards,

Shanmukh.SS

