Performance of cblas_ddot when incx > 1

DmitryB1 · ‎09-14-2022

We are using MKL in NumPy. We noticed that the performance of cblas_ddot (running on a single thread) **significantly** depends on the values of incx and incy. We were able to write simple C code that runs up to 2x faster than cblas_ddot when incx and incy > 1. We think there is a bug in the MKL code.

Example

We have two 3-dimensional arrays, x_dgt and y_dgt, each of shape (100, 70, 144). We measure the performance of a vectorized dot operation over each of the 3 axes:

t: 'dgt,dgt->dg'
g: 'dgt,dgt->dt'
d: 'dgt,dgt->gt'
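
For reference, here is how the three cases map onto the n and inc arguments of ddot. This is a minimal sketch that assumes both arrays are C-contiguous (row-major) arrays of doubles, which the post does not state explicitly. The names dot_params, params_for_axis and ref_dot are hypothetical and do not come from the attached code.

```c
#include <stddef.h>

/* Shape from the post: (d, g, t) = (100, 70, 144). */
enum { D = 100, G = 70, T = 144 };

/* Assumption: C-contiguous layout, so element (d, g, t) sits at
   offset (d*G + g)*T + t. Strides are counted in elements. */
typedef struct { int n; int inc; } dot_params;

/* axis is the axis being summed over: 0 = d, 1 = g, 2 = t. */
dot_params params_for_axis(int axis)
{
    dot_params p = {0, 0};
    switch (axis) {
    case 2: p.n = T; p.inc = 1;     break; /* 'dgt,dgt->dg' */
    case 1: p.n = G; p.inc = T;     break; /* 'dgt,dgt->dt' */
    case 0: p.n = D; p.inc = G * T; break; /* 'dgt,dgt->gt' */
    }
    return p;
}

/* Plain strided dot product, useful as a correctness reference. */
double ref_dot(int n, const double *x, int incx,
               const double *y, int incy)
{
    double s = 0.0;
    for (int i = 0; i < n; ++i)
        s += x[(size_t)i * incx] * y[(size_t)i * incy];
    return s;
}
```

Under that layout assumption, only the t case is unit-stride (inc = 1). The g and d cases use strides of 144 and 10080 elements.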

To compute the dot product we use either cblas_ddot or a custom implementation of ddot that essentially unrolls the loop in blocks of 8 elements and assumes that the compiler's -O3 option will replace the unrolled loop with AVX instructions. The code is attached.
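
Since the attachment is not reproduced in the thread, here is a minimal sketch of what a strided dot product unrolled in blocks of 8 might look like. It is not the attached my_dot. The name my_dot_sketch is hypothetical, and the sketch assumes n >= 0 and positive strides.

```c
#include <stddef.h>

/* Hypothetical stand-in for the attached code, not the author's version.
   Assumes n >= 0, incx > 0 and incy > 0 (strides in elements). */
double my_dot_sketch(ptrdiff_t n, const double *x, ptrdiff_t incx,
                     const double *y, ptrdiff_t incy)
{
    /* Eight independent accumulators break the dependency chain
       that a single running sum would create. */
    double acc[8] = {0.0};
    ptrdiff_t i = 0;

    /* Main loop: process 8 elements per iteration. The inner loop has a
       fixed trip count, so an optimizing compiler can unroll it. */
    for (; i + 8 <= n; i += 8) {
        const double *px = x + i * incx;
        const double *py = y + i * incy;
        for (int k = 0; k < 8; ++k)
            acc[k] += px[k * incx] * py[k * incy];
    }

    /* Tail loop: the remaining n % 8 elements. */
    double tail = 0.0;
    for (; i < n; ++i)
        tail += x[i * incx] * y[i * incy];

    /* Combine the partial sums. */
    double s = tail;
    for (int k = 0; k < 8; ++k)
        s += acc[k];
    return s;
}
```

Because the partial sums are combined in a different order than a plain sequential loop, the result can differ from a reference implementation in the last few bits for general floating-point inputs.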

cblas_dot over t: 560.7 us
my_dot over t: 674.0 us
cblas_dot over g: 1113.4 us
my_dot over g: 562.4 us
cblas_dot over d: 1277.4 us
my_dot over d: 747.0 us

As you can see, our simple code runs faster than cblas_ddot when incx, incy > 1 (the g and d cases), while cblas_ddot is faster in the unit-stride t case.

Attached code

We use gcc to compile the code. Here is the command line:

gcc mkl_dot.c -DMKL_ILP64 -m64 -I"/opt/miniconda3/include" -L/opt/miniconda3/lib -Wl,--no-as-needed -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl -O3 -o mkl_dot.o

ShanmukhS_Intel · ‎09-19-2022

Hi,

Thank you for posting on Intel Communities.

Could you please share your environment details, such as software versions, so that we can look into your issue further?

Best Regards,

Shanmukh.SS

DmitryB1 · ‎09-19-2022

Sure, here it is

CentOS 7
kernel 3.10.0-1062.18.1.el7.x86_64
gcc 7.3.1 20180303 (Red Hat 7.3.1-5)
MKL 2021.4.0
CPU: Intel(R) Xeon(R) Gold 6240R CPU @ 2.40GHz

Is there anything else you would like to know about my environment?

Best,
Dmitry.

DmitryB1 · ‎09-19-2022

We execute the code above on a single thread. To this end we set "export OMP_NUM_THREADS=1" and "export MKL_NUM_THREADS=1" in the terminal.

ShanmukhS_Intel · ‎09-20-2022

Hi,

We would like to inform you that performance can vary with usage, configuration and other factors, for example:

1. Instruction set (MKL uses AVX2/AVX512, but the reproducer uses SSE2)

2. FMA (MKL uses FMA, but the reproducer uses MUL + ADD), or the use of fused instructions (load + FP instructions)

3. Unroll type

4. Frequency
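
Points 1 and 2 can be checked on the reproducer side. The sketch below assumes GCC or Clang, which predefine macros such as __AVX2__ and __FMA__ when the corresponding instruction sets are enabled. The posted gcc command uses -O3 with no -march or -mavx flag, so on x86-64 the compiler targets the baseline SSE2 instruction set. The function names are illustrative only.

```c
#include <stddef.h>

/* Report the widest x86 SIMD level the compiler was allowed to use
   for this translation unit (GCC/Clang predefined macros). */
const char *compiled_simd_level(void)
{
#if defined(__AVX512F__)
    return "AVX-512F";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE2__)
    return "SSE2";
#else
    return "unknown";
#endif
}

/* Return 1 if FMA instructions were enabled at compile time
   (for example with -mfma or a suitable -march), otherwise 0. */
int compiled_with_fma(void)
{
#ifdef __FMA__
    return 1;
#else
    return 0;
#endif
}
```

Printing both values from the reproducer's main shows which code path the compiler could generate. Adding -march=native (or -mavx2 -mfma) to the gcc command would allow wider vectors and FMA. Whether that changes the timings reported above is not established in this thread.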

We will get back to you soon with an update regarding the progress.

Best Regards,

Shanmukh.SS

ShanmukhS_Intel · ‎09-22-2022

Hi Dmitry,

Thanks for reporting this issue. We were able to reproduce it and we have informed the development team regarding the same.

Best Regards,

Shanmukh.SS

ShanmukhS_Intel · ‎12-22-2022

Hi Dmitry,

Thank you for your patience. The issue you raised has been fixed in the 2023.0 version. Please download it and let us know if this resolves your issue.

Best Regards,

Shanmukh.SS

ShanmukhS_Intel · ‎12-29-2022

Hi Dmitry,

We assume that your issue is resolved. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.

Have a great day!

Best Regards,

Shanmukh.SS