We are using MKL in NumPy. We noticed that the performance of cblas_ddot (running on a single thread) **significantly** depends on the values of incx and incy. We were able to write simple C code that runs up to 2x faster than cblas_ddot when incx and incy > 1. We think that there is a bug in the MKL code.
We have two 3-dimensional arrays, x_dgt and y_dgt, of shape (100, 70, 144). We measure the performance of a vectorized dot operation over each of the 3 axes:
- t: 'dgt,dgt->dg'
- g: 'dgt,dgt->dt'
- d: 'dgt,dgt->gt'
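For readers who want to see where incx and incy come from, here is a minimal sketch of how the three reductions map to strided dot products. It assumes both arrays are C-contiguous (row-major), which the post does not state explicitly. The names strided_dot, dot_over_t, dot_over_g and dot_over_d are made up for illustration; strided_dot is a plain reference loop standing in for cblas_ddot(n, x, incx, y, incy).

```c
#include <stddef.h>

/* Shape from the post; the row-major layout is an assumption. */
enum { D = 100, G = 70, T = 144 };

/* Reference strided dot product (hypothetical helper) with the same
   argument order as cblas_ddot; positive increments only. */
static double strided_dot(size_t n, const double *x, size_t incx,
                          const double *y, size_t incy)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i)
        acc += x[i * incx] * y[i * incy];
    return acc;
}

/* 'dgt,dgt->dg': reduce over t, n = T, inc = 1 */
void dot_over_t(const double *x, const double *y, double *out)
{
    for (size_t d = 0; d < D; ++d)
        for (size_t g = 0; g < G; ++g) {
            size_t base = (d * G + g) * T;
            out[d * G + g] = strided_dot(T, x + base, 1, y + base, 1);
        }
}

/* 'dgt,dgt->dt': reduce over g, n = G, inc = T = 144 */
void dot_over_g(const double *x, const double *y, double *out)
{
    for (size_t d = 0; d < D; ++d)
        for (size_t t = 0; t < T; ++t) {
            size_t base = d * G * T + t;
            out[d * T + t] = strided_dot(G, x + base, T, y + base, T);
        }
}

/* 'dgt,dgt->gt': reduce over d, n = D, inc = G * T = 10080 */
void dot_over_d(const double *x, const double *y, double *out)
{
    for (size_t g = 0; g < G; ++g)
        for (size_t t = 0; t < T; ++t) {
            size_t base = g * T + t;
            out[g * T + t] = strided_dot(D, x + base, G * T, y + base, G * T);
        }
}
```

Under this layout only the reduction over t touches contiguous memory; the other two step through memory 144 or 10080 doubles at a time.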
To compute the dot product we use either cblas_ddot or a custom implementation of ddot that essentially unrolls the loop in blocks of 8 elements and assumes that the compiler's -O3 option will replace the unrolled loop with AVX instructions. The code is attached.
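Since the attachment is not reproduced in this thread, here is a rough sketch of what a strided dot product unrolled in blocks of 8 might look like. It is not the attached code: the name my_ddot_unrolled, the use of 8 separate accumulators, and the restriction to positive increments are all assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical sketch, not the attached code: strided dot product with the
   loop body unrolled in blocks of 8 and one accumulator per lane. */
double my_ddot_unrolled(size_t n, const double *x, size_t incx,
                        const double *y, size_t incy)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    double s4 = 0.0, s5 = 0.0, s6 = 0.0, s7 = 0.0;
    size_t i = 0;

    /* Main loop: 8 independent multiply-adds per iteration. */
    for (; i + 8 <= n; i += 8) {
        const double *px = x + i * incx;
        const double *py = y + i * incy;
        s0 += px[0 * incx] * py[0 * incy];
        s1 += px[1 * incx] * py[1 * incy];
        s2 += px[2 * incx] * py[2 * incy];
        s3 += px[3 * incx] * py[3 * incy];
        s4 += px[4 * incx] * py[4 * incy];
        s5 += px[5 * incx] * py[5 * incy];
        s6 += px[6 * incx] * py[6 * incy];
        s7 += px[7 * incx] * py[7 * incy];
    }

    /* Remainder loop for the last n % 8 elements. */
    double tail = 0.0;
    for (; i < n; ++i)
        tail += x[i * incx] * y[i * incy];

    return ((s0 + s1) + (s2 + s3)) + ((s4 + s5) + (s6 + s7)) + tail;
}
```

Whether gcc actually turns such a loop into SIMD instructions depends on the target flags and on the strides; that has not been checked here.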
cblas_dot over t: 560.7 us
my_dot over t: 674.0 us
cblas_dot over g: 1113.4 us
my_dot over g: 562.4 us
cblas_dot over d: 1277.4 us
my_dot over d: 747.0 us
As you can see, our simple code runs faster than cblas_ddot when incx, incy > 1.
We use gcc to compile the code. Here is the command line:
gcc mkl_dot.c -DMKL_ILP64 -m64 -I"/opt/miniconda3/include" -L/opt/miniconda3/lib -Wl,--no-as-needed -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl -O3 -o mkl_dot.o
Thank you for posting on Intel Communities.
Could you please share your environment details, such as software versions, so that we can look into your issue further?
Sure, here it is:
- CentOS 7
- kernel 3.10.0-1062.18.1.el7.x86_64
- gcc 7.3.1 20180303 (Red Hat 7.3.1-5)
- MKL 2021.4.0
- CPU: Intel(R) Xeon(R) Gold 6240R CPU @ 2.40GHz
Is there anything else you would like to know about my environment?
We would like to inform you that performance can vary depending on factors such as usage and configuration. Factors that may differ between MKL and the reproducer include:
1. Instruction set (MKL uses AVX2/AVX512, but reproducer uses SSE2)
2. MKL uses FMA, whereas the reproducer uses MUL + ADD; the use of fused instructions (load + FP operation) may also differ.
3. Unroll type
We will get back to you soon with an update regarding the progress.
Thank you for your patience. The issue you raised has been fixed in version 2023.0. Please download it and let us know if this resolves your issue.
We assume that your issue is resolved. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.
Have a great day!