We are using MKL in NumPy. We noticed that the performance of cblas_ddot (running on a single thread) **significantly** depends on the values of incx and incy. We were able to write simple C code that runs up to 2x faster than cblas_ddot when incx and incy > 1. We think there is a bug in the MKL code.
Example
We have two 3-dimensional arrays, x_dgt and y_dgt, of shape (100, 70, 144). We measure the performance of a vectorized dot operation over each of the three axes:
- t: 'dgt,dgt->dg'
- g: 'dgt,dgt->dt'
- d: 'dgt,dgt->gt'
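To see why the reduction axis determines incx and incy, here is a minimal sketch, assuming both arrays are C-contiguous arrays of doubles (the NumPy default; the post does not state the layout). The helper names `axis_stride` and `print_increments` are hypothetical. The sketch computes the element stride of each axis of a (100, 70, 144) array, which is the increment a ddot over that axis would be called with.

```c
#include <stddef.h>
#include <stdio.h>

/* Element stride of `axis` in a C-contiguous (row-major) array:
 * the product of the extents of all later axes.
 * Assumption: the arrays in the post use this layout. */
size_t axis_stride(const size_t *shape, size_t ndim, size_t axis)
{
    size_t stride = 1;
    for (size_t k = axis + 1; k < ndim; ++k)
        stride *= shape[k];
    return stride;
}

/* Print the vector length and increment of a dot product over each axis. */
void print_increments(void)
{
    const size_t shape[3] = {100, 70, 144};   /* (d, g, t) from the post */
    const char names[3] = {'d', 'g', 't'};
    for (size_t axis = 0; axis < 3; ++axis)
        printf("dot over %c: n = %zu, incx = incy = %zu\n",
               names[axis], shape[axis], axis_stride(shape, 3, axis));
}
```

Under that layout assumption the increments are 1 over t, 144 over g, and 10080 over d, so only the t case reads contiguous memory.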
To compute the dot product we use either cblas_ddot or a custom implementation of ddot that essentially unrolls the loop in blocks of 8 elements and assumes that the compiler's -O3 option will replace the unrolled loop with AVX instructions. The code is attached.
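The attachment is not reproduced in the thread, so the following is only a sketch of what such an unrolled, strided dot product might look like, not the poster's actual code. The name `my_ddot` and its signature are assumptions, and the sketch handles positive increments only, whereas cblas_ddot also accepts negative ones.

```c
#include <stddef.h>

/* Strided dot product, unrolled in blocks of 8 elements.
 * Hypothetical reconstruction; assumes incx >= 1 and incy >= 1. */
double my_ddot(size_t n, const double *x, size_t incx,
               const double *y, size_t incy)
{
    /* Eight independent accumulators, so the additions do not form
     * one long dependency chain and the compiler is free to vectorize. */
    double acc[8] = {0.0};
    size_t i = 0;

    for (; i + 8 <= n; i += 8) {
        const double *xp = x + i * incx;
        const double *yp = y + i * incy;
        acc[0] += xp[0 * incx] * yp[0 * incy];
        acc[1] += xp[1 * incx] * yp[1 * incy];
        acc[2] += xp[2 * incx] * yp[2 * incy];
        acc[3] += xp[3 * incx] * yp[3 * incy];
        acc[4] += xp[4 * incx] * yp[4 * incy];
        acc[5] += xp[5 * incx] * yp[5 * incy];
        acc[6] += xp[6 * incx] * yp[6 * incy];
        acc[7] += xp[7 * incx] * yp[7 * incy];
    }

    /* Combine the partial sums pairwise. */
    double sum = ((acc[0] + acc[1]) + (acc[2] + acc[3]))
               + ((acc[4] + acc[5]) + (acc[6] + acc[7]));

    /* Remainder loop for the last n % 8 elements. */
    for (; i < n; ++i)
        sum += x[i * incx] * y[i * incy];

    return sum;
}
```

Because the partial sums are accumulated in a different order than a plain left-to-right loop, results can differ from cblas_ddot in the last bits for general floating-point input.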
cblas_dot over t: 560.7 us
my_dot over t: 674.0 us
cblas_dot over g: 1113.4 us
my_dot over g: 562.4 us
cblas_dot over d: 1277.4 us
my_dot over d: 747.0 us
As you can see, our simple code runs faster than cblas_ddot when incx, incy > 1 (the g and d cases).
Attached code
We use gcc to compile the code. Here is the command line:
gcc mkl_dot.c -DMKL_ILP64 -m64 -I"/opt/miniconda3/include" -L/opt/miniconda3/lib -Wl,--no-as-needed -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl -O3 -o mkl_dot.o
Hi,
Thank you for posting on Intel Communities.
Could you please share your environment details, such as software versions, so that we can look into your issue further?
Best Regards,
Shanmukh.SS
Sure, here it is
- CentOS 7
- kernel 3.10.0-1062.18.1.el7.x86_64
- gcc 7.3.1 20180303 (Red Hat 7.3.1-5)
- MKL 2021.4.0
- CPU: Intel(R) Xeon(R) Gold 6240R CPU @ 2.40GHz
Is there anything else you would like to know about my environment?
Best,
Dmitry.
We execute the code above on a single thread. To this end we set "export OMP_NUM_THREADS=1" and "export MKL_NUM_THREADS=1" in the terminal.
Hi,
Please note that performance can vary with usage, configuration, and other factors, for example:
1. Instruction set (MKL uses AVX2/AVX512, but reproducer uses SSE2)
2. MKL uses FMA, but the reproducer uses MUL + ADD, or fused instructions (load + FP instructions).
3. Unroll type
4. Frequency
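Points 1 and 2 can be checked on the reproducer side. The following is a minimal sketch, assuming gcc or a compatible compiler that defines the predefined macros `__AVX512F__`, `__AVX2__`, `__AVX__`, `__SSE2__`, and `__FMA__`; the function names are hypothetical. It reports which instruction set the compiler was allowed to target when building the file.

```c
#include <stdio.h>

/* Widest vector instruction set this file was compiled for, according
 * to the compiler's predefined macros (gcc/clang convention). */
const char *compiled_isa(void)
{
#if defined(__AVX512F__)
    return "AVX-512";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE2__)
    return "SSE2";
#else
    return "unknown";
#endif
}

/* 1 if the compiler may emit fused multiply-add instructions, else 0. */
int compiled_with_fma(void)
{
#if defined(__FMA__)
    return 1;
#else
    return 0;
#endif
}

/* Print a one-line summary of the build target. */
void print_isa(void)
{
    printf("vector ISA: %s, FMA: %s\n",
           compiled_isa(), compiled_with_fma() ? "yes" : "no");
}
```

With plain `gcc -O3` on x86-64, as in the compile command in the first post, this typically reports SSE2 without FMA; adding `-march=native` (or `-mavx2 -mfma`) lets gcc use the wider instructions.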
We will get back to you soon with an update regarding the progress.
Best Regards,
Shanmukh.SS
Hi Dmitry,
Thanks for reporting this issue. We were able to reproduce it and have informed the development team.
Best Regards,
Shanmukh.SS
Hi Dmitry,
Thank you for your patience. The issue you raised has been fixed in version 2023.0. Please download it and let us know if this resolves your issue.
Best Regards,
Shanmukh.SS
Hi Dmitry,
We assume that your issue is resolved. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.
Have a great day!
Best Regards,
Shanmukh.SS
