Solved: problems with cblas_daxpy and cblas_dscal

rlegarda · ‎12-26-2021

I found some problems working with the cblas_daxpy and cblas_dscal routines in libMKL. I wrote a C program that works with a large number of variables (the vector size is 1300²). I use only BLAS1 routines, and everything works fine when I use OpenBLAS and link with -lcblas, but when I switch to -lmkl_rt it doesn't work. My program uses only the dcopy, ddot, dscal and daxpy routines, so I checked every call until I found that the problems come from the daxpy and dscal routines. I tested these routines in a toy example with libMKL and they work fine, but inside my program they don't.
In addition, I ran another test: I left -lcblas on the link line but switched between OpenBLAS and libMKL using update-alternatives. Both work fine, but libMKL was evidently slower than OpenBLAS.

I'm working on a Lenovo W540 with an Intel® Core™ i7-4700MQ CPU @ 2.40GHz × 8, running Debian buster with
GNU g++ 8.3.0
OpenBLAS 0.3.5+ds-3
libMKL 2019.2.187-1
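
For readers debugging a similar case: one way to check whether daxpy and dscal themselves are misbehaving is to compare the library output against a plain-C reference on the same data. The sketch below is not taken from the original program; `ref_daxpy`, `ref_dscal` and `max_abs_diff` are hypothetical helper names, and only positive strides are handled.

```c
#include <assert.h>
#include <stddef.h>

/* Reference y := alpha*x + y, mirroring daxpy for positive strides only. */
static void ref_daxpy(int n, double alpha, const double *x, int incx,
                      double *y, int incy)
{
    assert(incx > 0 && incy > 0);
    for (int i = 0; i < n; ++i)
        y[(size_t)i * incy] += alpha * x[(size_t)i * incx];
}

/* Reference x := alpha*x, mirroring dscal for positive strides only. */
static void ref_dscal(int n, double alpha, double *x, int incx)
{
    assert(incx > 0);
    for (int i = 0; i < n; ++i)
        x[(size_t)i * incx] *= alpha;
}

/* Largest absolute element-wise difference between two contiguous vectors. */
static double max_abs_diff(int n, const double *a, const double *b)
{
    double worst = 0.0;
    for (int i = 0; i < n; ++i) {
        double d = a[i] - b[i];
        if (d < 0.0)
            d = -d;
        if (d > worst)
            worst = d;
    }
    return worst;
}
```

Running the library routine and the reference on copies of the same input and printing `max_abs_diff` gives a quick signal. Differences at rounding level can occur; anything larger suggests looking at the call arguments or at which interface was actually linked.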

VidyalathaB_Intel · ‎12-28-2021

Hi,

Thanks for reaching out to us.

>> but when I changed to -lmkl_rt doesn't work.

Please check the compile and link options with the Link Line Advisor, available at the link below:

https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl-link-line-advisor.html
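
A side note on `-lmkl_rt`: with the single dynamic library, the interface layer (LP64 or ILP64) and the threading layer are selected at run time rather than on the link line. As far as I know, the documented environment variables for this are `MKL_INTERFACE_LAYER` and `MKL_THREADING_LAYER`. The sketch below sets them from inside the program. `select_mkl_rt_layers` is a hypothetical helper; it assumes a POSIX system and that it is called before the first MKL routine. Whether layer selection is related to the failure reported here is not established.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* for setenv() under strict -std= modes */
#endif
#include <assert.h>
#include <stdlib.h>

/*
 * Ask libmkl_rt to load specific layers (hypothetical helper).
 * interface_layer: e.g. "LP64" (32-bit integers) or "ILP64" (64-bit integers)
 * threading_layer: e.g. "SEQUENTIAL", "GNU" or "INTEL"
 * Returns 0 on success, -1 on failure.
 */
static int select_mkl_rt_layers(const char *interface_layer,
                                const char *threading_layer)
{
    if (interface_layer == NULL || threading_layer == NULL)
        return -1;
    if (setenv("MKL_INTERFACE_LAYER", interface_layer, 1) != 0)
        return -1;
    return setenv("MKL_THREADING_LAYER", threading_layer, 1);
}
```

For example, `select_mkl_rt_layers("LP64", "SEQUENTIAL")` would roughly correspond to linking `-lmkl_intel_lp64 -lmkl_sequential -lmkl_core` explicitly. The integer width must match the compile flags: with LP64 the code should be built without `-DMKL_ILP64`.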

>>libMKL 2019.2.187-1....libMKL was evidently slower than OpenBLAS.

Could you please try the latest version of Intel MKL, which is oneMKL 2022?

Intel MKL is available as part of the Intel® oneAPI Base Toolkit.

Link to download oneAPI Base Toolkit:

https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html

Please let us know if you face any issues, and provide a minimal reproducer (and steps to reproduce, if any) along with the timings you get for both OpenBLAS and oneMKL, so that we can work on it from our end.

Regards,

Vidya.

rlegarda · ‎12-29-2021

Thanks for your kind answer.

I'll answer in parts.

1) Could you please try using the latest version of Intel MKL which is oneMKL 2022 ?

I'm afraid that is a little difficult right now because I'm rather busy. For me, changes to my OS take some time. I'm planning to do it in the near future.

2) Please check with the link line advisor...us a minimal reproducer (&steps to reproduce if any) along with the...

I ran two different experiments. The first was an extensive test of the libraries; everything is included at the following link (https://www.dropbox.com/s/xd6s2eynpxbuoeo/experiment1.zip?dl=0). This link will be valid for a week.

The second experiment is the main reason for my post. I'm solving a nonlinear PDE with CG, which I implemented using BLAS1 routines. The algorithm consists of two loops: the outer loop updates some weights and the inner loop is the CG iteration. The following are the details of my experiments, using the compile and link options from the link that you provided. I hope this is helpful (sorry for the Spanish output):

*************************************************
I have a 307200 x 307200 sparse matrix with 1533760 variables. I do not use any library for handling sparse data; I wrote that part myself.

First run; this is the top of my makefile
CC = g++ -O2
CFLAGS = -m64 -Wno-deprecated -Wall -ansi `pkg-config --cflags opencv`
LDFLAGS = -lm -lcblas -lrt `pkg-config --libs opencv`

This is the output
iteracion : 68 error : 4.70732e-05
iteraciones total : 34886
Tiempo empleado : 229922 mili-segundos
Normalized error: = 0.167415

****
Second run; this is the top of my makefile
CC = g++ -O2
CFLAGS = -DMKL_ILP64 -m64 -Wno-deprecated -Wall -ansi `pkg-config --cflags opencv`
LDFLAGS = -Wl,--no-as-needed -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core -lpthread -lm -ldl -lrt `pkg-config --libs opencv`

This is the output
iteracion : 68 error : 4.64704e-05
iteraciones total : 34784
Tiempo empleado : 219650 mili-segundos
Normalized error: = 0.167413

****
Third run; this is the top of my makefile
CC = g++ -O2
CFLAGS = -m64 -Wno-deprecated -Wall -ansi `pkg-config --cflags opencv`
LDFLAGS = -lmkl_rt -Wl,--no-as-needed -lpthread -lm -ldl -lrt `pkg-config --libs opencv`

DOESN'T WORK

***

Fourth run; this is the top of my makefile
CC = g++ -O2
CFLAGS = -m64 -Wno-deprecated -Wall -ansi `pkg-config --cflags opencv`
LDFLAGS = -lm -lcblas -lrt `pkg-config --libs opencv`

But in this case I switched from OpenBLAS to MKL using
update-alternatives --config libblas.so.3-x86_64-linux-gnu
update-alternatives --config libblas.so-x86_64-linux-gnu

This is the output
iteracion : 67 error : 4.77133e-05
iteraciones total : 34748
Tiempo empleado : 229701 mili-segundos
Normalized error: = 0.167421

***

Fifth run; this is the top of my makefile
CC = g++ -O2
CFLAGS = -m64 -Wno-deprecated -Wall -ansi `pkg-config --cflags opencv`
LDFLAGS = -lm -lmkl_rt -lrt `pkg-config --libs opencv`

DOESN'T WORK

*******************************************************************
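
For readers following along, the inner CG loop described above needs nothing beyond copy, dot, scal and axpy operations plus a matrix-vector product. The sketch below is not the code used for these runs: `cg_solve`, `dense_matvec` and the `v_*` helpers are hypothetical names, the helpers are plain-C stand-ins for the BLAS1 calls (unit stride), and a small dense matrix replaces the sparse operator. The `malloc` casts are only there so the file also builds with g++.

```c
#include <assert.h>
#include <stdlib.h>

/* Plain-C stand-ins for dcopy, ddot, dscal and daxpy (unit stride). */
static void v_copy(int n, const double *x, double *y)
{ for (int i = 0; i < n; ++i) y[i] = x[i]; }
static double v_dot(int n, const double *x, const double *y)
{ double s = 0.0; for (int i = 0; i < n; ++i) s += x[i] * y[i]; return s; }
static void v_scal(int n, double a, double *x)
{ for (int i = 0; i < n; ++i) x[i] *= a; }
static void v_axpy(int n, double a, const double *x, double *y)
{ for (int i = 0; i < n; ++i) y[i] += a * x[i]; }

/* y := A x; ctx carries whatever the operator needs. */
typedef void (*matvec_fn)(int n, const double *x, double *y, void *ctx);

/* Example operator: ctx points to a dense row-major n x n matrix. */
static void dense_matvec(int n, const double *x, double *y, void *ctx)
{
    const double *A = (const double *)ctx;
    for (int i = 0; i < n; ++i)
        y[i] = v_dot(n, A + (size_t)i * n, x);
}

/*
 * Conjugate gradient for a symmetric positive definite operator.
 * x holds the initial guess on entry and the approximation on exit.
 * Stops when ||r|| <= tol or after max_iter iterations.
 * Returns the number of iterations performed, or -1 if allocation fails.
 */
static int cg_solve(int n, matvec_fn apply_A, void *ctx, const double *b,
                    double *x, double tol, int max_iter)
{
    assert(n > 0 && max_iter >= 0);
    double *r  = (double *)malloc((size_t)n * sizeof *r);
    double *p  = (double *)malloc((size_t)n * sizeof *p);
    double *Ap = (double *)malloc((size_t)n * sizeof *Ap);
    if (!r || !p || !Ap) { free(r); free(p); free(Ap); return -1; }

    apply_A(n, x, Ap, ctx);           /* Ap = A x0          */
    v_copy(n, b, r);
    v_axpy(n, -1.0, Ap, r);           /* r  = b - A x0      */
    v_copy(n, r, p);                  /* p  = r             */
    double rr = v_dot(n, r, r);

    int k = 0;
    while (k < max_iter && rr > tol * tol) {
        apply_A(n, p, Ap, ctx);
        double alpha = rr / v_dot(n, p, Ap);
        v_axpy(n,  alpha, p,  x);     /* x += alpha p       */
        v_axpy(n, -alpha, Ap, r);     /* r -= alpha A p     */
        double rr_new = v_dot(n, r, r);
        v_scal(n, rr_new / rr, p);    /* p  = beta p        */
        v_axpy(n, 1.0, r, p);         /* p  = r + beta p    */
        rr = rr_new;
        ++k;
    }
    free(r); free(p); free(Ap);
    return k;
}
```

Swapping the `v_*` helpers for the corresponding `cblas_d*` calls should leave the iteration unchanged, which makes a loop of this shape a convenient place to compare two BLAS builds call by call.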

Again, I'm working on a Lenovo W540 with an Intel® Core™ i7-4700MQ CPU @ 2.40GHz × 8, running Debian buster with
GNU g++ 8.3.0
OpenBLAS 0.3.5+ds-3
libMKL 2019.2.187-1

Regards,

Ricardo

VidyalathaB_Intel · ‎12-30-2021

Hi,

We are looking into this issue and will get back to you soon.

Regards,

Vidya.

Ruqiu_C_Intel · ‎04-28-2022

Hi,

>In addition, I made another test: I left the -lcblas in the linker but switch between OpenBLAS and libMKL using update-alternatives and both working fine, but libMKL was evidently slower than OpenBLAS.

For this one, the reason is that MKL runs multithreaded by default, while OpenBLAS runs in a single thread. When the threads saturate the memory bandwidth, the computation becomes very slow. You can set MKL_NUM_THREADS=1 to make MKL run in a single thread.
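
Besides exporting the variable in the shell, the same request can be made from inside the program. The sketch below is only an illustration: `request_single_threaded_mkl` is a hypothetical helper, it assumes a POSIX system, and to my understanding it only has an effect if it runs before the first MKL routine is called.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* for setenv() under strict -std= modes */
#endif
#include <assert.h>
#include <stdlib.h>

/*
 * Set MKL_NUM_THREADS=1 for this process unless the user already chose
 * a value in the environment (overwrite flag is 0).
 * Returns 0 on success, -1 on failure.
 */
static int request_single_threaded_mkl(void)
{
    return setenv("MKL_NUM_THREADS", "1", 0);
}
```

Passing 0 as the last argument of `setenv` keeps any value already exported by the user, so a thread count chosen on the command line still takes precedence.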

I tested your sample code with MKL's default multithreading against OpenBLAS; the results are below.

MKL library:

root@icx01-tce:/home/rcao8/mkl_test/issue_cblas_daxpy# time ./test_mkl 10000000
iteración 0, numero de datos = 10000000
Performance (MFlops/seg), iteración 0, numero de datos = 10000000
Solucion 3 : 8388.21
Error relativo, iteración 0, numero de datos = 10000000
Solucion 3 : -6.07927e-08

real 0m1.419s
user 0m17.905s
sys 0m0.682s

OpenBLAS library:

root@icx01-tce:/home/rcao8/mkl_test/issue_cblas_daxpy# time ./test_blas 10000000
iteración 0, numero de datos = 10000000
Performance (MFlops/seg), iteración 0, numero de datos = 10000000
Solucion 3 : 1922.24
Error relativo, iteración 0, numero de datos = 10000000
Solucion 3 : -6.07927e-08

real 0m6.259s
user 0m5.973s
sys 0m0.257s
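
One way to read the two `time` outputs is to divide CPU time (user + sys) by wall time, which estimates how many cores were busy on average. The helpers below are a hypothetical sketch of that arithmetic, not part of the test program. With the numbers above, the MKL run kept roughly 13 cores busy and the OpenBLAS run about 1, while the MKL run finished about 4.4 times sooner in wall-clock terms.

```c
#include <assert.h>

/* Average number of busy cores: CPU time (user + sys) over wall time. */
static double avg_busy_cores(double real_s, double user_s, double sys_s)
{
    assert(real_s > 0.0);
    return (user_s + sys_s) / real_s;
}

/* Wall-clock speedup of a faster run relative to a slower one. */
static double wall_speedup(double slow_real_s, double fast_real_s)
{
    assert(fast_real_s > 0.0);
    return slow_real_s / fast_real_s;
}
```

For instance, `avg_busy_cores(1.419, 17.905, 0.682)` covers the MKL run and `wall_speedup(6.259, 1.419)` compares the two wall times. A speedup far below the number of busy cores is consistent with the memory-bandwidth limit mentioned above for BLAS1 routines.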

Regards,

Ruqiu

Ruqiu_C_Intel · ‎05-30-2022

Thank you again for reaching out to us. This issue is being closed and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.