Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

daxpby doesn't scale

Peter_B_9
Beginner
631 Views

I've discovered a strange performance problem: replacing a daxpy() call with daxpby() significantly degrades program scalability. I've written a simple test case (hopefully attached) which shows that, while daxpy() performance increases as I increase the number of threads, daxpby() performance remains completely unchanged. It's as if daxpby() isn't parallelized!

I'm using dense, aligned, million-element vectors of doubles. The vectors are uninitialized, but that shouldn't matter, and I'm using the same vectors for both calls.

I see the same behavior with mkl_2013.0.079 and mkl_2013_sp1.1.106.

Here's my program's output:

Threads=1; 1450.095253 daxpy() calls/sec                                                                                                                                               
Threads=1; 1391.271988 daxpby() calls/sec                                                                                                                                              
Threads=2; 2810.048711 daxpy() calls/sec                                                                                                                                               
Threads=2; 1387.726834 daxpby() calls/sec                                                                                                                                              
Threads=3; 4056.252211 daxpy() calls/sec                                                                                                                                               
Threads=3; 1371.165840 daxpby() calls/sec                                                                                                                                              
Threads=4; 5288.407756 daxpy() calls/sec                                                                                                                                               
Threads=4; 1390.385973 daxpby() calls/sec                                                                                                                                              
Threads=5; 6248.746533 daxpy() calls/sec                                                                                                                                               
Threads=5; 1394.361390 daxpby() calls/sec                                                                                                                                              
Threads=6; 7418.605490 daxpy() calls/sec                                                                                                                                               
Threads=6; 1388.381384 daxpby() calls/sec                                                                                                                                              
Threads=7; 8796.486642 daxpy() calls/sec                                                                                                                                               
Threads=7; 1384.350895 daxpby() calls/sec                                                                                                                                              
Threads=8; 9896.086698 daxpy() calls/sec                                                                                                                                               
Threads=8; 1387.321742 daxpby() calls/sec                                                                                                                                          

 

Note that the number of daxpy() calls per second increases as the number of threads increases, while the daxpby() throughput remains constant.

0 Kudos
4 Replies
Murat_G_Intel
Employee
631 Views

Thanks, Peter, for the reproducer and for bringing this matter to our attention.  You are right – daxpy is parallelized, while daxpby is not.  We will look into this matter.  Any further information you can provide as to use cases (number of elements, number of threads, platforms used, etc.) would be helpful.  Thank you!

0 Kudos
Peter_B_9
Beginner
631 Views

Hi Murat,

I've tested this on a 2.6.38.2 Linux Xeon E5-2660 system, a 3.4.63 Linux Xeon X5690 system, and a Xeon Phi coprocessor. All three demonstrated the same problem.

I believe that I can work around the problem by chunking the vector manually, iterating over the chunks in an OpenMP parallel for loop, and calling daxpby for each chunk.

In general, we expect daxpby performance to be identical to daxpy performance ​when b=1.0, and we expect the performance to scale similarly for b != 1.0. (In fact, I'm a bit surprised that daxpy isn't just a wrapper for daxpby.) 

0 Kudos
Murat_G_Intel
Employee
631 Views

Hi Peter,

Thank you for the additional information! Your expectations are reasonable. We will consider parallelization of daxpby in our next planning meeting.

0 Kudos
Gennady_F_Intel
Moderator
631 Views

Hi Peter,

please check the threaded implementation into the latest versions of  MKL 11.2 and 11.1 Update 4 which were released recently.

--Gennady

 

0 Kudos
Reply