I've discovered a strange performance problem: replacing a daxpy() call with daxpby() significantly degrades program scalability. I've written a simple test case (hopefully attached) which shows that, while daxpy() performance increases as I increase the number of threads, daxpby() performance remains completely unchanged. It's as if daxpby() isn't parallelized!
I'm using dense, aligned, million-element vectors of doubles. The vectors are uninitialized, but that shouldn't matter, and I'm using the same vectors for both calls.
I see the same behavior with mkl_2013.0.079 and mkl_2013_sp1.1.106.
Here's my program's output:
Threads=1; 1450.095253 daxpy() calls/sec
Threads=1; 1391.271988 daxpby() calls/sec
Threads=2; 2810.048711 daxpy() calls/sec
Threads=2; 1387.726834 daxpby() calls/sec
Threads=3; 4056.252211 daxpy() calls/sec
Threads=3; 1371.165840 daxpby() calls/sec
Threads=4; 5288.407756 daxpy() calls/sec
Threads=4; 1390.385973 daxpby() calls/sec
Threads=5; 6248.746533 daxpy() calls/sec
Threads=5; 1394.361390 daxpby() calls/sec
Threads=6; 7418.605490 daxpy() calls/sec
Threads=6; 1388.381384 daxpby() calls/sec
Threads=7; 8796.486642 daxpy() calls/sec
Threads=7; 1384.350895 daxpby() calls/sec
Threads=8; 9896.086698 daxpy() calls/sec
Threads=8; 1387.321742 daxpby() calls/sec
Note that the number of daxpy() calls per second increases as the number of threads increases, while the daxpby() throughput remains constant.
Thanks, Peter, for the reproducer and for bringing this matter to our attention. You are right: daxpy is parallelized, while daxpby is not. We will look into it. Any further information you can provide about your use cases (number of elements, number of threads, platforms used, etc.) would be helpful. Thank you!
I've tested this on a 22.214.171.124 Linux Xeon E5-2660 system, a 3.4.63 Linux Xeon X5690 system, and a Xeon Phi coprocessor. All three demonstrated the same problem.
I believe that I can work around the problem by chunking the vector manually, iterating over the chunks in an OpenMP parallel for loop, and calling daxpby for each chunk.
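That workaround might look something like the sketch below. Here `daxpby_ref` is a plain-C stand-in for MKL's `cblas_daxpby` (the real workaround would call into MKL on each sub-vector, with unit strides), and the chunk size of 8192 is an arbitrary tuning parameter I picked for illustration:

```c
#include <stddef.h>

/* Stand-in for cblas_daxpby with unit strides: y := a*x + b*y.
 * In the actual workaround this body would be an MKL call on the
 * sub-vector. */
static void daxpby_ref(size_t n, double a, const double *x,
                       double b, double *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + b * y[i];
}

/* Chunked daxpby: split the vector into fixed-size pieces and apply
 * daxpby to each piece independently, so the pieces can run on
 * separate OpenMP threads. */
void daxpby_chunked(size_t n, double a, const double *x,
                    double b, double *y)
{
    const size_t chunk = 8192;   /* tuning parameter, chosen arbitrarily */
    ptrdiff_t nchunks = (ptrdiff_t)((n + chunk - 1) / chunk);

    #pragma omp parallel for schedule(static)
    for (ptrdiff_t c = 0; c < nchunks; ++c) {
        size_t lo  = (size_t)c * chunk;
        size_t len = (n - lo < chunk) ? (n - lo) : chunk;
        daxpby_ref(len, a, x + lo, b, y + lo);
    }
}
```

Since each chunk touches a disjoint slice of x and y, the iterations are independent and the loop parallelizes without any synchronization beyond the implicit barrier at the end.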
In general, since daxpby computes y := a*x + b*y, we expect daxpby performance to be identical to daxpy performance when b = 1.0, and we expect the performance to scale similarly for b != 1.0. (In fact, I'm a bit surprised that daxpy isn't just a wrapper for daxpby.)
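To spell out that relationship, here is a plain-C sketch of the two operations' semantics. `daxpy_ref` and `daxpby_ref` are illustrative reference loops I've written for this post, not MKL's implementation; setting b = 1.0 in `daxpby_ref` reduces it element by element to `daxpy_ref`:

```c
#include <stddef.h>

/* Reference semantics of BLAS daxpy: y := a*x + y. */
static void daxpy_ref(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}

/* Reference semantics of daxpby: y := a*x + b*y.  With b = 1.0 each
 * iteration computes a*x[i] + y[i], exactly what daxpy_ref computes. */
static void daxpby_ref(size_t n, double a, const double *x,
                       double b, double *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + b * y[i];
}
```

This is why a daxpy implemented as a thin wrapper around daxpby (passing b = 1.0) would be mathematically equivalent, which makes the gap in parallelization between the two all the more surprising.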