Dear support team!
We’ve faced some problems with QR algorithm scalability, implemented using MKL functions LAPACKE_sgehrd to reduce our matrix to Hessenberg form and LAPACKE_shseqr to perform iterations of QR algorithm itself.
Here is the code we launched on Xeon E5 v3 processor with 14 cores:
cout << "threads count: " << omp_get_max_threads() << endl;
double t1 = omp_get_wtime();
LAPACKE_sgehrd(LAPACK_ROW_MAJOR, size, 1, size, A, size, tau);
double t2 = omp_get_wtime();
cout << "LAPACKE_sgehrd time: " << t2 - t1 << " sec" << endl;
float *re = new float[size];
float *im = new float[size];
t1 = omp_get_wtime();
LAPACKE_shseqr(LAPACK_ROW_MAJOR, 'E', 'N', size, 1, size, A, size, re, im, z, size);
t2 = omp_get_wtime();
cout << "LAPACKE_shseqr time: " << t2 - t1 << " sec" << endl;
The compiler we used is icc (ICC) 15.0.3 20150407. Here are the results of launches on 1, 2, 3, 4 and 14 cores:
threads count: 1
LAPACKE_sgehrd time: 84.4017 sec
LAPACKE_shseqr time: 30.4593 sec
threads count: 2
LAPACKE_sgehrd time: 45.2026 sec
LAPACKE_shseqr time: 27.8578 sec
threads count: 3
LAPACKE_sgehrd time: 35.0818 sec
LAPACKE_shseqr time: 25.2905 sec
threads count: 4
LAPACKE_sgehrd time: 27.3022 sec
LAPACKE_shseqr time: 28.1272 sec
threads count: 14
LAPACKE_sgehrd time: 19.8118 sec
LAPACKE_shseqr time: 27.1131 sec
As it is clear, LAPACKE_sgehrd has poor scalability, while LAPACKE_shseqr has no scalability at all. The question is if there is any way we can improve the scalability of both this routines, or it its working as intended?
I find something might be wrong in your code. The input general matrix A will be reduced to upper Hessenberg form, and the A will be covered. That means, in your code, since the second time, you are not calculating same matrix as the first time. If the input data has been changed, it is not suitable to compare performance. I wrote a test case fixed this problem, you could have a test with it.
The structure of input matrix, the size of matrix would affect the performance of calculation. And sometimes, it is very common the performance of 16 threads is not better than 10 threads. For some calculation, the peak of performance is not the maximum number of threads. Normally, multiple threads is better than single thread, for some cases, 4 threads better than 2 threads. But above 6 or 8 threads, if you are not doing huge amount of calculation, the performance might be reduces. Because there's also cost for arranging works for threads, if the calculation cost is smaller than cost for arranging threads & initialization, the probably need to reduce the number of threads & choose proper number of threads to run your code.
I used icc 2017 & mkl 2017 in Xeon(R) CPU E5-2680 Linux Ubuntu with 8 cpus (16 cores), general matrix A(2000*2000) with random values:
Requesting Intel(R) MKL to use 1 thread(s) --millisecond
The 1 thread(s) performance of ?gehrd is: 1.19741
The 1 thread(s) performance of ?hesqr is: 0.01532
Requesting Intel(R) MKL to use 2 thread(s)
The 2 thread(s) performance of ?gehrd is: 0.71143
The 2 thread(s) performance of ?hesqr is: 0.00992
Requesting Intel(R) MKL to use 4 thread(s)
The 4 thread(s) performance of ?gehrd is: 0.53160
The 4 thread(s) performance of ?hesqr is: 0.00657
Requesting Intel(R) MKL to use 10 thread(s) (peak of performance)
The 10 thread(s) performance of ?gehrd is: 0.34154
The 10 thread(s) performance of ?hesqr is: 0.00393
more than 10 threads, the performance would decrease.