I've been experimenting with the summary statistics library for computing correlation matrices. I'm using a large sample matrix (10000x6000) of random double values. I've found that the VSL_SS_METHOD_FAST algorithm does a reasonable job of quickly computing the correlation matrix. However, the algorithm does not seem to use more than 6 threads, no matter how I tweak it. I assume that this large matrix could benefit from using more threads, but even on the Intel Phi I can't persuade MKL to use more than 6.
(I've determined that 6 threads are in use both through observing my test case with top, and from setting KMP_AFFINITY=verbose and observing the output.)
I call mkl_set_num_threads(240) and mkl_set_dynamic(0) before I run my test case. After the test case, I compute the dot product of a large vector with itself using ddot. This demonstrates that MKL is able to use 240 threads - it's only the VSL Summary Statistics functions which seem to be restricted to 6 threads.
I've tested that the problem occurs with both VSL_SS_METHOD_FAST and VSL_SS_METHOD_1PASS, and with both VSL_SS_MATRIX_STORAGE_ROWS and VSL_SS_MATRIX_STORAGE_COLS.
How can I fully exploit available processors (particularly on the Phi) to compute correlation matrices?
(I've tried attaching a test case to this, but I keep getting an AJAX HTTP error 550 from the forum.)
We reproduced this behavior. Threading the correlation computation is a multi-criteria problem: the problem size DIM x SIZE (10000x6000 in your case) affects not only the amount of work per thread but also memory consumption, which is high. The current implementation assumes DIM << SIZE and limits the number of threads based on SIZE, so increasing that part of the problem size will increase the number of threads used. We are constantly working on improving the threading efficiency of our algorithms and will consider your case in future optimizations.