Pardiso 24 cores poor scaling

Karel_T_ · 02-15-2016

I am using Pardiso on a machine with two twelve-core processors. HT is disabled in the BIOS. Up to 12 threads the CPU time scales reasonably, but 24 threads is only 10% faster than 12 threads on one processor. Is this expected? I am using iparm(24)=1 and the system variables MKL_DYNAMIC=FALSE and MKL_NUM_THREADS=24 (nothing changes even when these variables are not defined). Pardiso is called from AceFEM. For now I can only change the parameters, but I can contact the author if needed. Where could the mistake be? It seems as if only one processor is used, although Task Manager shows that all cores are fully loaded.

Thanks a lot! Karel
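To put the figures in the question on a common footing, the gain from doubling the thread count can be expressed as a speedup and a parallel efficiency. The sketch below is only an illustration: the function name and the sample timings are hypothetical, chosen so that the 24-thread run takes 10% less time than the 12-thread run.

```python
def scaling_efficiency(t_base, t_new, threads_base, threads_new):
    """Return (speedup, efficiency) when going from threads_base to threads_new.

    speedup    = t_base / t_new
    efficiency = speedup / (threads_new / threads_base); 1.0 means ideal scaling.
    """
    if min(t_base, t_new) <= 0 or min(threads_base, threads_new) <= 0:
        raise ValueError("timings and thread counts must be positive")
    speedup = t_base / t_new
    ideal = threads_new / threads_base
    return speedup, speedup / ideal


# Hypothetical timings in seconds: the 24-thread run takes 10% less time.
speedup, efficiency = scaling_efficiency(100.0, 90.0, 12, 24)
print(f"speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

With these made-up timings the second socket delivers a speedup of about 1.11x against an ideal 2x, so roughly half of the added cores' potential is lost.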

Ying_H_Intel · 02-15-2016

Hi Karel,

1) Could you please give us some details: which MKL version are you using? Windows, Linux or another OS? Intel 64-bit or 32-bit? C or Fortran?

According to the MKL manual, iparm[23] (zero-based C indexing; iparm(24) in one-based Fortran numbering) is described as follows:

input
Parallel factorization control.
NOTE: The two-level factorization algorithm does not improve performance in OOC mode.
0* Intel MKL PARDISO uses the classic algorithm for factorization.
1 Intel MKL PARDISO uses a two-level factorization algorithm. This algorithm generally improves scalability in case of parallel factorization on many OpenMP threads (more than eight).
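Since the question writes iparm(24) and the manual excerpt writes iparm[23], here is a minimal sketch of how the two numberings map onto the same slot. The helper name `set_iparm` is hypothetical and a plain Python list stands in for the real 64-entry iparm array; it is not an MKL call.

```python
IPARM_LEN = 64  # PARDISO's iparm array has 64 entries


def set_iparm(iparm, fortran_index, value):
    """Set an entry using one-based (Fortran-style) numbering."""
    if not 1 <= fortran_index <= len(iparm):
        raise IndexError(f"iparm index {fortran_index} out of range")
    iparm[fortran_index - 1] = value  # shift to zero-based storage


iparm = [0] * IPARM_LEN
set_iparm(iparm, 24, 1)  # iparm(24)=1 requests the two-level factorization
print(iparm[23])         # the same slot as iparm[23] in zero-based C code
```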

2) How large is the input sparse matrix? Could you set msglvl=1 and post the output?

MKL provides some PARDISO benchmarks: https://software.intel.com/en-us/intel-mkl/benchmarks#Parallell.

3) Additionally, do you have the sparse matrix stored in a file? If so, you can try it directly with the pardiso.f example under the MKL install directory, e.g.

C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2016.1.146\windows\mkl\examples

Best Regards,

Ying

Karel_T_ · 02-16-2016

Hi, thanks for the reply.

1a) MKL 11.3, Win 8.1, 64bit, C.

1b) I found this parameter yesterday; it improves the CPU time by 10%. Everything is stored in RAM.

2) The matrix comes from FEM; it is nonsymmetric and not positive definite (matrix type 11). For 2D problems the connectivity is low, and we solve systems with between 1 million and 8 million unknowns. In 3D (higher connectivity, less sparse than in 2D) we solve systems with between 200,000 and 1 million unknowns. In all cases the solution on both CPUs is less than 10% faster than on one CPU. As an example: the matrix size is 4,327,200 and the number of non-zero entries is 116,704,670. At the moment I cannot set msglvl=1; what information would it provide?

3) The matrix is built in RAM; I will try to run the benchmark.
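For a sense of scale, the example matrix in point 2 can be turned into a rough estimate of row density and input storage. This is only a sketch: it assumes compressed sparse row (CSR) storage with 8-byte double-precision values and 4-byte integer indices, and the function name is hypothetical. It covers the input matrix only, not the fill-in produced during factorization, which is typically much larger.

```python
def csr_footprint(n_rows, nnz, value_bytes=8, index_bytes=4):
    """Estimate bytes for a CSR matrix: values, column indices, row pointers."""
    values = nnz * value_bytes
    col_idx = nnz * index_bytes
    row_ptr = (n_rows + 1) * index_bytes
    return values + col_idx + row_ptr


n, nnz = 4_327_200, 116_704_670  # example matrix from point 2
print(f"average non-zeros per row: {nnz / n:.1f}")
print(f"CSR input matrix: {csr_footprint(n, nnz) / 2**30:.2f} GiB")
```

Under these assumptions the example has about 27 non-zeros per row and needs roughly 1.32 GiB for the input matrix alone.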

The main reason I am looking into this is that the computer with 2x Xeon E5-2680 v3 was much more expensive than the computer with an overclocked 5960X, and the difference in CPU time between them is only around 20%. So I would like to know whether there is any reason to buy such a computer next time. By the way, is ECC RAM a big advantage?

Thanks, Karel.

Ying_H_Intel · 02-21-2016

Hi Karel,

Thanks for the reply. As I understand it, it is not expected that "24 threads is only 10% faster than 12 threads on one processor", but this depends on the problem size, sparsity, memory size, etc. So we need a test case to verify.

If you set msglvl=1, you will see the solver's messages, like those attached in https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/601183. Alternatively, is there any way to provide us with one input matrix (written to a text file), so that we can run a standalone test on our side?

We released MKL 11.3.2 this week. There was a performance issue in MKL 11.3 and 11.3.1; please see https://software.intel.com/en-us/articles/intel-mkl-113-bug-fixes-list. Could you please try the new version and share a performance comparison?

You can get MKL 11.3.2 from the Intel Registration Center: https://registrationcenter.intel.com/en/

Best Regards,

Ying

TimP · 02-21-2016

Did you investigate whether setting affinity (e.g. via OMP_PROC_BIND or OMP_PLACES) improves dual-CPU performance?
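One way to try this is to set the variables in the environment of the process that loads the solver, before it starts. The sketch below is an assumption-laden illustration: the helper name `affinity_env` and the executable name in the comment are hypothetical, `cores` and `spread` are standard OpenMP values picked as a starting point, and whether the OpenMP runtime shipped with a given MKL build honors them should be checked.

```python
import os


def affinity_env(num_threads, base_env=None):
    """Return a copy of the environment with threading and affinity variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "MKL_NUM_THREADS": str(num_threads),
        "MKL_DYNAMIC": "FALSE",
        "OMP_PLACES": "cores",      # one place per physical core
        "OMP_PROC_BIND": "spread",  # distribute threads across the places
    })
    return env


env = affinity_env(24)
# Hand this to the process that calls the solver, for example:
#   subprocess.run(["solver.exe"], env=env)   # hypothetical executable name
print({k: env[k] for k in ("MKL_NUM_THREADS", "OMP_PLACES", "OMP_PROC_BIND")})
```

Comparing timings with and without these variables, at 12 and at 24 threads, would show whether thread placement across the two sockets is part of the problem.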