Solved: Pardiso thread vs. core usage

Greg_M_ · ‎12-20-2016

I'm wondering if it's possible to run Pardiso on more than one thread per core on Linux, or if certain behind-the-scenes optimisations prevent this. That is, with the following environment:

$ env | grep PARDISO

MKL_DOMAIN_NUM_THREADS=MKL_DOMAIN_PARDISO=56

And with the following hardware configuration, per (excerpted) /proc/cpuinfo:

processor : 55

vendor_id : GenuineIntel

cpu family : 6

model : 63

model name : Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz

Our 2x14-core machine still reports only 28 threads in the following (excerpted) Pardiso call summary:

Statistics:

===========

Parallel Direct Factorization is running on 28 OpenMP

< Linear system Ax = b >

number of equations: 89317

number of non-zeros in A: 57771613

number of non-zeros in A (%): 0.724180

BTW, our OS is CentOS release 6.6, with 64-bit icc, composer_xe_2015.3.187.

Any info much appreciated,

-Greg

Zhang_Z_Intel · ‎12-20-2016

To turn off MKL's "behind-the-scenes" thread-count heuristics, set the environment variable MKL_DYNAMIC to FALSE (the runtime equivalent is mkl_set_dynamic(0)). This tells MKL to try to use the number of threads you specified.
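A minimal sketch of the corresponding shell setup, assuming a bash-like shell and the 56-thread Pardiso request from the original question. It uses FALSE, the value listed for MKL_DYNAMIC in the MKL documentation; the `./solver` name in the last comment is a hypothetical placeholder for whatever application calls Pardiso.

```shell
# Ask MKL not to adjust the requested thread count (documented values: TRUE/FALSE).
export MKL_DYNAMIC=FALSE

# Request 56 threads for the Pardiso domain only, as in the original question.
export MKL_DOMAIN_NUM_THREADS="MKL_DOMAIN_PARDISO=56"

# Confirm what the solver process will inherit.
env | grep '^MKL_'

# ./solver   # hypothetical application binary; not part of this thread
```

After a run with these settings, the "running on ... OpenMP" line in the Pardiso statistics is the place to check whether the request was honoured.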

Greg_M_ · ‎12-20-2016

Thanks very much for your reply Zhang Zhang.

So, if I understand correctly, even though 56 Pardiso threads had been requested via the environment variable setting, "export MKL_DOMAIN_NUM_THREADS='MKL_DOMAIN_PARDISO=56'", the number of threads actually used corresponded to the number of cores, since MKL_DYNAMIC is, by default, true?

(The MKL documentation seems to suggest as much: "For example, MKL_DYNAMIC set to TRUE enables optimal choice of the number of threads in the following cases: - If the requested number of threads exceeds the number of physical cores (perhaps because of using the Intel® Hyper-Threading Technology), Intel MKL scales down the number of OpenMP threads to the number of physical cores.")

So, if that's the case, this would have been standard behaviour for all MKL domains and not just Pardiso?

Of course, if the above is all true, it raises the question of why the MKL thread count would default to the core count when HTT is in use. Should one, in general, expect optimal MKL behaviour when the thread count equals the core count, or when it equals the HTT (logical CPU) count?

Thanks again Zhang Zhang, and best regards, -Greg

Zhang_Z_Intel · ‎12-21-2016

Greg,

Your understanding is mostly correct. By default, MKL_DYNAMIC is TRUE and all MKL domains use no more than the total number of threads available on the system. Sometimes a function may use fewer threads. The "total number of threads" MKL sees is the number reported by the OS, so it could be the same as the number of physical cores (when hyper-threading is off) or 2x the number of physical cores (when hyper-threading is on). Typically, functions that are CPU bound (e.g. large dense matrix operations) tend to give better performance if the number of threads is the same as the number of physical cores. Memory-bound functions are not as sensitive to hyper-threading and may even get better performance by using more threads. You need careful benchmarking to make good decisions.
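As a starting point for that benchmarking, the two candidate thread counts (physical cores and logical CPUs) can be read from /proc/cpuinfo on Linux. This is only a sketch: the function names are made up for illustration, and it assumes the x86 layout in which each processor stanza lists "physical id" before "core id".

```shell
# Count logical CPUs: one "processor" stanza per hardware thread.
count_logical_cpus() {
    grep -c '^processor' "${1:-/proc/cpuinfo}"
}

# Count physical cores: unique (physical id, core id) pairs.
count_physical_cores() {
    awk -F: '
        /^physical id/ { pkg = $2 }
        /^core id/     { seen[pkg "," $2] = 1 }
        END            { n = 0; for (k in seen) n++; print n }
    ' "${1:-/proc/cpuinfo}"
}

echo "logical CPUs:   $(count_logical_cpus)"
echo "physical cores: $(count_physical_cores)"
```

On the 2x14-core machine from the question, with hyper-threading on, these would be expected to report 56 and 28 respectively. Timing the same Pardiso solve at both thread counts is then a reasonable first comparison.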

Besides the number of threads, thread affinity is another factor that affects performance, especially on multi-socket systems. No matter how many threads you use, you are better off binding application threads to physical cores. This helps avoid thread migration and improves data locality. You can set affinity with KMP_AFFINITY (for Intel OpenMP) or through the OpenMP runtime API. Please consult the OpenMP documentation.
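For illustration, one commonly used affinity setting for hyper-threaded systems with the Intel OpenMP runtime looks like the sketch below. The specific values are an assumption about what suits this machine, not a tuned recommendation; the commented-out lines show the standard OpenMP 4.0 environment variables as a portable alternative.

```shell
# Intel OpenMP runtime: pin threads, filling one hardware thread per core first.
export KMP_AFFINITY=granularity=fine,compact,1,0

# Portable OpenMP 4.0+ alternative: one place per physical core, threads spread across them.
# export OMP_PLACES=cores
# export OMP_PROC_BIND=spread

echo "KMP_AFFINITY=$KMP_AFFINITY"
```

Prepending `verbose,` to the KMP_AFFINITY value makes the Intel runtime print the resulting thread-to-CPU bindings at startup, which is useful for confirming that threads landed where intended.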

Thanks.

Greg_M_ · ‎12-22-2016

Zhang Zhang,

Thanks for the detailed, informative reply!

Best regards,

-Greg