Pardiso with TBB threading

Daniel_S_2 · ‎03-03-2016

I'm trying to use the Pardiso solver with the TBB threading layer.

It seems that Pardiso has a lot of idle time with OpenMP on my kind of problems.

This page says that Pardiso supports TBB:

https://software.intel.com/en-us/articles/using-intel-mkl-and-intel-tbb-in-the-same-application

So I gave it a try.

I'm linking with:

mkl_intel_lp64_dll.lib mkl_core_dll.lib mkl_tbb_thread_dll.lib tbb.lib

and I get single-threaded execution (same result with the static libs).

I'm using MSVC 2015.

What am I missing?

tnx

D

Gennady_F_Intel · ‎03-04-2016

>> It seems that Pardiso has a lot of idle time with OpenMP on my kind of problems.

<< What is the problem size? And could you try the OpenMP-threaded version and compare the performance results?

TimP · ‎03-04-2016

The reference about using MKL with TBB appears to say that certain MKL functions are available in a TBB version, and gives a specific link command for that purpose, different from what you show here. If you use both OpenMP and TBB threading, you should expect idle OpenMP threads to persist for KMP_BLOCKTIME before a TBB thread can run on the same hardware thread.

If you are following the suggestion about tbb::affinity_partitioner and still using OpenMP as well, you might try some scheme such as limiting TBB threads to 1 per core (if you have enabled HyperThreading), taking advantage of the Intel OpenMP default limit of 1 per core, or specifically pinning OpenMP and TBB threads to different cores.

Daniel_S_2 · ‎03-04-2016

I think I have a lead.

For some reason mkl_sequential is loaded even when mkl_tbb_thread_dll.lib is linked, which explains the single-threaded times.

Any idea why?

As for the OpenMP performance, here is some info:

threads time
1            1000
2            630
4            400
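
To put the scaling figures above in perspective, here is a minimal C++ sketch that turns them into speedup, parallel efficiency, and the serial fraction implied by Amdahl's law. The struct and function names are hypothetical helpers introduced for illustration, and Amdahl's law is only a rough model: it does not by itself show where Pardiso is idling.

```cpp
#include <cassert>
#include <vector>

// One row of the timing table: thread count and measured time
// (time units as reported in the post; they are not stated there).
struct Timing {
    int threads;
    double time;
};

// Speedup relative to the single-thread run: S(p) = T(1) / T(p).
double speedup(double t1, double tp) { return t1 / tp; }

// Parallel efficiency: E(p) = S(p) / p.
double efficiency(double t1, double tp, int p) {
    return speedup(t1, tp) / p;
}

// Serial fraction f implied by Amdahl's law, S(p) = 1 / (f + (1 - f) / p).
// Solving for f gives f = (p / S - 1) / (p - 1), defined only for p > 1.
double amdahl_serial_fraction(double t1, double tp, int p) {
    assert(p > 1);
    const double s = speedup(t1, tp);
    return (p / s - 1.0) / (p - 1.0);
}

// The figures quoted in the post above.
std::vector<Timing> reported_timings() {
    return {{1, 1000.0}, {2, 630.0}, {4, 400.0}};
}
```

For the reported numbers this gives a speedup of about 1.59 on 2 threads and 2.5 on 4 threads, i.e. an efficiency of about 0.79 and 0.63, and an implied serial fraction of roughly 0.26 and 0.20 respectively.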

problem information:

0-based array is turned ON
PARDISO double precision computation is turned ON
Parallel METIS algorithm at reorder step is turned ON
Scaling is turned ON

Summary: ( reordering phase )
================

Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 0.022640 s
Time spent in reordering of the initial matrix (reorder) : 0.398392 s
Time spent in symbolic factorization (symbfct) : 0.244236 s
Time spent in data preparations for factorization (parlist) : 0.005588 s
Time spent in allocation of internal data structures (malloc) : 0.014131 s
Time spent in additional calculations : 0.170513 s
Total time spent : 0.855501 s

Statistics:
===========
Parallel Direct Factorization is running on 1 OpenMP

< Linear system Ax = b >
number of equations: 258687
number of non-zeros in A: 2821302
number of non-zeros in A (%): 0.004216

number of right-hand sides: 1

< Factors L and U >
number of columns for each panel: 128
number of independent subgraphs: 0
number of supernodes: 50167
size of largest supernode: 1041
number of non-zeros in L: 32337906
number of non-zeros in U: 1
number of non-zeros in L+U: 32337907

Daniel_S_2 · ‎03-07-2016

UPDATE:

After stripping down the project and switching to the Intel compiler, mkl_tbb_thread_dll is loaded but crashes :(

Here is the call stack:

>   mkl_tbb_thread.dll!00007ffa1095a067()   Unknown
    tbb.dll!tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task & parent, tbb::task * child) Line 467   C++
    tbb.dll!tbb::internal::arena::process(tbb::internal::generic_scheduler & s) Line 147   C++
    tbb.dll!tbb::internal::market::process(rml::job & j) Line 677   C++
    tbb.dll!tbb::internal::rml::private_worker::run() Line 276   C++
    tbb.dll!tbb::internal::rml::private_worker::thread_routine(void * arg) Line 229   C++
    ucrtbase.dll!00007ffa3dc982dd()   Unknown

Probably some runtime version incompatibility.

The TBB runtime is

compilers_and_libraries_2016.2.180\windows\redist\intel64_win\tbb\vc14\tbb.dll

I also tested with vc_mt\tbb.dll.

A simple TBB for loop works fine in the same project.

It seems that mkl_tbb_thread_dll has ABI compatibility issues with the TBB runtime.

Any idea?

tnx

D

Gennady_F_Intel · ‎03-17-2016

D.S., how could we reproduce the problem? I checked with some of the Pardiso examples, linked against the vc14 TBB DLL. No issues were detected.

Daniel_S_2 · ‎03-21-2016

Hello Gennady,

I just got to test it again. The crash is data dependent. TBB works fine with a few test matrices I tried, but crashes in phase 11 with some of my data sets.

For a diagonal matrix of size 100000 with a few random off-diagonal (OD) elements, TBB was actually a little slower, and phase 11 does not seem to be threaded at all.

I can send you the data with simple code that loads it if you need it.

Daniel

Gennady_F_Intel · ‎03-21-2016

Daniel, please try setting iparm[1]=0 instead of iparm[1]=2 (which is the default) and check how it works on your side.
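
As a rough illustration of the suggestion above, here is a minimal C++ sketch of filling the PARDISO control array before the call. It deliberately does not include the MKL headers or call pardiso itself: `Iparm`, `Reordering`, and `make_iparm` are hypothetical names, and plain `int` stands in for `MKL_INT` (an assumption that matches the LP64 interface). One point worth keeping in mind is that iparm[0] must be nonzero, otherwise PARDISO fills the remaining entries with defaults and the value written to iparm[1] has no effect.

```cpp
#include <array>
#include <cassert>

// PARDISO takes a 64-entry integer control array. MKL declares it as
// MKL_INT; plain int is an assumption made so this compiles without MKL.
using Iparm = std::array<int, 64>;

// Values of iparm[1] (C indexing) discussed in the thread.
enum Reordering {
    kMinimumDegree = 0,  // minimum degree ordering
    kMetis = 2,          // METIS nested dissection (the default)
    kParallelMetis = 3   // OpenMP-parallel nested dissection
};

// Hypothetical helper, not part of MKL.
Iparm make_iparm(Reordering reordering, bool zero_based) {
    Iparm iparm{};                   // every entry starts at zero
    iparm[0] = 1;                    // nonzero: use the values supplied here
    iparm[1] = reordering;           // fill-in reducing ordering
    iparm[34] = zero_based ? 1 : 0;  // 1 = zero-based (C) indexing
    // All other entries are left at zero in this sketch. Zero is not the
    // documented default for every entry, so a real program should set
    // them explicitly for its matrix type.
    return iparm;
}
```

A call such as `make_iparm(kMinimumDegree, true)` would correspond to the iparm[1]=0 experiment with zero-based indexing, and the resulting array's `data()` pointer is what would be handed to the solver.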

Daniel_S_2 · ‎03-21-2016

Still the same crash with iparm[1] = 0, 2, and 3.

I'm using 3, which makes phase 11 about 2x faster with OpenMP.

It seems that the data sets that don't crash are not using TBB threads at all.

Here is the crash:

Exception thrown at 0x00007FFBF459A067 (mkl_tbb_thread.dll) in MklTester.exe: 0xC0000005: Access violation writing location 0x00000025485EC000.

Sometimes it happens on the main thread, sometimes on a worker thread.

Stack:

>   mkl_tbb_thread.dll!00007ffbf459a067()   Unknown
    tbb.dll!tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task & parent, tbb::task * child) Line 467   C++
    tbb.dll!tbb::internal::arena::process(tbb::internal::generic_scheduler & s) Line 147   C++
    tbb.dll!tbb::internal::market::process(rml::job & j) Line 677   C++
    tbb.dll!tbb::internal::rml::private_worker::run() Line 276   C++
    tbb.dll!tbb::internal::rml::private_worker::thread_routine(void * arg) Line 229   C++
    ucrtbase.dll!00007ffc1e9482dd()   Unknown

Gennady_F_Intel · ‎03-30-2016

>> I can send you the data with a simple code that loads it if you need it.

Daniel, we still don't see the problem on our side with the latest version. Could you please send us the data and the code so we can reproduce the problem on our side?

Thanks, Gennady

Daniel_S_2 · ‎03-30-2016

Hi Gennady,

Here is the Visual Studio 2015 project with the data. Just unzip it, open the .sln, and run.

The matrix values are empty to make the zip smaller, but the result is the same with the true values. It runs OK with OpenMP.

Gennady_F_Intel · ‎04-01-2016

Daniel, regarding the exception with TBB threading: I checked your example with MKL 11.3 u2, linked with the universal vc_mt TBB DLL as well as with the vc12 and vc14 ones.

I used the test you provided (slightly modified by adding a call to mkl_get_version(&Version);), compiling and launching it from the command line because MSVC 2015 is not available on my system.

All cases work fine. Below is the output when vc14\tbb.dll is used:

..\_Forums\u611238_pardiso_tbb>_5tbb.exe
file mkl-860663123-00z.bin
matrix dim 258687
matrix nnz/2 2821302
64 bits
Major version: 11
Minor version: 3
Update version: 2
Product status: Product
Build: 20160120
Platform: Intel(R) 64 architecture
Processor optimization: Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors
================================================================

num threads 2

=== PARDISO: solving a symmetric positive definite system ===
1-based array indexing is turned ON
PARDISO double precision computation is turned ON
Parallel METIS algorithm at reorder step is turned ON
Scaling is turned ON

Summary: ( reordering phase )
================

Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 0.026380 s
Time spent in reordering of the initial matrix (reorder) : 0.522636 s
Time spent in symbolic factorization (symbfct) : 0.264818 s
Time spent in data preparations for factorization (parlist) : 0.006496 s
Time spent in allocation of internal data structures (malloc) : 0.031902 s
Time spent in additional calculations : 0.198542 s
Total time spent : 1.050775 s

Statistics:
===========
Parallel Direct Factorization is running on 1 OpenMP

< Linear system Ax = b >
number of equations: 258687
number of non-zeros in A: 2821302
number of non-zeros in A (%): 0.004216

number of right-hand sides: 1

< Factors L and U >
number of columns for each panel: 128
number of independent subgraphs: 0
number of supernodes: 50167
size of largest supernode: 1041
number of non-zeros in L: 32337906
number of non-zeros in U: 1
number of non-zeros in L+U: 32337907

symolic factorization time is 1379 ms

Daniel_S_2 · ‎04-04-2016

Hi Gennady,

I tested it on another computer, and it crashes every time when using TBB.

Windows 10, Visual Studio 2012/2015 Update 1, i7 2600 and i7 4770, MKL 11.3.2.1.

I'll stay with OpenMP for now...

tnx