Daniel_S_2
Beginner
221 Views

Pardiso with TBB threading

I'm trying to use the Pardiso solver with the TBB threading layer.

It seems that Pardiso has a lot of idle time with OMP on my kind of problems.

This page says that Pardiso supports TBB:

https://software.intel.com/en-us/articles/using-intel-mkl-and-intel-tbb-in-the-same-application

 

So I gave it a try.

I'm linking with

mkl_intel_lp64_dll.lib mkl_core_dll.lib mkl_tbb_thread_dll.lib tbb.lib

and I get single-threaded execution (same result with the static libs).

I'm using MSVC 2015.

What am I missing?

Thanks,

D

12 Replies
Gennady_F_Intel
Moderator

>> It seems that Pardiso has a lot of idle time with OMP on my kind of problems.

<< What is the problem size? Could you also try the OpenMP-threaded version and compare the performance results?

TimP
Black Belt

The reference about using MKL with TBB appears to say that certain MKL functions are available in a TBB version, and it gives a specific link command for that purpose, different from what you show here. If you use both OpenMP and TBB threading, you should expect idle OpenMP threads to persist for KMP_BLOCKTIME before a TBB thread can run on the same hardware thread.

If you are following the suggestion about tbb::affinity_partitioner and are still using OpenMP as well, you might try some scheme such as limiting TBB threads to 1 per core (if you have enabled HyperThreading), taking advantage of the Intel OpenMP default limit of 1 per core, or specifically pinning OpenMP and TBB threads to different cores.

Daniel_S_2
Beginner

I think I have a lead.

For some reason mkl_sequential is loaded even though mkl_tbb_thread_dll.lib is linked, which explains the single-threaded timings.

Any idea why?

As for the OMP performance, here is some info:

threads  time
1             1000
2             630
4             400
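
For reference, the OpenMP timings above can be turned into speedup and parallel efficiency figures. The following is a minimal Python sketch, not part of the original project. The post does not state the time unit, so only ratios are computed. The Karp-Flatt serial fraction is an extra diagnostic added here, and the function names are hypothetical.

```python
# Timings from the table above: thread count -> elapsed time (unit not stated).
timings = {1: 1000, 2: 630, 4: 400}


def speedup(timings, threads):
    """Speedup relative to the single-threaded run."""
    return timings[1] / timings[threads]


def efficiency(timings, threads):
    """Parallel efficiency: speedup divided by the number of threads."""
    return speedup(timings, threads) / threads


def karp_flatt(timings, threads):
    """Experimentally determined serial fraction (Karp-Flatt metric)."""
    s = speedup(timings, threads)
    return (1 / s - 1 / threads) / (1 - 1 / threads)


for p in (2, 4):
    print(f"{p} threads: speedup {speedup(timings, p):.2f}, "
          f"efficiency {efficiency(timings, p):.2f}, "
          f"serial fraction {karp_flatt(timings, p):.2f}")
```

With these numbers the speedup is roughly 1.6x on 2 threads and exactly 2.5x on 4 threads, so a sizeable share of the run behaves as if it were serial.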

problem information:

0-based array is turned ON
PARDISO double precision computation is turned ON
Parallel METIS algorithm at reorder step is turned ON
Scaling is turned ON


Summary: ( reordering phase )
================

Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 0.022640 s
Time spent in reordering of the initial matrix (reorder)         : 0.398392 s
Time spent in symbolic factorization (symbfct)                   : 0.244236 s
Time spent in data preparations for factorization (parlist)      : 0.005588 s
Time spent in allocation of internal data structures (malloc)    : 0.014131 s
Time spent in additional calculations                            : 0.170513 s
Total time spent                                                 : 0.855501 s

Statistics:
===========
Parallel Direct Factorization is running on 1 OpenMP

< Linear system Ax = b >
             number of equations:           258687
             number of non-zeros in A:      2821302
             number of non-zeros in A (%): 0.004216

             number of right-hand sides:    1

< Factors L and U >
             number of columns for each panel: 128
             number of independent subgraphs:  0
             number of supernodes:                    50167
             size of largest supernode:               1041
             number of non-zeros in L:                32337906
             number of non-zeros in U:                1
             number of non-zeros in L+U:              32337907
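
As a quick sanity check on the statistics above, the non-zero percentage reported by PARDISO and the fill-in produced by the factorization can be recomputed from the printed counts. This is a minimal Python sketch with hypothetical helper names; the three input values are copied from the log.

```python
# Values copied from the PARDISO statistics above.
n = 258687          # number of equations
nnz_a = 2821302     # number of non-zeros in A
nnz_l = 32337906    # number of non-zeros in L


def density_percent(nnz, n):
    """Share of an n x n matrix taken by nnz entries, in percent."""
    return 100.0 * nnz / (n * n)


def fill_ratio(nnz_factor, nnz_matrix):
    """How many times more non-zeros the factor holds than the input."""
    return nnz_factor / nnz_matrix


print(f"density of A: {density_percent(nnz_a, n):.6f} %")
print(f"fill-in ratio nnz(L)/nnz(A): {fill_ratio(nnz_l, nnz_a):.2f}")
```

The density agrees with the 0.004216 printed in the log, and the factor L holds about 11.5 times as many non-zeros as the count reported for A.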


             

 

Daniel_S_2
Beginner

UPDATE:

After stripping down the project and switching to the Intel compiler, mkl_tbb_thread_dll is loaded but crashes :(

Here is the call stack:

>    mkl_tbb_thread.dll!00007ffa1095a067()    Unknown
     tbb.dll!tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task & parent, tbb::task * child) Line 467    C++
     tbb.dll!tbb::internal::arena::process(tbb::internal::generic_scheduler & s) Line 147    C++
     tbb.dll!tbb::internal::market::process(rml::job & j) Line 677    C++
     tbb.dll!tbb::internal::rml::private_worker::run() Line 276    C++
     tbb.dll!tbb::internal::rml::private_worker::thread_routine(void * arg) Line 229    C++
     ucrtbase.dll!00007ffa3dc982dd()    Unknown

 

This is probably some runtime version incompatibility.

The TBB runtime is

compilers_and_libraries_2016.2.180\windows\redist\intel64_win\tbb\vc14\tbb.dll

I also tested with vc_mt\tbb.dll.

A simple TBB for loop works fine in the same project.

It seems that mkl_tbb_thread_dll has ABI compatibility issues with the TBB runtime.

Any idea?

Thanks,

D

Gennady_F_Intel
Moderator

D.S., how can we reproduce the problem? I checked with some of the Pardiso examples, linked with the vc14 TBB DLL, and no issues were detected.

Daniel_S_2
Beginner

Hello Gennady,

I just got to test it again. The crash is data dependent. TBB works fine with a few test matrices I tried, but crashes in phase 11 with some of my data sets.

For a diagonal matrix of size 100000 with a few random off-diagonal elements, TBB was actually a little slower, and phase 11 does not seem to be threaded at all.

I can send you the data with a simple code that loads it if you need it.

Daniel

 

Gennady_F_Intel
Moderator

Daniel, please try setting iparm[1]=0 instead of iparm[1]=2 (the default) and check how it works on your side.

Daniel_S_2
Beginner

Still the same crash with iparm[1] = 0, 2, and 3.

I'm using 3, which makes phase 11 about 2x faster with OpenMP.

It seems that the data sets that don't crash are not using TBB threads at all.
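
For readers following the iparm[1] experiments: per the MKL PARDISO documentation, iparm[1] (zero-based C indexing) selects the fill-in reducing ordering used in phase 11, the analysis phase. The Python sketch below only tabulates the three values tried in this thread and fills a 64-entry array the way a C caller would. The helper name make_iparm is hypothetical, nothing here calls MKL, and the remaining entries are left at 0 purely for illustration.

```python
# Meaning of iparm[1] (fill-in reducing ordering) for the values tried here,
# as described in the MKL PARDISO documentation.
ORDERING = {
    0: "minimum degree algorithm",
    2: "nested dissection algorithm from METIS (default)",
    3: "parallel (OpenMP) version of nested dissection",
}


def make_iparm(ordering, zero_based=True):
    """Build a 64-entry iparm list; hypothetical helper for illustration."""
    if ordering not in ORDERING:
        raise ValueError(f"unsupported ordering value: {ordering}")
    iparm = [0] * 64
    iparm[0] = 1                        # non-zero: do not use solver defaults
    iparm[1] = ordering                 # fill-in reducing ordering
    iparm[34] = 1 if zero_based else 0  # 1 = zero-based (C-style) indexing
    # A real caller must also set the other entries it relies on,
    # because a non-zero iparm[0] disables the built-in defaults.
    return iparm


for value in (0, 2, 3):
    iparm = make_iparm(value)
    print(f"iparm[1] = {iparm[1]}: {ORDERING[value]}")
```

The zero_based default matches the "0-based array is turned ON" line in the solver output posted earlier in the thread.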

 

 

Here is the crash:

Exception thrown at 0x00007FFBF459A067 (mkl_tbb_thread.dll) in MklTester.exe: 0xC0000005: Access violation writing location 0x00000025485EC000.

Sometimes it happens on the main thread, sometimes on a worker thread.

Stack:

>    mkl_tbb_thread.dll!00007ffbf459a067()    Unknown
     tbb.dll!tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task & parent, tbb::task * child) Line 467    C++
     tbb.dll!tbb::internal::arena::process(tbb::internal::generic_scheduler & s) Line 147    C++
     tbb.dll!tbb::internal::market::process(rml::job & j) Line 677    C++
     tbb.dll!tbb::internal::rml::private_worker::run() Line 276    C++
     tbb.dll!tbb::internal::rml::private_worker::thread_routine(void * arg) Line 229    C++
     ucrtbase.dll!00007ffc1e9482dd()    Unknown

 

Gennady_F_Intel
Moderator

>> I can send you the data with a simple code that loads it if you need it. 

Daniel, we still don't see the problem on our side with the latest version. Could you please send us the data and the code so that we can reproduce the problem?

Thanks, Gennady

Daniel_S_2
Beginner

Hi Gennady,

Here is the Visual Studio 2015 project with data; just unzip, open the .sln, and run.

The matrix values are left empty to make the zip smaller, but I get the same result with the true values. It runs OK with OpenMP.

 

Gennady_F_Intel
Moderator

Daniel, with regard to the exception with TBB threading: I checked your example with MKL 11.3 update 2, linked with the TBB universal vc_mt DLL and with the vc12 and vc14 versions.

I used the test you provided (slightly modified by adding a call to mkl_get_version(&Version);), compiling and launching it from the command line because MSVC 2015 is not available on my system.

All cases work fine. Below is the output when vc14\tbb.dll is used:


..\_Forums\u611238_pardiso_tbb>_5tbb.exe
file mkl-860663123-00z.bin
matrix dim 258687
matrix nnz/2 2821302
64 bits
Major version:           11
Minor version:           3
Update version:          2
Product status:          Product
Build:                   20160120
Platform:                Intel(R) 64 architecture
Processor optimization:  Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors
================================================================

num threads 2

=== PARDISO: solving a symmetric positive definite system ===
1-based array indexing is turned ON
PARDISO double precision computation is turned ON
Parallel METIS algorithm at reorder step is turned ON
Scaling is turned ON

Summary: ( reordering phase )
================

Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 0.026380 s
Time spent in reordering of the initial matrix (reorder)         : 0.522636 s
Time spent in symbolic factorization (symbfct)                   : 0.264818 s
Time spent in data preparations for factorization (parlist)      : 0.006496 s
Time spent in allocation of internal data structures (malloc)    : 0.031902 s
Time spent in additional calculations                            : 0.198542 s
Total time spent                                                 : 1.050775 s

Statistics:
===========
Parallel Direct Factorization is running on 1 OpenMP

< Linear system Ax = b >
             number of equations:           258687
             number of non-zeros in A:      2821302
             number of non-zeros in A (%): 0.004216

             number of right-hand sides:    1

< Factors L and U >
             number of columns for each panel: 128
             number of independent subgraphs:  0
             number of supernodes:                    50167
             size of largest supernode:               1041
             number of non-zeros in L:                32337906
             number of non-zeros in U:                1
             number of non-zeros in L+U:              32337907


symolic   factorization time is  1379 ms

 

Daniel_S_2
Beginner

Hi Gennady,

I tested it on another computer, and it crashes every time when using TBB.

Windows 10, Visual Studio 2012/2015 Update 1, i7 2600 and i7 4770, MKL 11.3.2.1.

I'll stay with OMP for now...

Thanks
