dtrnlsp_solve spinning/sleeping when called from multiple threads

Steven_H_1 · ‎09-08-2015

I am using the trust region solver in MKL and am having issues where dtrnlsp_solve takes significantly longer than normal to complete. I have many threads that each need to run an optimization using the trust region solver; each optimization problem has about 200 residuals and 40-70 unknowns. When a large number of threads need to perform the optimization, I see (through concurrency profiling) that many of the threads are blocked in the solve for up to 20 times longer than a normal solve. This behaviour starts to appear when about 40-60 threads could call the trust region solver. I have tried two versions of MKL. Initially I was using version 11.1.2 and saw the trust region threads spinning with a call stack ending in mkl_serv_lock <- mkl_serv_deallocate. I then tried version 11.3.0 and saw the threads spinning or sleeping in tbb under mkl_serv_allocate.
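For anyone trying to reproduce this without a concurrency profiler, one option is to time each solve and count the ones that take far longer than the median. The following is only a minimal sketch: the callable passed to time_solve_ms is a hypothetical stand-in for the application's dtrnlsp_solve loop, and the helper names are made up for illustration.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <vector>

// Time one call of a solve callable, in milliseconds.
// `solve` is a hypothetical stand-in for the application's dtrnlsp_solve loop.
template <typename Solve>
double time_solve_ms(Solve&& solve) {
    const auto start = std::chrono::steady_clock::now();
    solve();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Median of the recorded durations (upper median for even sizes).
// The vector is taken by value so the caller's data is not reordered.
double median_ms(std::vector<double> durations) {
    if (durations.empty()) return 0.0;
    const std::size_t mid = durations.size() / 2;
    std::nth_element(durations.begin(), durations.begin() + mid, durations.end());
    return durations[mid];
}

// Count solves that took more than `factor` times the median duration.
std::size_t count_stalled(const std::vector<double>& durations, double factor) {
    const double threshold = factor * median_ms(durations);
    return static_cast<std::size_t>(
        std::count_if(durations.begin(), durations.end(),
                      [threshold](double d) { return d > threshold; }));
}
```

Each thread would append its own timings to a per-thread vector, and the vectors would be merged after the run; calling count_stalled(all_durations, 20.0) then counts the solves that match the "20 times longer" description above.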

I'm using external threading, so I am running MKL in sequential mode. I'm also using the tbb allocator and the 64-bit versions of the libraries.

Ideally I would like to find a solution that works for MKL version 11.1.2. There appears to be a small change in the solution produced by the optimization between 11.1.2 and 11.3.0 with the older version appearing to converge to a smaller overall error.

Thanks in advance

Steven

Gennady_F_Intel · ‎09-08-2015

Steven, how many external threads do you create when calling the sequential version of the MKL routine? And how many threads are available on your system?

Steven_H_1 · ‎09-09-2015

Hi Gennady

Thanks for getting back to me. I originally noticed the problem in an application with about 60 external threads calling MKL for part of their processing. This was running on an 8 core i7 with hyperthreading.

I have now run tests on three different computers. The first has a quad core i7, 8 GB RAM, hyperthreading turned off. The second has an 8 core i7, 16 GB RAM, hyperthreading turned on. The third has two 6 core Xeons, 12 GB RAM, hyperthreading turned off. I have run the same test on all three with 4, 8, 16, 20, 40 and 80 threads. In each test the total computation required is the same. For all three machines I see very similar behaviour. In the following results the processing times are approximate and relative to the processing time for 4 threads on that machine. These results were collected from my test setup with MKL 11.1.2.
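The shape of such a test (fixed total work, varying thread count) can be sketched roughly as below. This is not the actual test code: fake_solve is a hypothetical placeholder for one optimization, where the real test would run the MKL trust region solve, and run_fixed_work is an illustrative name.

```cpp
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for one optimization; the real test would run the
// trust region solve here instead of this arithmetic loop.
double fake_solve(std::size_t seed) {
    double acc = static_cast<double>(seed);
    for (int i = 0; i < 20000; ++i) acc = acc * 0.999 + 1.0;
    return acc;
}

// Split `total_solves` solves across `num_threads` threads.
// Returns wall-clock seconds; `completed` receives the number of solves run.
double run_fixed_work(std::size_t total_solves, std::size_t num_threads,
                      std::size_t& completed) {
    std::vector<std::size_t> done(num_threads, 0);
    std::vector<double> sink(num_threads, 0.0);  // results kept so the work is not optimized away
    std::vector<std::thread> workers;
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&done, &sink, total_solves, num_threads, t] {
            std::size_t local_done = 0;
            double local_sum = 0.0;
            // Thread t takes solves t, t + num_threads, t + 2 * num_threads, ...
            for (std::size_t i = t; i < total_solves; i += num_threads) {
                local_sum += fake_solve(i);
                ++local_done;
            }
            done[t] = local_done;  // one write per thread at the end
            sink[t] = local_sum;
        });
    }
    for (auto& w : workers) w.join();
    const auto stop = std::chrono::steady_clock::now();
    completed = 0;
    for (std::size_t d : done) completed += d;
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling run_fixed_work with the same total_solves for 4, 8, 16, 20, 40 and 80 threads and dividing each time by the 4-thread time gives relative figures of the kind listed below.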

Quad core

Threads   Processing Time (relative to 4 threads)   Locking Observed
4         1.0                                       No
8         0.9                                       No
16        0.9                                       No
20        0.9                                       Infrequent
40        1.1                                       Yes
80        1.3                                       Yes

8 Core

Threads   Processing Time (relative to 4 threads)   Locking Observed
4         1.0                                       No
8         0.6                                       No
16        0.4                                       No
20        0.4                                       Infrequent
40        0.5                                       Yes
80        1.4                                       Yes

12 Core

Threads   Processing Time (relative to 4 threads)   Locking Observed
4         1.0                                       No
8         0.6                                       No
16        0.4                                       Infrequent
20        0.4                                       Infrequent
40        0.4                                       Yes
80        0.8                                       Yes