LAPACK MKL faster on one thread than two - how come?

Tony_Garratt · ‎09-15-2009

On a dual-core win64 machine, I have some code that uses LAPACK and BLAS inside some numerical computations.
I experimented with MKL_NUM_THREADS as follows:

Not set (i.e. let MKL use both cores): CPU time=54.5s; wall clock time=28.0s
=1: CPU time=27.6s; wall clock time=27.7s

How come letting MKL use both cores uses more CPU and has a longer wall clock time?

Note: my PC does not have HT enabled.
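
For anyone wanting to repeat this kind of comparison, here is a minimal sketch of a timing harness. It only measures wall clock time, and the function name run_with_mkl_threads and the placeholder command are hypothetical; substitute the real executable that links against MKL.

```python
import os
import subprocess
import sys
import time


def run_with_mkl_threads(cmd, mkl_threads=None):
    """Run cmd once and return its wall clock time in seconds.

    mkl_threads=None leaves MKL_NUM_THREADS unset, so MKL chooses the
    thread count itself; otherwise the variable is set to that value.
    """
    env = dict(os.environ)
    env.pop("MKL_NUM_THREADS", None)
    if mkl_threads is not None:
        env["MKL_NUM_THREADS"] = str(mkl_threads)
    start = time.perf_counter()
    subprocess.run(cmd, env=env, check=True)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Placeholder command (an assumption): replace with the real solver run.
    cmd = [sys.executable, "-c", "pass"]
    for setting in (None, 1):
        label = "not set" if setting is None else str(setting)
        elapsed = run_with_mkl_threads(cmd, setting)
        print(f"MKL_NUM_THREADS {label}: wall clock {elapsed:.2f}s")
```

Setting the variable per child process keeps the two runs independent of whatever is set in the parent shell.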

TimP · ‎09-16-2009

I suppose it's not difficult to construct a case like that; there may be several ways. For example, if your case uses the entire cache with one thread, and performance is limited by cache, there may be no gain for 2 threads. A typical definition of CPU time (e.g. C clock() or Fortran cpu_time) adds up the time spent in each thread. Achieving a CPU time which indicates all threads are running at 100% is sometimes given as a goal.
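
To illustrate the point about CPU time being summed over threads, here is a rough sketch in Python (not the Fortran from the original application). The helper names measure and hash_work are hypothetical. It relies on time.process_time() reporting CPU time for the whole process and on hashlib releasing the GIL for large buffers, so on a machine with two free cores the two-thread run would be expected to show CPU time well above wall clock time.

```python
import hashlib
import threading
import time


def measure(fn):
    """Return (cpu_seconds, wall_seconds) for one call of fn()."""
    cpu_start = time.process_time()    # CPU time of the process, all threads
    wall_start = time.perf_counter()   # elapsed real time
    fn()
    return (time.process_time() - cpu_start,
            time.perf_counter() - wall_start)


def hash_work(n_threads, blocks=200, block_size=1 << 20):
    """Hash a zero-filled buffer repeatedly in n_threads threads."""
    data = bytes(block_size)

    def worker():
        digest = hashlib.sha256()
        for _ in range(blocks):
            digest.update(data)  # large updates run without holding the GIL

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == "__main__":
    for n in (1, 2):
        cpu, wall = measure(lambda: hash_work(n))
        print(f"{n} thread(s): CPU time={cpu:.2f}s; wall clock time={wall:.2f}s")
```

Note that the two-thread run here does twice the total work, so it shows how the two clocks diverge, not a speedup.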

Tony_Garratt · ‎09-16-2009

Thanks Tim for your reply. So basically we are saying that this behaviour is not unexpected. Or put another way, the algorithm inside MKL that decides how many threads to use may not always choose the optimal number?

TimP · ‎09-16-2009

Choice of number of threads may be based mainly on whether the problem is large enough to use all available threads. There is some cache blocking in MKL functions where it is appropriate, but I suppose cases where this won't enable multi-thread scaling wouldn't be detected.

Gennady_F_Intel · ‎09-17-2009

Tony,
what is the typical size of your task?
--Gennady

Tony_Garratt · ‎09-18-2009

To be more accurate, we are using a third party sparse linear solver inside our application and this solver makes heavy use of the BLAS, so the issue is related to BLAS, not LAPACK as I first thought. Our problem size is n=163. The sparse solver makes use of at least level 1 and 2 BLAS and possibly level 3 (I can check if knowing this is important).

We are using Fortran 10.0.25 and MKL 10.1.1 on a Windows win64 (2-core, no HT) machine, but we are also seeing the same type of behaviour on Linux machines (8-core, HT) too.

Tony_Garratt · ‎09-18-2009

I tried another test. I extracted a matrix from our application and set up an off-line test to solve and factorise that matrix repeatedly. Here are the results (on a dual core win64 machine):

MKL_NUM_THREADS   CPU time (s)   Wall clock (s)
Not set           91.5           47.66
1                 54.4           54.61

In this case, the CPU time is a lot higher when the 2 cores are used, but the wall clock time does go down. What this tells me is that (not surprisingly) there is a cost to multi-threading, but that cost generally pays off.
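
As a rough worked example, the numbers from both experiments can be turned into speedup and efficiency figures. This is only a sketch: the helper name summarise is hypothetical, and it assumes the second row of the off-line table is the MKL_NUM_THREADS=1 run with CPU time 54.4 and wall clock 54.61.

```python
def summarise(serial_wall, parallel_wall, serial_cpu, parallel_cpu, cores=2):
    """Derive simple scaling figures from measured CPU and wall clock times."""
    speedup = serial_wall / parallel_wall
    return {
        "speedup": speedup,                          # > 1 means threading helped
        "efficiency": speedup / cores,               # 1.0 would be perfect scaling
        "busy_cores": parallel_cpu / parallel_wall,  # average cores kept busy
        "extra_cpu": parallel_cpu - serial_cpu,      # CPU seconds added by threading
    }


# In-application run from the first post.
in_app = summarise(serial_wall=27.7, parallel_wall=28.0,
                   serial_cpu=27.6, parallel_cpu=54.5)

# Off-line repeated factorise/solve test (second row read as MKL_NUM_THREADS=1).
offline = summarise(serial_wall=54.61, parallel_wall=47.66,
                    serial_cpu=54.4, parallel_cpu=91.5)

if __name__ == "__main__":
    for name, result in (("in-app", in_app), ("off-line", offline)):
        print(name, {key: round(value, 3) for key, value in result.items()})
```

In both cases close to two cores are kept busy, yet the wall clock gain is small or absent, which is consistent with much of the extra CPU time being threading overhead and not useful work.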

Gennady_F_Intel · ‎09-20-2009

Garry,
If I understood you right, you are using a third party solver and a BLAS routine (is it dgemm or another routine?) with a square matrix (163x163). Am I right?
I guess the third party solver is not an MKL routine.
Could you send us similar performance numbers for the BLAS routine?
And one more question: what is the CPU type you are running on?
--Gennady

Tony_Garratt · ‎09-21-2009

The third party solver uses a variety of BLAS routines, and I am not sure which one of these could be the culprit. The third party solver is not MKL's; it is a Fortran sparse linear solver. I need to dig into the third party code and try to track it down, but it will take some time. It is likely to be DGEMM, but I am not 100% sure. The matrix that the third party solver is solving is 163x163, but I need to make sure that N=163 on the BLAS calls because it may be doing some partitioning.

So, what you would like is for me to break the problem down and try to find out which BLAS routine is the culprit?

I'm running on win64, chip details below, but we have also seen similar behaviour on Linux.

Intel Xeon CPU 5150 @ 2.66GHz, no HT

If you can confirm the next steps, I can work with you to diagnose this problem further...

thank you!
Tony

Tony_Garratt · ‎09-23-2009

Hi Gennady - any update please?

jaewonj · ‎09-28-2009

Just out of curiosity, what is the name of the third party sparse linear system solver?

The solution time of 27 seconds is too slow for a 163-by-163 linear system, so I assume 163 is the BLAS block size? Or are you using an iterative solver?
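
A back-of-the-envelope sketch of why 27 seconds looks too long: even treating the matrix as dense, an LU factorisation costs roughly 2n^3/3 floating point operations. The sustained rate of 1 GFLOP/s below is purely an assumed round number, not something measured in this thread, and a sparse factorisation would normally need fewer operations still.

```python
def dense_lu_flops(n):
    """Leading-order operation count for LU factorisation of a dense n-by-n matrix."""
    return 2.0 * n ** 3 / 3.0


n = 163                           # problem size reported in the thread
assumed_flops_per_second = 1.0e9  # hypothetical sustained rate (assumption)
reported_wall_seconds = 27.7      # single-thread wall clock time from the first post

seconds_per_factorisation = dense_lu_flops(n) / assumed_flops_per_second
implied_factorisations = reported_wall_seconds / seconds_per_factorisation

if __name__ == "__main__":
    print(f"flops per dense factorisation: {dense_lu_flops(n):.3e}")
    print(f"seconds per factorisation at the assumed rate: {seconds_per_factorisation:.2e}")
    print(f"factorisations that would fit in the reported time: {implied_factorisations:.0f}")
```

Under these assumptions a single factorisation would take milliseconds, so the reported time would have to cover thousands of solves or a much larger effective problem.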

Tony_Garratt · ‎11-05-2009

Unfortunately, I cannot say which sparse solver we were using.