On a dual-core win64 machine, I have some code that uses LAPACK and BLAS inside some numerical computations.
I experimented with MKL_NUM_THREADS as follows:
Not set (i.e. let MKL use both cores): CPU time = 54.5s; wall clock time = 28.0s
MKL_NUM_THREADS=1: CPU time = 27.6s; wall clock time = 27.7s
How come letting MKL use both cores uses more CPU and has a longer wall clock time?
Note: my PC does not have HT enabled.
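One way to read the two measurements: dividing CPU time by wall clock time gives the average number of cores kept busy. A minimal Python sketch of that arithmetic, using only the figures quoted above (the function name `busy_cores` is made up for illustration):

```python
def busy_cores(cpu_seconds, wall_seconds):
    """Average number of cores kept busy: total CPU time / elapsed time."""
    return cpu_seconds / wall_seconds

# Figures quoted in the post above.
unset = busy_cores(54.5, 28.0)       # MKL_NUM_THREADS not set
one_thread = busy_cores(27.6, 27.7)  # MKL_NUM_THREADS=1

print(f"not set: {unset:.2f} cores busy on average")
print(f"=1:      {one_thread:.2f} cores busy on average")
```

With the variable unset, both cores are busy for nearly the whole run, yet the elapsed time is no shorter, which is what makes the result look odd.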
Thanks, Tim, for your reply. So basically we are saying that this behaviour is not unexpected. Or, put another way, the algorithm inside MKL that decides how many threads to use may not always choose the optimal number?
Tony,
what is the typical size of your task?
--Gennady
To be more accurate, we are using a third-party sparse linear solver inside our application, and this solver makes heavy use of the BLAS, so the issue is related to BLAS, not LAPACK as I first thought. Our problem size is n=163. The sparse solver uses at least level 1 and level 2 BLAS and possibly level 3 (I can check if knowing this is important).
We are using Fortran 10.0.25 and MKL 10.1.1 on a Windows win64 machine (2 cores, no HT), but we are also seeing the same type of behaviour on Linux machines (8 cores, HT).
I tried another test. I extracted a matrix from our application and set up an off-line test to solve and factorise that matrix repeatedly. Here are the results (on a dual core win64 machine):
MKL_NUM_THREADS    CPU time    Wall clock
Not set            91.5        47.66
1                  54.4        54.61
In this case, the CPU time is much higher when both cores are used, but the wall clock time does go down. What this tells me is that (not surprisingly) there is a cost to multi-threading, but that cost generally pays off.
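To put numbers on that trade-off, here is a small Python sketch of the usual speedup and parallel-efficiency ratios. It assumes the table reads as one thread giving CPU 54.4 and wall clock 54.61, and the unset case (2 cores) giving CPU 91.5 and wall clock 47.66; the function names are illustrative only.

```python
def speedup(wall_serial, wall_parallel):
    """How many times faster the parallel run finishes."""
    return wall_serial / wall_parallel

def efficiency(wall_serial, wall_parallel, cores):
    """Speedup divided by the number of cores used (1.0 is ideal)."""
    return speedup(wall_serial, wall_parallel) / cores

# Assumed reading of the table: 1 thread -> CPU 54.4, wall 54.61;
# not set (2 cores) -> CPU 91.5, wall 47.66.
s = speedup(54.61, 47.66)
e = efficiency(54.61, 47.66, cores=2)
cpu_overhead = 91.5 / 54.4  # total CPU time relative to the 1-thread run

print(f"speedup {s:.2f}x, efficiency {e:.0%}, CPU time ratio {cpu_overhead:.2f}x")
```

Under that reading the second core shortens the run by roughly 15% while total CPU time grows by roughly two thirds.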
Garry,
If I understood you right, you are using a third-party solver and a BLAS routine (is it dgemm or another routine?) with a square matrix (163x163). Am I right?
I guess the third-party solver is not an MKL routine.
Could you send us similar performance numbers for the BLAS routine?
And one more question - what is the CPU type you are running on?
--Gennady
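One way to collect such numbers is to run the same benchmark in separate processes, each started with a different MKL_NUM_THREADS value in its environment, so there is no question of when the library reads the variable. The Python sketch below shows only the harness: the child workload is a placeholder, not a BLAS call, and `run_with_threads` is a hypothetical helper name.

```python
import os
import subprocess
import sys

# Placeholder child script: stands in for the real BLAS-heavy benchmark.
CHILD = r"""
import os, time
t_wall, t_cpu = time.perf_counter(), time.process_time()
sum(i * i for i in range(200_000))  # placeholder work, not a BLAS call
print(os.environ.get("MKL_NUM_THREADS", "unset"),
      time.process_time() - t_cpu,   # CPU time, summed over threads
      time.perf_counter() - t_wall)  # wall clock time
"""

def run_with_threads(setting):
    """Run the child with MKL_NUM_THREADS=setting, or unset if None."""
    env = dict(os.environ)
    env.pop("MKL_NUM_THREADS", None)
    if setting is not None:
        env["MKL_NUM_THREADS"] = str(setting)
    out = subprocess.run([sys.executable, "-c", CHILD], env=env,
                         capture_output=True, text=True, check=True)
    label, cpu, wall = out.stdout.split()
    return label, float(cpu), float(wall)

for setting in (None, 1, 2):
    print(run_with_threads(setting))
```

Replacing the placeholder with a loop over a single BLAS call at the sizes the solver actually uses would give per-routine CPU and wall clock figures comparable to those posted earlier.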
The third-party solver uses a variety of BLAS routines, and I am not sure which of them could be the culprit. The third-party solver is not MKL's; it is a Fortran sparse linear solver. I need to dig into the third-party code and try to track it down, but it will take some time. It is likely to be DGEMM, but I am not 100% sure. The matrix that the third-party solver is solving is 163x163, but I need to confirm that N=163 on the BLAS calls, because the solver may be doing some partitioning.
So, what you would like is for me to break the problem down and try to find out which BLAS routine is the culprit?
I'm running on win64 (chip details below), but we have also seen similar behaviour on Linux.
Intel Xeon CPU 5150 @ 2.66GHz, no HT
If you can confirm the next steps, I can work with you to diagnose this problem further...
thank you!
Tony
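As a rough illustration of how the culprit routine could be narrowed down, the sketch below wraps each routine in a timer and accumulates wall clock time and call counts per routine name. It is written in Python for brevity; the real solver is Fortran, and the `dot` and `axpy` functions here are hypothetical stand-ins, not the solver's actual BLAS calls.

```python
import time
from collections import defaultdict
from functools import wraps

totals = defaultdict(float)  # routine name -> accumulated wall seconds
calls = defaultdict(int)     # routine name -> number of calls

def timed(name):
    """Decorator that adds each call's elapsed time to totals[name]."""
    def decorate(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                totals[name] += time.perf_counter() - start
                calls[name] += 1
        return wrapper
    return decorate

# Hypothetical stand-ins for the solver's BLAS calls.
@timed("dot")
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

@timed("axpy")
def axpy(alpha, x, y):
    return [alpha * a + b for a, b in zip(x, y)]

x, y = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
dot(x, y)
axpy(2.0, x, y)

# Report routines from most to least accumulated time.
for name in sorted(totals, key=totals.get, reverse=True):
    print(name, calls[name], totals[name])
```

Comparing such a report between a run with MKL_NUM_THREADS=1 and a run with it unset would show which routine's time changes, and logging the dimensions passed to each call would settle whether N=163 actually reaches the BLAS.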
Hi Gennady - any update please?
The solution time of 27 seconds is too slow for a 163-by-163 linear system, so I assume 163 is the BLAS block size? Or are you using an iterative solver?
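A back-of-the-envelope check of why 27 seconds looks too long: a dense LU factorization of an n-by-n matrix costs about 2n^3/3 floating-point operations, which is roughly an upper bound for a sparse direct factorization of the same size. The Python sketch below evaluates that for n=163; the 1 Gflop/s rate is an illustrative assumption, not a measurement from this thread.

```python
def dense_lu_flops(n):
    """Leading-order flop count of a dense LU factorization: 2/3 * n^3."""
    return 2 * n**3 / 3

n = 163
flops = dense_lu_flops(n)
assumed_rate = 1e9  # flop/s; an illustrative assumption, not a measurement

print(f"{flops:.3g} flops, about {flops / assumed_rate * 1e3:.1f} ms "
      f"at the assumed rate")
```

A single factorization at this size should therefore take milliseconds at most, so a run of tens of seconds implies either a very large number of repeated solves or a larger underlying problem.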
Unfortunately, I cannot say which sparse solver we are using.