...and one more question:
To verify the performance improvement with PARDISO, I ran a benchmark problem with my app. I was really happy to see that it was roughly two times faster than my old solver, because the old one does not make use of multiple cores. Given this result on a dual-core, I expected something like a threefold increase on my quad-core. But surprisingly, the speed increase was again just a factor of two.
Dual-core: old solver: 72 sec, PARDISO: 35 sec
Quad-core: old solver: 66 sec, PARDISO: 32 sec
(the quad-core has slightly faster clock & memory)
Does it really scale that poorly with more cores? Or am I doing something wrong?
thanks,
17 Replies
So little information. Did you try setting KMP_AFFINITY=compact, or something similar, which is likely to be needed if you have a Core 2 Quad running 4 threads? Are you relying on automatic selection of the number of threads, and how many did it choose? If it chose 2 threads, that possibly corresponds to your problem size.
Quoting - tim18
So little information. Did you try setting KMP_AFFINITY=compact, or something similar, which is likely to be needed if you have a Core 2 Quad running 4 threads? Are you relying on automatic selection of the number of threads, and how many did it choose? If it chose 2 threads, that possibly corresponds to your problem size.
here comes more:
No, I haven't used KMP_AFFINITY yet. Yes, I'm using Core 2 processors (one system has a Duo, the other a Quad). According to the Task Manager performance graph, CPU load always went to (or close to) 100%, with roughly equal load on the different cores.
One more interesting thing:
Meanwhile I also checked on a single-core machine. The improvement over my old solver (UMFPACK) appears to be largest on the single core, almost a factor of three. On the multi-core machines, the difference between the old (single-threaded) solver and MKL PARDISO is only about a factor of two.
Please don't invest any time in answering this last question, as I don't NEED to know why this is; I'm just happy it is!
But for the multi-core issue I'd really appreciate your assistance, as it would obviously be nice to be able to exploit the potential performance benefits of multi-core machines (especially in the future!)...
Please let me know if you need any more info!
thanks,
... I tried KMP_AFFINITY=granularity=fine,compact,1,0. No improvement!
I have read a lot about MKL scaling very well. What am I doing wrong?
The improvement from 1 core to 4 cores is only about a 20% drop in calculation time...
???
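For reference, the figure above can be read through Amdahl's law. The following is only a minimal sketch, assuming the "20% drop" means the 4-core run takes 0.8 times the 1-core time; the function names are made up for illustration and it says nothing about where inside PARDISO the serial time is spent.

```python
def parallel_fraction(speedup: float, cores: int) -> float:
    """Invert Amdahl's law S = 1 / ((1 - p) + p / n) for the parallel fraction p."""
    if cores < 2:
        raise ValueError("need at least two cores to estimate p")
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / cores)


def amdahl_speedup(p: float, cores: int) -> float:
    """Predicted speedup for parallel fraction p on the given number of cores."""
    return 1.0 / ((1.0 - p) + p / cores)


# Assumption: "20% drop" = the 4-core time is 0.8 x the 1-core time.
observed = 1.0 / 0.8
p = parallel_fraction(observed, cores=4)
print(f"observed speedup: {observed:.2f}")
print(f"implied parallel fraction: {p:.2f}")
print(f"upper bound with unlimited cores: {1.0 / (1.0 - p):.2f}")
```

Under that assumption, only about a quarter of the run time would be benefiting from the extra cores, so adding cores alone could not be expected to help much for this problem.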
Hello, could you please provide us with more details about your run?
What is the size of your matrix? Did you set the number of threads via the OMP_NUM_THREADS variable, or did you leave it to be chosen automatically?
It would be great if you could provide us with the matrix or with a code example as well.
Konstantin
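One way to make the thread count explicit for each benchmark run is to build the environment in a small script. This is a sketch only: `benchmark_env` and the executable name `./solver_benchmark` are hypothetical. It relies on the documented behaviour that OMP_NUM_THREADS applies to all OpenMP regions in the process, while MKL_NUM_THREADS applies to MKL only and takes precedence inside MKL.

```python
import os


def benchmark_env(num_threads=None, affinity=None, base=None):
    """Return a copy of the environment with the threading variables set.

    num_threads=None leaves the thread count to the runtime's automatic choice.
    """
    env = dict(os.environ if base is None else base)
    # Drop inherited values so each run is controlled explicitly.
    for name in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "KMP_AFFINITY"):
        env.pop(name, None)
    if num_threads is not None:
        # Set both so the OpenMP runtime and MKL agree on the count.
        env["OMP_NUM_THREADS"] = str(num_threads)
        env["MKL_NUM_THREADS"] = str(num_threads)
    if affinity is not None:
        env["KMP_AFFINITY"] = affinity
    return env


one = benchmark_env(num_threads=1, base={})
four = benchmark_env(num_threads=4, affinity="granularity=fine,compact,1,0", base={})
print(one)
print(four)
# To launch a run (hypothetical executable name):
# subprocess.run(["./solver_benchmark"], env=four, check=True)
```

Running the same benchmark with 1, 2 and 4 threads set this way gives directly comparable timings, instead of comparing against whatever the automatic selection picked.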
Quoting - Konstantin Arturov (Intel)
Hello, could you please provide us with more details about your run?
What is the size of your matrix? Did you set the number of threads via the OMP_NUM_THREADS variable, or did you leave it to be chosen automatically?
It would be great if you could provide us with the matrix or with a code example as well.
Konstantin
Hello!
It's a 3500x3500 complex, symmetric, extremely sparse matrix. I'm using mtype=6.
I tried KMP_AFFINITY (see above), with no notable change.
I used MKL_NUM_THREADS (or OMP_NUM_THREADS, I don't recall; does it make a difference?) to limit the number of threads to ONE, in order to compare against the performance with all threads (two or four threads/cores, depending on the machine). For the "all" case I did not set MKL_NUM_THREADS (or OMP_NUM_THREADS).
My steps are:
Phase 11: symbolic factorization (done only once)
loop: (few hundred times)
{
Phase 22
Phase 33
}
cleanup
Do you need the matrix itself?
thanks,
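To see which part of the loop described above dominates, the two repeated phases can be timed separately with a wall-clock timer. The sketch below is generic: `time_phases` is a made-up helper, and `fake_factorize` and `fake_solve` are stand-ins that would have to be replaced by wrappers around the actual phase 22 and phase 33 calls.

```python
import time


def time_phases(factorize, solve, iterations):
    """Call factorize() then solve() `iterations` times; return total seconds per phase."""
    totals = {"factorize": 0.0, "solve": 0.0}
    for _ in range(iterations):
        start = time.perf_counter()
        factorize()
        mid = time.perf_counter()
        solve()
        end = time.perf_counter()
        totals["factorize"] += mid - start
        totals["solve"] += end - mid
    return totals


# Stand-ins for the real calls; they only burn a little CPU time.
def fake_factorize():
    sum(i * i for i in range(20000))


def fake_solve():
    sum(i for i in range(2000))


totals = time_phases(fake_factorize, fake_solve, iterations=50)
for phase, seconds in totals.items():
    print(f"{phase}: {seconds:.4f} s")
```

Wall-clock time is used here rather than the CPU-load graph, because a thread that is spinning while it waits for work also shows up as busy in such a graph.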
Quoting - Perpf2000
It's a 3500x3500 complex, symmetric, extremely sparse matrix. I'm using mtype=6.
Quoting - tim18
Can we assume then that it has a high cache miss rate? That would match the symptom that performance depends more on memory speed than on number of cores.
and what could I do about it?
tnx
Quoting - Perpf2000
Hello!
It's a 3500x3500 complex, symmetric, extremely sparse matrix. I'm using mtype=6.
I tried KMP_AFFINITY (see above), with no notable change.
I used MKL_NUM_THREADS (or OMP_NUM_THREADS, I don't recall; does it make a difference?) to limit the number of threads to ONE, in order to compare against the performance with all threads (two or four threads/cores, depending on the machine). For the "all" case I did not set MKL_NUM_THREADS (or OMP_NUM_THREADS).
My steps are:
Phase 11: symbolic factorization (done only once)
loop: (few hundred times)
{
Phase 22
Phase 33
}
cleanup
Do you need the matrix itself?
thanks,
In fact, the matrix is very small. I mean that there is too little computation to supply all threads with enough work. Anyway, we would appreciate it if you sent your test case (matrix) to us. We would add it to our base of user test cases and could track it in future releases.
Thanks,
Konstantin
Quoting - Konstantin Arturov (Intel)
In fact, the matrix is very small. I mean that there is too little computation to supply all threads with enough work. Anyway, we would appreciate it if you sent your test case (matrix) to us. We would add it to our base of user test cases and could track it in future releases.
Thanks,
Konstantin
If there wasn't enough work for every thread, why are all four threads (on the quad-core) 100% busy for many seconds at each step of the loop?
Please clarify! Thank you!
Quoting - Perpf2000
Konstantin,
If there wasn't enough work for every thread, why are all four threads (on the quad-core) 100% busy for many seconds at each step of the loop?
Please clarify! Thank you!
I don't know the exact answer and can only guess. It will be easier for me if I can take a look at the test. Please send your matrix so that I can make some runs.
Quoting - Konstantin Arturov (Intel)
I don't know the exact answer and can only guess. It will be easier for me if I can take a look at the test. Please send your matrix so that I can make some runs.
OK! Understand that!
What format should the matrix be in? I could probably generate a text file with the complex entries as CSV or something like that. What exactly would you need?
tnx
Quoting - Perpf2000
OK! Understand that!
What format should the matrix be in? I could probably generate a text file with the complex entries as CSV or something like that. What exactly would you need?
tnx
If you don't have the matrix in HB format or something similar, you could just send a text file; coordinate format would be OK: a sequence of lines of the form "i j value".
Thanks,
Konstantin
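A coordinate-format dump along these lines could be produced with a few lines of script. This is a sketch under stated assumptions: the reply does not say how a complex "value" should be written, so the real and imaginary parts are emitted as two columns; indices are written 1-based; and only the upper triangle is kept because the matrix is symmetric. `write_coordinate` is a made-up helper and the entries are a tiny invented example, not the matrix from this thread.

```python
import io


def write_coordinate(entries, out):
    """Write (row, col, value) triples as one 'i j re im' line each.

    Input indices are 0-based and are written 1-based. Only entries with
    row <= col are kept, i.e. the upper triangle of a symmetric matrix.
    """
    count = 0
    for row, col, value in sorted(entries, key=lambda e: (e[0], e[1])):
        if row > col:
            continue  # symmetric: the lower triangle is redundant
        z = complex(value)
        # repr() keeps full floating-point precision.
        out.write(f"{row + 1} {col + 1} {z.real!r} {z.imag!r}\n")
        count += 1
    return count


# Tiny made-up example with 0-based indices.
entries = [(0, 0, 2 + 1j), (0, 2, -1 + 0.5j), (2, 0, -1 + 0.5j), (1, 1, 3.5)]
buf = io.StringIO()
written = write_coordinate(entries, buf)
print(buf.getvalue(), end="")
```

For a real dump, `buf` would be replaced by a file opened for writing, and it is worth stating the index base and column layout in the accompanying message.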
Quoting - Konstantin Arturov (Intel)
If you don't have the matrix in HB format or something similar, you could just send a text file; coordinate format would be OK: a sequence of lines of the form "i j value".
Thanks,
Konstantin
OK!
I will be out of town for a while, so it will take a little time until I can send the data. But I'll get back to this!
Thanks so far!
Hello, I know this thread is old, but I am having the same difficulties. See Pardiso Crash, user=h88433.
Was this issue ever cleared up?
Hi!
Well, it's been a while. I had to set this issue aside, as I was busy with other things.
No, the issue hasn't been cleared up yet. I'm just now getting back to this problem, and the poor scaling still troubles me.
regards
Hi Nils,
Are you using the latest MKL version (MKL 11.0 update 1) with the matrix? You may submit an issue to <> if you have any problem uploading your matrix here.
Best Regards,
Ying