I took the implementation of the PARDISO multiple-RHS (mrhs) version
into our application.
I followed Kalinkin's instructions.
The solver ran normally and the results were correct.
But the problem is the solve time.
In phase=33, with a single RHS,
the solve time was 0.03450 s.
But with 4 RHS,
the solve time was 0.532629 s.
I expected almost the same run time;
however, the 4-RHS case was about 15 times slower than
the single-RHS case.
I heard that we can reduce run time by using multiple RHS.
I tested some other test cases, but the results were similar.
What overhead factors could cause such results?
Is it right that PARDISO provides multithreading in the phase=33 solve part?
Dear Konstantin.
The MKL version is the latest one,
the OS is Redhat_AS4_U7,
the CPU is an Intel Xeon 3 GHz with 4 cores,
and the memory is 64 GB.
Thank you.
Regards.
B. Hwang
We have many cases, from 10,000 to millions of equations.
Here I tested two cases, with 10,000 and 150,000 equations.
Our application uses an iterative solve method.
I mean we do reordering and factorization only once,
and then we solve the equations repeatedly with changing right-hand sides.
Therefore, solve time is very critical in our application,
so we want to reduce the solve-phase run time.
As I mentioned above,
I used 4 RHS per call because of memory capacity.
So I set a new block of 4 RHS in every solve phase.
Except for this, everything is the same as in the single-RHS case.
Regards.
B. Hwang
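The workflow described in this post (factorize once, then solve repeatedly in blocks of 4 right-hand sides) can be sketched in a few lines. This is only an illustration under stated assumptions: it uses a small dense Cholesky factorization in pure Python as a stand-in for PARDISO's sparse factorization, and the names `cholesky`, `solve_block` and `solve_in_blocks` are hypothetical helpers, not MKL functions.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]
Vector = List[float]

def cholesky(a: Sequence[Sequence[float]]) -> Matrix:
    """Return lower-triangular L with A = L * L^T (A must be SPD)."""
    n = len(a)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(lower[i][k] * lower[j][k] for k in range(j))
            lower[i][j] = math.sqrt(s) if i == j else s / lower[j][j]
    return lower

def solve_block(lower: Matrix, rhs_block: Sequence[Vector]) -> List[Vector]:
    """Solve A x = b for every b in the block, reusing the factor L."""
    n = len(lower)
    solutions = []
    for b in rhs_block:
        # Forward substitution: L y = b
        y = [0.0] * n
        for i in range(n):
            y[i] = (b[i] - sum(lower[i][k] * y[k] for k in range(i))) / lower[i][i]
        # Backward substitution: L^T x = y
        x = [0.0] * n
        for i in reversed(range(n)):
            x[i] = (y[i] - sum(lower[k][i] * x[k] for k in range(i + 1, n))) / lower[i][i]
        solutions.append(x)
    return solutions

def solve_in_blocks(lower: Matrix, all_rhs: Sequence[Vector],
                    block_size: int = 4) -> List[Vector]:
    """Feed the right-hand sides to the solver block_size at a time."""
    solutions: List[Vector] = []
    for start in range(0, len(all_rhs), block_size):
        solutions.extend(solve_block(lower, all_rhs[start:start + block_size]))
    return solutions
```

The factorization is computed once and passed to every block solve, which mirrors running the reordering and factorization phases a single time and then calling the solve phase repeatedly. The block size of 4 is the value from the post; the last block may be shorter.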
Below are my answers.
1) We linked the libmkl_intel_thread library,
so reordering and factorization ran faster than with a single thread.
2) I set OMP_NUM_THREADS = 4, because our system has 4 cores.
3) The matrix is symmetric positive definite.
I also checked what happens when only the threading library is changed to libmkl_sequential.
The result was that single-threaded (libmkl_sequential) is much faster than
multithreaded (libmkl_intel_thread) in phase=33.
I don't understand this situation.
Thanks.
Regards.
B. Hwang
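A comparison like the one in this post can also be scripted by running the same threaded build with different thread counts. The helper below is a hedged sketch, not part of MKL: `thread_env` and `run_with_threads` are hypothetical names, and the command passed in is assumed to be your own benchmark executable. `OMP_NUM_THREADS` is the variable named in the post; `MKL_NUM_THREADS` is MKL's own thread-count variable and is set to the same value here.

```python
import os
import subprocess
from typing import Dict, List

def thread_env(num_threads: int) -> Dict[str, str]:
    """Copy the current environment and pin the thread count."""
    if num_threads < 1:
        raise ValueError("num_threads must be >= 1")
    env = dict(os.environ)
    env["OMP_NUM_THREADS"] = str(num_threads)
    env["MKL_NUM_THREADS"] = str(num_threads)
    return env

def run_with_threads(command: List[str], num_threads: int) -> str:
    """Run the command with the given thread count and return its stdout."""
    done = subprocess.run(command, env=thread_env(num_threads),
                          capture_output=True, text=True, check=True)
    return done.stdout
```

Calling `run_with_threads` once with 1 thread and once with 4 threads on the same benchmark gives a comparison that does not require relinking against libmkl_sequential, although it is not identical to the sequential library test described above.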
Dear Fedorov,
It is impossible to provide the test set.
But I'm sure it is not a test case problem.
The test cases are just simple symmetric positive definite
matrices which were proven to work correctly
with another solver and with PARDISO in single-RHS mode.
As you have already read in the posts above,
all operations were normal.
The problem is the speed of the solve phase (phase=33)
in multiple-RHS mode.
Have you tried an application like this?
How was it? Was multiple-RHS mode faster than single RHS?
Thank you.
Regards.
B. Hwang
Hi Bosun,
Thanks for raising this problem. The issue has been submitted to our
internal development tracking database for further investigation; we will
inform you once an update becomes available.
Here is a bug tracking number for your reference: DPD200190971
Regards, Gennady
Hi Konstantin.
Thank you for your reply.
I'm afraid there is a problem.
I hope you resolve it ASAP.
I also have a question about increasing the number of RHS.
Our system has only 4 cores.
On this system, is it meaningful to use 16 to 32 RHS?
Thank you.
Regards.
B. Hwang
I tested your advice.
But the N > K case is also slower than the 1-RHS case.
I hope the problem will be resolved ASAP.
Thank you.
Regards.
B. Hwang
Of course, I understood your explanation.
I mean, the total run time with N RHS was
slower than N times that of 1 RHS.
That is, the run time per RHS wasn't improved by
increasing the number of RHS.
In our example, the 16-RHS test case, the solve times were
1 RHS - 0.023120 s
16 RHS - 0.479048 s
where 0.023120 * 16 = 0.36992 s.
This means one 16-RHS solve is slower than 16 separate 1-RHS solves.
Regards.
B. Hwang
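The comparison in this post amounts to one ratio: the time for N separate single-RHS solves divided by the time for one N-RHS solve. A small sketch, using only timings quoted in this thread (the function name `block_speedup` is a hypothetical helper):

```python
def block_speedup(t_single: float, t_block: float, nrhs: int) -> float:
    """Time of nrhs separate 1-RHS solves divided by one nrhs-wide solve.

    A value above 1 means the multiple-RHS call wins;
    a value below 1 means it is slower than solving one RHS at a time.
    """
    if t_single <= 0 or t_block <= 0 or nrhs < 1:
        raise ValueError("timings must be positive and nrhs >= 1")
    return (t_single * nrhs) / t_block

# Timings quoted in the thread (seconds, phase=33).
ratio_16 = block_speedup(0.023120, 0.479048, 16)  # about 0.77
ratio_4 = block_speedup(0.03450, 0.532629, 4)     # about 0.26
```

Both ratios are below 1, which is the observation being reported: in these runs the multiple-RHS call costs more than the equivalent number of single-RHS calls.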
Our system is Intel Xeon 3GHz x2 c2 x86_64.
Is that what you wanted to know?
I also have a question.
We have used the TAUCS solver as our calculation engine.
Even though PARDISO was not running multithreaded in phase=33,
PARDISO in a single process is more than twice as fast as the TAUCS solver.
Could you tell me the reason? Is it because of an optimizing compiler?
Or some special algorithms?
Please explain the reasons briefly.
Best regards.
B. Hwang
The CPU is
Intel Xeon CPU 5160 @ 3.00GHz
Thank you.
I found out how I can improve the performance.
I raised the number of RHS from 16 to 1024 in powers of 2.
Above 32 RHS the speed went up,
and beyond 256 RHS the speed-up saturated.
I think the optimum number of RHS is 256 with 4 cores in
our case.
Finally I got about 3 to 5 times faster performance than with 1 RHS.
Can you explain this situation?
Does each core run its own thread?
And why couldn't the cases under 32 RHS get
a speed-up?
Is it because of the bug we detected?
Thanks, Konstantin.
Regards.
B. Hwang
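The experiment described here (doubling the number of RHS from 16 to 1024 and watching where the speed-up saturates) is easy to repeat with a small timing harness. The sketch below is generic and makes assumptions: `solve_block` stands for whatever function performs one multiple-RHS solve in your application, `make_rhs` builds a block of the requested width, and both names, along with `powers_of_two` and `time_per_rhs`, are hypothetical.

```python
import time
from typing import Callable, Dict, List, Sequence

Vector = List[float]

def powers_of_two(first: int, last: int) -> List[int]:
    """Return first, 2*first, 4*first, ... up to and including last."""
    counts = []
    n = first
    while n <= last:
        counts.append(n)
        n *= 2
    return counts

def time_per_rhs(solve_block: Callable[[Sequence[Vector]], object],
                 make_rhs: Callable[[int], Sequence[Vector]],
                 counts: Sequence[int],
                 repeats: int = 3) -> Dict[int, float]:
    """Best-of-repeats wall time per right-hand side for each block width."""
    results: Dict[int, float] = {}
    for nrhs in counts:
        block = make_rhs(nrhs)
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            solve_block(block)
            best = min(best, time.perf_counter() - start)
        results[nrhs] = best / nrhs
    return results
```

With `counts = powers_of_two(16, 1024)` the harness covers the same widths as in the post. Plotting the per-RHS time against the block width should show where the curve flattens on a given machine; the value of 256 reported above is specific to the poster's matrices and 4-core system.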