Question about multiple RHS

Bosun_Hwang · ‎08-17-2010

Dear All.

Itook the implementation of PARDISO mrhs version
into our application.
I followed Kalinkin's instruction.
The solver was operated normaly and resultswere correct.
But problem issolving time.
In phase=33, with single rhs case
the solve time was 0.03450 s.
But, with 4-rhs case
the solve time was 0.532629 s.
I expected the almost same run-time,
however 4-rhs case was 15 times slower than
single rhs.
I heard that we canreduce run-time using multiple rhs.
I tested some other test cases, but results were similar.

What are the overhead factors influenced such results?
Is it right thatthe PARDISO provides the multi threading in phase=33 solve part?

Konstantin_A_Intel · ‎08-17-2010

Hi Bosun,

May you provide a bit more details about your configuration?

Namely, what is your MKL version, OS and processor? How many cores is in your system? It's also good to know the number of equations in your task.

With this information we will be able to reproduce the situation and provide you with appropriate advise.

Thanks,

Konstantin

Bosun_Hwang · ‎08-18-2010

Dear Konstantin.

MKL version is the latest version and
O/S is Redhat_AS4_U7,
CPU is Intel Xeon 3GHz 4-core
Memory is 64Gb

Thank you.

Regards.
B. Hwang

Konstantin_A_Intel · ‎08-18-2010

Ok, thank you! And what is the number of equations? (roughly.. 100, 1000, 10000)

I'll make a couple of runs of similar task and will update you.

Regards,

Konstantin

Bosun_Hwang · ‎08-18-2010

Dear Konstantin

We have many cases, 10000 to millions of equations.
In this case, I tested two cases, 10000 and 150000 equations.

Our applicationtakesiterative solve method.
I mean we take reordering and factorization only one time,
and then we solve the equations repetitively as changing right hand side.
Therefore, solve time is very critical in our application.
So we want to reduce solve phase run-time.

As I mentioned above,
I used 4-RHS iteratively because of memory capacity.
So I set the new 4-RHS every solve phase.
Except this method, all things are the same as single RHS.

Regards.
B. Hwang

Konstantin_A_Intel · ‎08-18-2010

Hi,

I have a few more questions to have our input conditions 'aligned':

1) Did you link against libmkl_intel_thread library, not libmkl_sequential?

2) Did you set OMP_NUM_THREADS to any value?

3) Which type of matrix did you use in your test?

Thanks,

Konstantin

Bosun_Hwang · ‎08-18-2010

Dear Konstantin

Below is my answers.
1) We linked libmkl_intel_thread library.
So we operated reordering & factorization faster than single thread.
2) I set OMP_NUM_THREADS = 4, because our system has 4-core.
3) Positive definite symmetric matrix

I checked just about changing thread library as libmkl_sequential.
The resultwas that single thread(libmkl_sequential) is much faster than
multi threads(libmkl_intel_thread)in phase=33.
I don't understand this situation.

Thanks.

Regards.
B. Hwang

Gennady_F_Intel · ‎08-18-2010

Bosun,

I think the fastest way resolving/reproducing this issue isto provide for us the test case with the input data you are encountering this problem with.

--Gennady

Bosun_Hwang · ‎08-18-2010

DearFedorov.

It is impossible to provide the test set.
But I'm sure it is not a test case problem.
Test cases are justpositive definite symmetric
simple matrices which were provedas normal sets
in case of other solver and PARDISO single RHS mode.
As you already read above articles,
all operationswere normal.
The problem is the speed of solve phase =33
in multiple RHS mode.

Have you tried such like this application?
How was it? Speed of MRHS was faster than single RHS?

Thank you.

Regards.
B. Hwang

Konstantin_A_Intel · ‎08-19-2010

Hi Bosun,

Indeed, we reproduced the problem you described: performance of parallel solution phase with a few RHS is low. We will investigate the problem and try to fix it ASAP.

You may try to inprove performance of parallel solve phase by increasing the number of RHS (if possible) to 16-32. I hope this will let you reduce the time per computation of 1 RHS.

Thank you,

Konstantin

Gennady_F_Intel · ‎08-19-2010

Hi Bosun,
Thanks for your problem you raised. This issue has been submitted to our internal development tracking database for further investigation, we will inform you once a new update becomes available.
Here is a bug tracking number for your reference: DPD200190971
Regards, Gennady

Bosun_Hwang · ‎08-19-2010

Hi Konstantin.

Thank you for your reply.
I'm afraid it has a problem.
I hope youresolveit ASAP.

And I have a question about increasing of RHS numbers.
Our system has only 4 cores.
In this sytem, is it meaningful to use 16-32 RHS?

Thank you.

Regards.
B. Hwang

Konstantin_A_Intel · ‎08-20-2010

Hi Bosun,

Of course, solving more RHS is meaningful because parallel solve algorithm (where N RHS is just splitted via K threads) works better when N > K ("better" means that computational time per 1 RHS decreased).I would say, in this case theading overhead is not significant as far as each process has more work to do.

And it's only a question of your algorithm: how many independent RHS it has to solve at once? If it can be 16-32 or even more: please try it.

Regards,

Konstantin

Bosun_Hwang · ‎08-20-2010

Hi Konstantin

I tested your advise.
But, N > K case is also slower than 1 RHS case.
I hope the problem will be resolved ASAP.

Thankyou.

Regards.
B. Hwang

Konstantin_A_Intel · ‎08-22-2010

Hi Bosun,

What do you mean saying: "But, N > K case is also slower than 1 RHS case."? Do you mean that a time per 1 RHS is larger in case of many RHS, i.e.: time_1RHS < time_NRHS/N

Of course, solving N RHS cannot be faster than 1 RHS, but the time per 1 RHS (one right hand side) is better.

Here's an example:

I used Linux server similar to yours, and set OMP_NUM_THREADS=4:

1 RHS - 0.033 sec

32 RHS - 0.175 sec

In other words, in the second case I got 0.0055 sec per 1 RHS, or ~6x scalability. Note, that such a good scalability was achieved because more efficient level-3 BLAS was used in the implementation of NRHS solving phase (N RHS vectors are treated as a matrix).

Regards,

Konstantin

Bosun_Hwang · ‎08-22-2010

Hi Konstantin.

Of course, I understood your explanation.
I mean, total run-time of NRHS was
slower than that of 1-RHS.
That is,each 1-RHSrun-time wasn't improved by
increasing of RHS numbers.
In our exmple, 16-RHS test case, the solve time were
1 RHS - 0.023120s
16 RHS - 0.479048s
where 0.023120 * 16 = 0.368s
This means 16RHS slower than 1RHS.

Regards.
B. Hwang

Konstantin_A_Intel · ‎08-22-2010

Hi Bosun,

Ok, I'm glad that we both use the same terminology: all is clear here now, thanks.

As I mentioned, my times are a bit better (with 16, 32 RHS, at least, I observe scalability from 2x to 6x in comparison with 1 RHS).

Would you please specify an exact name of processor (like Intel Xeon CPU 5160) in order I can reproduce situation when NRHS=16 is slower than 1 RHS?

P.S. As Gennady mentioned, we work on the trackerDPD200190971 re slow solve phase with small RHS number.

Regards,

Konstantin

Bosun_Hwang · ‎08-23-2010

Hi Konstantin.

Our system is Intel Xeon 3GHz x2 c2 x86_64.
Is it right that you wanted to know?

And I have a question!
We used TAUCS solver as our calculation engine.
Despite PARDISO was not working in phase=33 as multi thread,
the PARDISO singleprocessis more twice faster than TAUCS solver.

Could you tell me the reason? Because of optimized compiler?
Or any special algorithms?
Please explain simple reasons about that.

Best regards.
B. Hwang

Konstantin_A_Intel · ‎08-23-2010

Hi Bosun,

Could you send me the output of this command (or attach)?

dmesg | grep CPU | sort -u

Re performance difference with TAUCS - I do not know the exact reason. Probably, it comes from more efficient matrix-vector operations which is implemented in MKL BLAS and are used in phase=33. But I'm not 100% sure.

Regards,

Konstantin

Bosun_Hwang · ‎08-23-2010

Hi Konstantin

The CPU is

Intel Xeon CPU 5160 @ 3.00GHz

Thank you.

Bosun_Hwang · ‎08-24-2010

Hi Konstantin

I found outhow I can improve the performance.
I raised the RHS number 16 to 1024 by square of 2.
Above 32 RHS numbers, its speed was going up,
and after 256 RHS numbers, the speed-up was saturated.
I think the optimum RHS numbers is 256 with 4-core in
our case.
Finaly I gotabout 3~5 timesfaster performancethan 1-RHS.

Can you explain this situation?
Do each cores operate asmulti-threading?

And why couldn't the under 32 RHS caseget
the speed up?
Because of the BUG we detected?

Thanks, Konstantin

Regards.
B. Hwang