Dear All,
I took the implementation of the PARDISO multiple-RHS (mrhs) version into our application, following Kalinkin's instructions. The solver operated normally and the results were correct, but the problem is the solve time.
In phase=33, with a single RHS the solve time was 0.03450 s, but with 4 RHS it was 0.532629 s. I expected almost the same run-time; however, the 4-RHS case was about 15 times slower than the single-RHS case. I had heard that run-time can be reduced by using multiple RHS. I tested some other cases, but the results were similar.
What overhead factors could cause such results? Is it right that PARDISO provides multi-threading in the phase=33 solve part?
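For reference, this is roughly how the phase=33 call is set up when several right-hand sides are passed in one call (a minimal sketch of the standard PARDISO interface rather than our exact application code; the analysis and factorization phases 11/22 are assumed to have run already, and the variable names are placeholders):

```c
#include "mkl_pardiso.h"
#include "mkl_types.h"

/* Solve phase only: pt, iparm, mtype and the CSR arrays a/ia/ja are the
   same handles/arrays already used in phases 11 (analysis) and
   22 (factorization). */
void solve_phase33(void *pt[64], MKL_INT iparm[64],
                   MKL_INT n, MKL_INT mtype,
                   double *a, MKL_INT *ia, MKL_INT *ja,
                   double *b, double *x, MKL_INT nrhs)
{
    MKL_INT maxfct = 1, mnum = 1, msglvl = 0, error = 0;
    MKL_INT phase = 33;   /* forward/backward substitution only */
    MKL_INT idum;         /* dummy permutation argument */

    /* b and x hold nrhs vectors of length n stored back to back:
       RHS k occupies b[k*n .. k*n + n - 1]. */
    pardiso(pt, &maxfct, &mnum, &mtype, &phase, &n,
            a, ia, ja, &idum, &nrhs, iparm, &msglvl, b, x, &error);
}
```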
Hello Bosun,
I'm glad that you achieved good performance and hope it will be a workaround for you for a while. As I mentioned, increasing NRHS should improve scalability, making it (theoretically) closer and closer to the optimum: 4x on 4 cores.
I also want to note that we have improved performance for small NRHS counts (4-8) as well. We will notify you of the MKL version in which the fix will be available. However, you should understand that a small NRHS is in general less efficient than a relatively large NRHS.
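One simple way to check this on your side is to time the phase=33 call for increasing NRHS values, for example with MKL's dsecnd() timer (a rough sketch; solve_phase33_with_nrhs() here stands for whatever hypothetical wrapper in your code performs one phase=33 call with that many right-hand sides):

```c
#include <stdio.h>
#include "mkl.h"   /* dsecnd(), mkl_set_num_threads() */

/* Hypothetical wrapper around your phase=33 call for a given NRHS. */
extern void solve_phase33_with_nrhs(int nrhs);

void time_solve_vs_nrhs(void)
{
    const int nrhs_list[] = { 1, 4, 16, 64 };
    mkl_set_num_threads(4);                      /* the 4-core case discussed here */
    for (int i = 0; i < 4; ++i) {
        double t0 = dsecnd();
        solve_phase33_with_nrhs(nrhs_list[i]);
        double t = dsecnd() - t0;
        printf("nrhs = %2d : %f s total, %f s per RHS\n",
               nrhs_list[i], t, t / nrhs_list[i]);
    }
}
```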
Regards,
Konstantin
Dear Konstantin,
Thank you for your reply.
I want to ask one more question: how can the performance improve by more than 4 times (in our case 5.3 times) with 4 cores? Shouldn't it be at most 4 times? My co-workers asked me about it, but I couldn't answer exactly. Can you explain the detailed reasons?
Best regards.
B. Hwang
It's possible because level-3 BLAS (matrix-matrix operations) is used in MKL in the many-RHS case, instead of the level-2 BLAS (matrix-vector operations) used for 1 RHS. It's known that level-3 efficiency can be almost 100% of the hardware peak, because floating-point operations dominate over memory operations. At the same time, level-2 BLAS (e.g. DGEMV) has about the same number of floating-point and memory operations and can be less efficient when not all the data resides in cache, due to the significant latency of memory operations.
Most likely, that is the reason for the super-scalability of the NRHS solve in some cases.
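The same effect can be seen outside PARDISO with plain CBLAS calls: doing NRHS matrix-vector products one by one versus one matrix-matrix product over all right-hand sides (an illustrative sketch only; the dimensions and data are arbitrary, and the two calls compute the same result):

```c
#include <stddef.h>
#include "mkl_cblas.h"

/* Level-2 vs level-3 version of the same arithmetic: nrhs separate
   DGEMV calls versus a single DGEMM call.  The DGEMM form reuses each
   element of A for all nrhs columns of B, so it is far less memory-bound. */
void level2_vs_level3(int n, int nrhs,
                      const double *A, const double *B, double *C)
{
    /* Level 2: c_k = A * b_k, one column at a time. */
    for (int k = 0; k < nrhs; ++k)
        cblas_dgemv(CblasColMajor, CblasNoTrans, n, n,
                    1.0, A, n, B + (size_t)k * n, 1,
                    0.0, C + (size_t)k * n, 1);

    /* Level 3: C = A * B for all right-hand sides in one call. */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, nrhs, n, 1.0, A, n, B, n, 0.0, C, n);
}
```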
Regards,
Konstantin
You mean it is because of the difference in memory efficiency between the BLAS routines?
That is, because we have to allocate and free memory on every solve in the 1-RHS case?
Have I understood that correctly?
Regards.
B. Hwang
>> That is, because we have to allocate and free memory on every solve in the 1-RHS case?
Not exactly. When we solve with 1 RHS, we treat the RHS as a vector and use matrix-vector (MV) operations for the forward and backward substitutions (the solve phase). In the N-RHS case, we treat the RHS as a matrix and therefore use matrix-matrix (MM) operations.
MM operations are more efficient than MV operations, theoretically (and practically) on modern computers, because memory operations are usually more expensive (time-consuming) than floating-point operations: each core has the same number of floating-point units, but memory bandwidth is limited and shared between the cores, which also share caches, and so on. An MM product consists of ~N^2 memory operations and ~N^3 floating-point operations, whereas MV has ~N^2 of each. So an MM product does not depend so much on memory operations (if the computation is implemented in an optimal way, as it is in MKL), while an MV product is limited by memory bandwidth to a much greater degree.
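To put illustrative numbers on that (assuming a dense N x N matrix in double precision with N = 10,000): a matrix-vector product moves about N^2 * 8 bytes ≈ 0.8 GB to perform about 2*N^2 = 2*10^8 floating-point operations, i.e. roughly 0.25 flops per byte, so memory bandwidth is the limit. A matrix-matrix product on operands of the same size performs about 2*N^3 = 2*10^12 floating-point operations, and with blocking each element is reused for many columns, so the floating-point units become the limit instead. The triangular solves inside PARDISO are not dense GEMM, but the same ratio argument applies.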
Regards,
Konstantin
Dear Konstantin,
I appreciate your cooperation, and I'm very happy with our experimental results. Your advice was very helpful for our work, and with it we were able to achieve our goal.
Thank you so much, Konstantin!
Best regards.
Bosun Hwang
Dear Bosun,
I also want to thank you for using MKL, for your active participation on the forum, and for reporting your problems to us! We're always ready to help!
Best regards,
Konstantin
Bosun Hwang, we made some improvements in 10.2 Update 7. Could you please check how it works on your side and let us know the results?
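If it's convenient, the MKL version the application is actually linked with can be printed with the version-string support routine (a minimal sketch, assuming the mkl_get_version_string() service function is available in your release):

```c
#include <stdio.h>
#include "mkl.h"   /* mkl_get_version_string() support function */

int main(void)
{
    char version[256];
    /* Fills 'version' with the full MKL version/build string. */
    mkl_get_version_string(version, (int)sizeof(version));
    printf("%s\n", version);
    return 0;
}
```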
--Gennady
