Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

Question about multiple RHS

Bosun_Hwang
Beginner
Dear All.

I integrated the multiple-RHS (mrhs) version of PARDISO
into our application.
I followed Kalinkin's instructions.
The solver ran normally and the results were correct.
But the problem is the solve time.
In phase=33, with a single RHS,
the solve time was 0.03450 s.
But with 4 RHS,
the solve time was 0.532629 s.
I expected almost the same run time,
however the 4-RHS case was about 15 times slower than
the single-RHS case.
I had heard that we can reduce run time by using multiple RHS.
I tried some other test cases, but the results were similar.

What overhead factors could cause such results?
Is it correct that PARDISO provides multithreading in the phase=33 solve step?
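
For reference, the figures quoted in this post can be put side by side with a few lines of arithmetic. This is only a sketch using the two timings reported above; the variable names are illustrative and nothing here calls PARDISO.

```python
# Timings quoted in the post (phase=33, in seconds).
t_single = 0.03450    # one call with a single right-hand side
t_multi = 0.532629    # one call with four right-hand sides
nrhs = 4

slowdown = t_multi / t_single      # 4-RHS call relative to the 1-RHS call
per_rhs = t_multi / nrhs           # cost per right-hand side inside the 4-RHS call
sequential = nrhs * t_single       # four separate 1-RHS calls, back to back

print(f"4-RHS call vs 1-RHS call: {slowdown:.1f}x")
print(f"per-RHS time in the 4-RHS call: {per_rhs:.4f} s")
print(f"four separate 1-RHS calls: {sequential:.4f} s")
print(f"4-RHS call vs four separate calls: {t_multi / sequential:.2f}x")
```

With these numbers, the single 4-RHS call is not only slower than one 1-RHS call, it is also several times slower than running four 1-RHS solves one after another, which is what makes the result surprising.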
28 Replies
Konstantin_A_Intel
Hello Bosun,
I'm glad that you achieved good performance, and I hope it will serve as a workaround for you for a while. As I mentioned, increasing NRHS should improve scalability, bringing it (theoretically) closer and closer to optimal: 4x for 4 cores.
I also want to note that we have improved performance for small values of NRHS (4-8). We will let you know in which MKL version the fix will be available. However, you should understand that a small NRHS is (in general) less efficient than a relatively large NRHS.
Regards,
Konstantin
Bosun_Hwang
Beginner

Dear Konstantin

Thank you for your reply.
And I want to ask one more question.
How can the performance improve by more than
4 times (5.3 times in our case) with 4 cores?
Shouldn't it be below 4 times?
My co-workers asked me about it, but I couldn't give an exact answer.
Can you explain the detailed reasons?

Best regards.
B. Hwang
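
To make the question concrete, here is a small sketch of the usual definition of parallel efficiency (speedup divided by core count), applied to the 5.3x figure mentioned above. The function name is illustrative; an efficiency above 1 is what is normally called superlinear speedup.

```python
def parallel_efficiency(speedup, cores):
    """Speedup per core; 1.0 is ideal linear scaling."""
    return speedup / cores

speedup = 5.3   # figure reported in this post
cores = 4

eff = parallel_efficiency(speedup, cores)
print(f"efficiency on {cores} cores: {eff:.1%}")
print("superlinear" if eff > 1.0 else "at or below linear")
```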

Konstantin_A_Intel
It's possible because MKL uses level-3 BLAS (matrix-matrix operations) when there are many RHS, instead of the level-2 BLAS (matrix-vector operations) used for 1 RHS. Level-3 routines are known to reach almost 100% of hardware peak, because floating-point operations dominate over memory operations. Level-2 BLAS (e.g. DGEMV), by contrast, has about the same number of floating-point and memory operations, and can be less efficient when not all data resides in cache, because of the significant latency of memory operations.
Most likely, this is the reason for the super-scalability of the NRHS solve in some cases.
Regards,
Konstantin
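
The ratio described above can be estimated with a rough counting model. The sketch below assumes a dense n x n matrix applied to an n x nrhs block of right-hand sides, and assumes every element is moved to or from memory exactly once (ideal cache reuse). PARDISO actually works on sparse factors, so this is only a qualitative illustration; `flops_per_element` is a hypothetical helper, not an MKL routine.

```python
def flops_per_element(n, nrhs):
    """Floating-point operations per matrix/vector element moved.

    Model: Y = A @ X with A of size n x n and X of size n x nrhs.
    nrhs = 1 corresponds to a matrix-vector product (level-2 BLAS),
    nrhs = n to a square matrix-matrix product (level-3 BLAS).
    """
    flops = 2 * n * n * nrhs          # one multiply and one add per (A entry, column)
    moved = n * n + 2 * n * nrhs      # read A and X, write Y
    return flops / moved

n = 1000
for nrhs in (1, 4, 8, 64, n):
    print(f"nrhs={nrhs:5d}: {flops_per_element(n, nrhs):7.1f} flops per element moved")
```

In this model the work done per element moved grows roughly in proportion to nrhs while nrhs is much smaller than n, which matches the earlier remark that a small NRHS is less efficient than a larger one.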
Bosun_Hwang
Beginner
You mean it is because of the difference in memory efficiency between the BLAS levels?
That is, because we have to allocate and free memory on every solve in the 1-RHS case?
Did I understand that correctly?

Regards.
B. Hwang
Konstantin_A_Intel
>> That is, because we have to allocate and free memory every solve when 1-RHS?
Not exactly. When we solve with 1 RHS, we treat the RHS as a vector and use matrix-vector (MV) operations to compute the forward and backward substitutions (the solve phase). With N RHS, we treat the RHS as a matrix and therefore use matrix-matrix (MM) operations.
MM operations are more efficient than MV operations, in theory and in practice, on modern computers because memory operations are usually more expensive (time-consuming) than floating-point operations: each core has the same number of fp units, but memory bandwidth is often limited and shared between cores, which also share caches and so on. An MM product involves ~N^2 memory ops and ~N^3 fp ops, while an MV product involves ~N^2 of each. So the MM product depends much less on memory operations (if the computation is implemented optimally, as it is in MKL), whereas the MV product is limited much more by memory bandwidth.
Regards,
Konstantin
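
The difference between treating the RHS as a vector and as a matrix can be illustrated with a toy dense forward substitution. This is a sketch under simplifying assumptions, not PARDISO's internal algorithm: the helper names and the `loads` counter are hypothetical, and the counter only tallies how often an entry of the triangular factor is read.

```python
def forward_one_rhs(L, b, loads):
    """Solve L x = b for one vector b (L is lower triangular, a list of rows)."""
    n = len(L)
    x = [0.0] * n
    for i in range(n):
        s = b[i]
        for j in range(i):
            loads[0] += 1                 # L[i][j] is read for this one RHS
            s -= L[i][j] * x[j]
        loads[0] += 1                     # diagonal entry
        x[i] = s / L[i][i]
    return x

def forward_block_rhs(L, B, loads):
    """Solve L X = B for all columns of B at once (B is a list of rows)."""
    n, nrhs = len(L), len(B[0])
    X = [[0.0] * nrhs for _ in range(n)]
    for i in range(n):
        row = list(B[i])
        for j in range(i):
            lij = L[i][j]
            loads[0] += 1                 # read once, reused for every column
            for k in range(nrhs):
                row[k] -= lij * X[j][k]
        loads[0] += 1                     # diagonal entry
        X[i] = [v / L[i][i] for v in row]
    return X

L = [[2.0, 0.0, 0.0],
     [1.0, 3.0, 0.0],
     [4.0, 5.0, 6.0]]
B = [[2.0, 4.0, 6.0, 8.0],
     [7.0, 5.0, 3.0, 1.0],
     [1.0, 2.0, 3.0, 4.0]]

loads_single = [0]
columns = [forward_one_rhs(L, [row[k] for row in B], loads_single) for k in range(4)]
loads_block = [0]
X = forward_block_rhs(L, B, loads_block)
print("factor reads, one RHS at a time:", loads_single[0])
print("factor reads, all RHS together: ", loads_block[0])
```

Both variants perform the same floating-point work, but solving the four columns one at a time reads the factor 24 times in total, while the blocked version reads it 6 times and reuses each entry across all columns.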
Bosun_Hwang
Beginner
Dear Konstantin

I appreciate your cooperation,
and I'm very happy with our experimental results.
Your advice was very helpful for our work,
and so we could achieve our goal.

Thank you so much, Konstantin!


Best regards.
Bosun Hwang
Konstantin_A_Intel
Dear Bosun,
I also want to thank you for using MKL and for your active participation on the forum and reporting your problems to us! We're always ready to help!
Best regards,
Konstantin
Gennady_F_Intel
Moderator
Bosun Hwang, we made some improvements in 10.2 Update 7. Could you please check how it works on your side and let us know the results?
--Gennady