I integrated the multiple-RHS (mrhs) version of PARDISO into our application, following Kalinkin's instructions.

The solver ran normally and the results were correct.

The problem is the solve time.

In phase=33, with a single RHS, the solve time was 0.03450 s.

With 4 RHS, the solve time was 0.532629 s.

I expected almost the same run time, but the 4-RHS case was about 15 times slower than the single-RHS case.

I had heard that using multiple RHS can reduce run time.

I tested some other cases, but the results were similar.

What overhead factors could cause such results?

Is it correct that PARDISO provides multithreading in the phase=33 solve step?

Dear Konstantin,

The MKL version is the latest one.

The OS is Redhat_AS4_U7.

The CPU is a 4-core Intel Xeon at 3 GHz.

Memory is 64 GB.

Thank you.

Regards.

B. Hwang

We have many cases, ranging from 10000 to millions of equations.

For this report I tested two cases, with 10000 and 150000 equations.

Our application solves repeatedly: we perform reordering and factorization only once, and then solve the equations again and again with a changing right-hand side.

Solve time is therefore critical in our application, and we want to reduce the run time of the solve phase.

As I mentioned above, I used 4 RHS per solve because of memory capacity, so I set a new batch of 4 RHS in every solve phase.

Apart from this, everything is the same as in the single-RHS case.

Regards.

B. Hwang
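
The workflow described in this post (factor once, then solve repeatedly with fresh batches of right-hand sides) can be illustrated with a small self-contained sketch. This is not the PARDISO API: a tiny dense Cholesky factorization stands in for the reordering and factorization phases, and `solve_batch` stands in for a phase=33 call with several RHS. The matrix, the helper names, and the batch size of 4 are assumptions chosen only for illustration.

```python
import math


def cholesky(a):
    """Return the lower-triangular factor l with a = l * l^T (a must be SPD)."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = math.sqrt(s) if i == j else s / l[j][j]
    return l


def solve_batch(l, rhs_batch):
    """Solve (l * l^T) x = b for every b in rhs_batch, reusing the factor l."""
    n = len(l)
    solutions = []
    for b in rhs_batch:
        # Forward substitution: l y = b
        y = [0.0] * n
        for i in range(n):
            y[i] = (b[i] - sum(l[i][k] * y[k] for k in range(i))) / l[i][i]
        # Backward substitution: l^T x = y
        x = [0.0] * n
        for i in reversed(range(n)):
            x[i] = (y[i] - sum(l[k][i] * x[k] for k in range(i + 1, n))) / l[i][i]
        solutions.append(x)
    return solutions


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Small symmetric positive definite matrix (illustrative only).
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]

# Done once: plays the role of reordering and factorization.
L = cholesky(A)

# Ten right-hand sides, solved 4 at a time: plays the role of repeated
# solve-phase calls with a new batch of RHS each time.
all_rhs = [[float(i + 1), 0.0, 1.0] for i in range(10)]
all_x = []
for batch in batches(all_rhs, 4):
    all_x.extend(solve_batch(L, batch))
```

The point of the pattern is that the cost of `cholesky` is paid once, so the total run time is dominated by how efficiently each batch is solved.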

My answers are below.

1) We linked against the libmkl_intel_thread library, so reordering and factorization ran faster than with a single thread.

2) I set OMP_NUM_THREADS=4, because our system has 4 cores.

3) The matrix is symmetric positive definite.

I also tried switching the threading library to libmkl_sequential.

The result was that the single-threaded build (libmkl_sequential) is much faster than the multithreaded build (libmkl_intel_thread) in phase=33.

I don't understand this behavior.

Thanks.

Regards.

B. Hwang

Dear Fedorov,

It is impossible for us to provide the test set.

But I'm sure this is not a problem with the test cases.

The test cases are just simple symmetric positive definite matrices that have been verified to work correctly with another solver and with PARDISO in single-RHS mode.

As you have already read in the posts above, everything ran normally.

The problem is the speed of the solve phase (phase=33) in multiple-RHS mode.

Have you tried this kind of usage?

How did it go? Was MRHS faster than single RHS?

Thank you.

Regards.

B. Hwang

Hi Bosun,

Thanks for raising this problem. The issue has been submitted to our
internal development tracking database for further investigation. We will
inform you once an update becomes available.

Here is a bug tracking number for your reference: DPD200190971

Regards, Gennady

Hi Konstantin.

Thank you for your reply.

I'm afraid there is a problem in the solver.

I hope you can resolve it ASAP.

I also have a question about increasing the number of RHS.

Our system has only 4 cores.

On this system, is it meaningful to use 16 to 32 RHS?

Thank you.

Regards.

B. Hwang

I tried your advice.

But the N > K case is also slower than the 1-RHS case.

I hope the problem will be resolved ASAP.

Thank you.

Regards.

B. Hwang

Of course, I understood your explanation.

What I mean is that the total run time with N RHS was longer than N separate 1-RHS solves.

That is, the run time per RHS was not improved by increasing the number of RHS.

In our example, the 16-RHS test case, the solve times were:

1 RHS - 0.023120 s

16 RHS - 0.479048 s

where 0.023120 * 16 = 0.370 s

This means that one 16-RHS solve is slower than 16 separate 1-RHS solves.

Regards.

B. Hwang
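
The comparison in this post can be checked with a few lines of arithmetic. The sketch below uses only the two timings quoted above; the variable names are illustrative.

```python
t_single = 0.023120  # seconds for one solve with 1 RHS (from the post)
t_batch = 0.479048   # seconds for one solve with 16 RHS (from the post)
nrhs = 16

# Cost of handling the same 16 right-hand sides one at a time.
t_sequential = t_single * nrhs

# Effective cost per right-hand side inside the 16-RHS solve.
per_rhs_batch = t_batch / nrhs

# Values above 1 mean the batched solve is slower than 16 single solves.
slowdown = t_batch / t_sequential

print(f"16 x 1-RHS solves : {t_sequential:.5f} s")
print(f"1 x 16-RHS solve  : {t_batch:.5f} s")
print(f"per-RHS, batched  : {per_rhs_batch:.5f} s")
print(f"batched / separate: {slowdown:.2f}")
```

With these numbers the batched solve comes out roughly 1.3 times slower than the sixteen separate solves, which is the opposite of what batching is supposed to deliver.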

Our system is Intel Xeon 3GHz x2 c2 x86_64.

Is that what you wanted to know?

I also have a question.

We have been using the TAUCS solver as our computation engine.

Even though PARDISO was not running multithreaded in phase=33, single-process PARDISO is more than twice as fast as the TAUCS solver.

Could you tell me the reason? Is it because of an optimized compiler, or some special algorithms?

A simple explanation would be appreciated.

Best regards.

B. Hwang

The CPU is

Intel Xeon CPU 5160 @ 3.00GHz

Thank you.

I found out how I can improve the performance.

I increased the number of RHS from 16 to 1024 in powers of 2.

Above 32 RHS the speed started to go up, and beyond 256 RHS the speed-up saturated.

I think the optimum number of RHS is 256 with 4 cores in our case.

In the end I got about 3 to 5 times faster performance than with 1 RHS.

Can you explain this behavior?

Is each core being used for multithreading?

And why couldn't the cases with fewer than 32 RHS get the speed-up?

Is it because of the bug we detected?

Thanks, Konstantin.

Regards.

B. Hwang
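
A sweep like the one described in this post (RHS counts from 16 to 1024 in powers of 2) can be organized with a small timing harness. The sketch below is only an outline under stated assumptions: `dummy_solve` is a hypothetical placeholder that must be replaced by the application's real solve call, and the helper names and the repeat count are not from the thread.

```python
import time


def rhs_counts(start=16, stop=1024):
    """Yield powers of two from start to stop, inclusive."""
    n = start
    while n <= stop:
        yield n
        n *= 2


def time_per_rhs(solve, nrhs, repeats=3):
    """Return the best wall-clock time of solve(nrhs), divided by nrhs."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        solve(nrhs)
        best = min(best, time.perf_counter() - t0)
    return best / nrhs


def dummy_solve(nrhs):
    """Hypothetical placeholder workload; swap in the real solve-phase call."""
    total = 0.0
    for _ in range(nrhs * 1000):
        total += 1.0
    return total


# Per-RHS time for each batch size; the smallest value marks the sweet spot.
results = {n: time_per_rhs(dummy_solve, n) for n in rhs_counts()}
best_n = min(results, key=results.get)

for n, t in results.items():
    print(f"nrhs={n:5d}  time per RHS = {t:.3e} s")
print("best batch size in this run:", best_n)
```

Comparing time per RHS rather than total solve time makes it easy to see where the speed-up begins and where it saturates.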
