Bosun_Hwang

Beginner

08-17-2010
11:20 PM

258 Views

Question about multiple RHS

I integrated the multiple-RHS (mrhs) version of PARDISO into our application, following Kalinkin's instructions.

The solver ran normally and the results were correct. The problem is the solve time.

In phase=33, with a single RHS, the solve time was 0.03450 s. With 4 RHS, it was 0.532629 s.

I expected almost the same run time, but the 4-RHS case was about 15 times slower than the single-RHS case.

I had heard that using multiple RHS can reduce run time. I tested some other cases, but the results were similar.

What overhead factors could cause these results?

Is it correct that PARDISO supports multithreading in the phase=33 solve step?

28 Replies

Konstantin_A_Intel

Employee

08-17-2010
11:50 PM

212 Views

Could you provide a bit more detail about your configuration?

Namely, what are your MKL version, OS, and processor? How many cores are in your system? It would also help to know the number of equations in your problem.

With this information we will be able to reproduce the situation and give you appropriate advice.

Thanks,

Konstantin

Bosun_Hwang

Beginner

08-18-2010
12:08 AM

212 Views

The MKL version is the latest version.

The OS is Redhat_AS4_U7.

The CPU is a 4-core Intel Xeon at 3 GHz.

The memory is 64 GB.

Thank you.

Regards.

B. Hwang

Konstantin_A_Intel

Employee

08-18-2010
12:20 AM

212 Views

I'll make a couple of runs of a similar task and will update you.

Regards,

Konstantin

Bosun_Hwang

Beginner

08-18-2010
12:43 AM

212 Views

We have many cases, ranging from 10,000 to millions of equations. For this test I used two cases, with 10,000 and 150,000 equations.

Our application uses an iterative solve method: we do the reordering and factorization only once, and then solve the equations repeatedly with changing right-hand sides.

Solve time is therefore critical in our application, and we want to reduce the run time of the solve phase.

As I mentioned above, I process 4 RHS at a time because of memory capacity, so I set a new group of 4 RHS in every solve phase.

Apart from this, everything is the same as in the single-RHS case.
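The workflow described here (factor once, then solve many right-hand sides in groups of 4) can be sketched as follows. This is only an illustration in plain Python: `factorize`, `solve_batch`, and `solve_all` are hypothetical names, and a diagonal matrix stands in for the real sparse system so that the example runs without MKL. It is not PARDISO code.

```python
def factorize(diagonal):
    """Stand-in for reordering + factorization, done only once.

    Assumption: the matrix is diagonal, so "factorizing" is just inverting
    each diagonal entry. A real application would call the solver here.
    """
    return [1.0 / d for d in diagonal]


def solve_batch(factor, batch):
    """Stand-in for one solve call that handles len(batch) right-hand sides."""
    return [[f * b for f, b in zip(factor, rhs)] for rhs in batch]


def solve_all(diagonal, rhs_vectors, batch_size=4):
    """Factor once, then solve all right-hand sides in groups of batch_size."""
    factor = factorize(diagonal)  # expensive step, performed a single time
    solutions = []
    for start in range(0, len(rhs_vectors), batch_size):
        batch = rhs_vectors[start:start + batch_size]  # last batch may be smaller
        solutions.extend(solve_batch(factor, batch))
    return solutions


if __name__ == "__main__":
    rhs = [[2.0, 4.0] for _ in range(10)]
    print(solve_all([2.0, 4.0], rhs, batch_size=4))
```

The batch size is the only knob that changes between the single-RHS and multiple-RHS runs; memory capacity limits how large it can be.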

Regards.

B. Hwang

Konstantin_A_Intel

Employee

08-18-2010
02:51 AM

212 Views

I have a few more questions so that our input conditions are aligned:

1) Did you link against libmkl_intel_thread library, not libmkl_sequential?

2) Did you set OMP_NUM_THREADS to any value?

3) Which type of matrix did you use in your test?

Thanks,

Konstantin

Bosun_Hwang

Beginner

08-18-2010
06:14 PM

212 Views

Below are my answers.

1) We linked against the libmkl_intel_thread library, so reordering and factorization ran faster than with a single thread.

2) I set OMP_NUM_THREADS=4, because our system has 4 cores.

3) A symmetric positive definite matrix.

I also just tried switching the threading library to libmkl_sequential.

The result was that the single-threaded run (libmkl_sequential) is much faster than the multithreaded run (libmkl_intel_thread) in phase=33.

I don't understand this situation.

Thanks.

Regards.

B. Hwang

Gennady_F_Intel

Moderator

08-18-2010
09:42 PM

212 Views

I think the fastest way to reproduce and resolve this issue is for you to provide us with the test case and the input data with which you are encountering this problem.

--Gennady

Bosun_Hwang

Beginner

08-18-2010
09:51 PM

212 Views

It is impossible to provide the test set.

But I'm sure it is not a problem with the test cases. They are just simple symmetric positive definite matrices, which have been verified as normal with another solver and with PARDISO in single-RHS mode.

As you have already read in the posts above, all operations were normal.

The problem is the speed of the solve phase (phase=33) in multiple-RHS mode.

Have you tried this kind of application?

How did it go? Was multiple-RHS mode faster than single RHS?

Thank you.

Regards.

B. Hwang

Konstantin_A_Intel

Employee

08-19-2010
09:52 PM

212 Views

Indeed, we reproduced the problem you described: the performance of the parallel solve phase with a few RHS is low. We will investigate the problem and try to fix it ASAP.

You may try to improve the performance of the parallel solve phase by increasing the number of RHS (if possible) to 16-32. I hope this will let you reduce the computation time per RHS.

Thank you,

Konstantin

Gennady_F_Intel

Moderator

08-19-2010
10:13 PM

212 Views

Thank you for raising this problem. The issue has been submitted to our internal development tracking database for further investigation, and we will inform you once a new update becomes available.

Here is a bug tracking number for your reference: DPD200190971

Regards, Gennady

Bosun_Hwang

Beginner

08-19-2010
10:19 PM

212 Views

Thank you for your reply.

I'm afraid it is indeed a problem. I hope you can resolve it ASAP.

I also have a question about increasing the number of RHS.

Our system has only 4 cores.

On this system, is it meaningful to use 16-32 RHS?

Thank you.

Regards.

B. Hwang

Konstantin_A_Intel

Employee

08-20-2010
02:57 AM

212 Views

Hi Bosun,

Of course, solving more RHS is meaningful, because the parallel solve algorithm (where N RHS are simply split across K threads) works better when N > K ("better" means that the computational time per RHS decreases). I would say that in this case the threading overhead is not significant, since each thread has more work to do.

So it's only a question of your algorithm: how many independent RHS does it have to solve at once? If it can be 16-32 or even more, please try it.
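To make the N versus K argument concrete, here is a small sketch of how N right-hand sides might be divided among K threads. The even block split and the name `split_rhs` are assumptions for illustration; the thread does not describe the exact partitioning MKL uses.

```python
def split_rhs(n_rhs, n_threads):
    """Return the number of right-hand sides each thread would receive.

    Assumption: an even block split, with any remainder handed out one
    per thread starting from thread 0.
    """
    base, extra = divmod(n_rhs, n_threads)
    return [base + (1 if t < extra else 0) for t in range(n_threads)]


if __name__ == "__main__":
    for n in (1, 4, 16, 32):
        print(n, split_rhs(n, 4))
```

With 4 threads, 4 RHS give each thread a single vector, while 32 RHS give each thread 8, so the fixed cost of starting and synchronizing threads is spread over more work per thread.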

Regards,

Konstantin

Bosun_Hwang

Beginner

08-20-2010
10:37 PM

212 Views

I tried your advice.

But the N > K case is also slower than the 1-RHS case.

I hope the problem will be resolved ASAP.

Thank you.

Regards.

B. Hwang

Konstantin_A_Intel

Employee

08-22-2010
01:03 AM

212 Views

What do you mean by "But, N > K case is also slower than 1 RHS case"? Do you mean that the time per RHS is larger with many RHS, i.e. time_1RHS < time_NRHS/N?

Of course, solving N RHS cannot be faster than solving 1 RHS, but the time per RHS (one right-hand side) is better.

Here's an example. I used a Linux server similar to yours and set OMP_NUM_THREADS=4:

1 RHS - 0.033 sec

32 RHS - 0.175 sec

In other words, in the second case I got 0.0055 sec per RHS, or ~6x scalability. Note that such good scalability was achieved because the more efficient level-3 BLAS is used in the implementation of the N-RHS solve phase (the N RHS vectors are treated as a matrix).
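The per-RHS comparison above is simple arithmetic, sketched here in Python using the two timings quoted in this post. The function names `time_per_rhs` and `scalability` are chosen for illustration only.

```python
def time_per_rhs(total_seconds, n_rhs):
    """Solve time attributed to a single right-hand side."""
    return total_seconds / n_rhs


def scalability(t_single, t_multi, n_rhs):
    """How many times cheaper one RHS is in the batched solve than alone.

    A value above 1 means batching pays off; below 1 means it hurts.
    """
    return t_single / time_per_rhs(t_multi, n_rhs)


if __name__ == "__main__":
    # Timings quoted in this post: 1 RHS in 0.033 s, 32 RHS in 0.175 s.
    per_rhs = time_per_rhs(0.175, 32)
    speedup = scalability(0.033, 0.175, 32)
    print(f"{per_rhs:.4f} s per RHS, {speedup:.1f}x")  # prints "0.0055 s per RHS, 6.0x"
```

The same two functions can be applied to any pair of measurements to check whether a batched solve is actually cheaper per right-hand side.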

Regards,

Konstantin

Bosun_Hwang

Beginner

08-22-2010
11:23 PM

212 Views

Of course, I understood your explanation.

I mean that the total run time with N RHS was slower than N separate 1-RHS solves.

That is, the run time per RHS did not improve as the number of RHS increased.

In our example, a 16-RHS test case, the solve times were:

1 RHS - 0.023120 s

16 RHS - 0.479048 s

where 0.023120 * 16 ≈ 0.370 s.

This means the 16-RHS solve is slower than 16 single-RHS solves.

Regards.

B. Hwang

Konstantin_A_Intel

Employee

08-22-2010
11:57 PM

212 Views

Hi Bosun,

OK, I'm glad that we are both using the same terminology: all is clear here now, thanks.

As I mentioned, my times are a bit better (with 16 and 32 RHS, at least, I observe scalability from 2x to 6x in comparison with 1 RHS).

Would you please specify the exact name of your processor (like Intel Xeon CPU 5160) so that I can reproduce the situation where NRHS=16 is slower than 1 RHS?

P.S. As Gennady mentioned, we are working on tracker DPD200190971 regarding the slow solve phase with a small number of RHS.

Regards,

Konstantin

Bosun_Hwang

Beginner

08-23-2010
01:58 AM

212 Views

Our system is Intel Xeon 3GHz x2 c2 x86_64.

Is that what you wanted to know?

And I have a question!

We used the TAUCS solver as our calculation engine.

Even though PARDISO was not running multithreaded in phase=33, single-process PARDISO is more than twice as fast as the TAUCS solver.

Could you tell me the reason? Is it because of an optimizing compiler, or some special algorithms?

Please give a simple explanation.

Best regards.

B. Hwang

Konstantin_A_Intel

Employee

08-23-2010
02:53 AM

212 Views

Could you send me (or attach) the output of this command?

dmesg | grep CPU | sort -u

Regarding the performance difference with TAUCS: I do not know the exact reason. It probably comes from the more efficient matrix-vector operations that are implemented in MKL BLAS and used in phase=33. But I'm not 100% sure.

Regards,

Konstantin

Bosun_Hwang

Beginner

08-23-2010
06:03 PM

212 Views

Hi Konstantin

The CPU is

Intel Xeon CPU 5160 @ 3.00GHz

Thank you.

Bosun_Hwang

Beginner

08-24-2010
08:45 PM

74 Views

I found out how I can improve the performance.

I raised the number of RHS from 16 to 1024 in powers of 2.

Above 32 RHS the speed went up, and beyond 256 RHS the speed-up saturated.

I think the optimum number of RHS is 256 with 4 cores in our case.

In the end I got about 3~5 times faster performance than with 1 RHS.
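A sweep like this one can be summarized by looking for the smallest number of RHS whose per-RHS time is close to the best observed. The sketch below assumes timings are stored as a mapping from RHS count to total solve seconds. The function names and the 5% tolerance are illustrative choices, and the timings in the example are hypothetical placeholders, not the measurements from this thread.

```python
def per_rhs_times(timings):
    """Map each RHS count to its solve time per right-hand side."""
    return {nrhs: total / nrhs for nrhs, total in timings.items()}


def saturation_point(timings, tolerance=0.05):
    """Smallest RHS count whose per-RHS time is within `tolerance` of the best."""
    per_rhs = per_rhs_times(timings)
    best = min(per_rhs.values())
    for nrhs in sorted(per_rhs):
        if per_rhs[nrhs] <= best * (1.0 + tolerance):
            return nrhs


if __name__ == "__main__":
    # Hypothetical placeholder timings in seconds, NOT data from this thread.
    example_timings = {
        16: 0.48, 32: 0.80, 64: 1.00, 128: 1.40,
        256: 1.60, 512: 3.15, 1024: 6.30,
    }
    print(saturation_point(example_timings))
```

Choosing the smallest count near the best per-RHS time keeps memory use down, which matters here because memory capacity was the original reason for limiting the batch size.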

Can you explain this situation?

Does each core operate with multithreading?

And why couldn't the cases with fewer than 32 RHS get a speed-up?

Is it because of the bug we detected?

Thanks, Konstantin

Regards.

B. Hwang
