- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

I am using Intel MKL library to solve a system of linear equations (A*x = b) with multiple right-hand side (rhs) vectors. The rhs vectors are generated asynchronously and through a separate routine and therefore, it is not possible to solve them all at once.

In order to expedite the program, a multi-threaded program is used where each thread is responsible for solving a single rhs vectors. Since the matrix A is always constant, LU factorization should be performed once and the factors are used subsequently in all threads. So, I factor A using following command

dss_factor_real(handle, opt, data);

and pass the handle to the threads to solve the problems using following command:

dss_solve_real(handle, opt, rhs, nRhs, sol);

However, I found out that it is not thread-safe to use the same handle in several instances ofdss_solve_real. Apparently, for some reason, MKL library changes handle in each instance which creates race condition. I read the MKL manual but could not find anything relevant. Since it is not logical to factorize A for each thread, I am wondering if there is any way to overcome this problem and use the same handle everywhere.

Thanks in advance for your help

Link Copied

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

How are you managing the threads, or are you letting MKL manage them? Note that, for a dense *n* X *n* matrix, the L-U factorization costs roughly the same work as subsequently computing the solutions for *n* right hand sides. The subdivision of work among multiple threads is different in nature for the factorization and solution phases.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Thanks for your reply.

The threads are managed by me. In order to understand what I'm doing, consider following two lines:

dss_solve_real(handle, opt, rhs1, nRhs, sol2);

dss_solve_real(handle, opt, rhs2, nRhs, sol2);

These lines are using same handle (which represent the LU factorization for the same matrix) to solve the forward/backward substitution for two rhs vectors (rhs1 and rhs2) and the results are supposed to be saved in (sol1 and sol2).

Unfortunately, executing these lines simultaneously in two threads ruins everything. However, if I use two handles, everything is fine. What I do not understand is why dss_solve_real changes handle data. Please correct me if I am wrong. Thanks

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

**dss_solve_real**( handle, opt, rhs1, nRhs,

**sol2**); >> >>

**dss_solve_real**( handle, opt, rhs2, nRhs,

**sol2**); >> >>These lines are using same handle (which represent the LU factorization for the same matrix) to solve the forward/backward >>substitution for two rhs vectors (rhs1 and rhs2) and the results are supposed to be saved in (

**sol1**and

**sol2**). This is simply to let you know that

**sol2**is used in both cases according to your example.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Sorry, but it was just a typo. The real code looks like:

dss_solve_real(handle, opt, rhs1, nRhs, sol1);

dss_solve_real(handle, opt, rhs2, nRhs, sol2);

I checked the dss_solve_real with a simple code and found out that it is not possible to use same handle in two instances of dss_solve_real. I do not know what to do as it does not make sense to perform same LU factorization several times.

Any help is appreciated.

Thanks

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Hi Pouya,

How do you pass the handle to the sub-pthread, as shared variable or locate varaible? The handle should contains the information of dss factor and dss solve internal result. So it is expected to be changed after one dss solver call. But it seems if for reuse in different thread, we need add more function, like dss_handle_copy.

Best Regards,

Ying

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Ying H (Intel) wrote:

Hi Pouya,

How do you pass the handle to the sub-pthread, as shared variable or locate varaible? The handle should contains the information of dss factor and dss solve complete. So it is expected to be changed after one dss solver. But it seems if for reuse in different thread, we need add more function, like dss_handle_copy.

Best Regards,

Ying

Hi Ying,

Thanks for your comment.

I pass the handle to threads as shared variables. In fact, I have defined following struct:

typedef struct {

double* rhsVector;

size_t lenRHSVector;

_MKL_DSS_HANDLE_t handle;

} threadData_t;

and for each thread, I pass the thread data as:

threadData_t threadData[4];

for(k = 0; k < 4; k++){

double* rhsVector;

size_t lenRHSVector;

threadData

threadData

threadData

}

As you can see, I am passing the same handle to all threads and this is what creates problem.

In order to overcome the problem, I was looking for a function to make a copy of handle and pass the copy of handle to each thread. However, I could not find such function. So, I think it would be a good idea if Intel can add a dss_handle_copy function to its MKL library.

Thanks for your time and attention.

Pouya

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Hi Pouya,

I get reply from MKL DSS manager.

PARDISO/DSS handle is not thread safe, so in general it cannot be used in the way. Moreover, copying handle will not help – it will point to the same internal data structures with the same result as using the same handle in different threads (or maybe even worse, some crashes may happen).

There is a potential workaround, however. The user might try to use PARDISO API without iterative refinement (DSS API always has 2 iterative refinement steps). This might work although there are some ‘bad’ cases (like ill-conditioned positive definite matrices), when iterative refinement will be turned on anyway and this approach will fail the same way as described on the forum. However, if it works, it would be working the way the user wants it to work.

Another workaround may consist of modifying the algorithm a bit. As soon as the code generates the first rhs, it calls DSS solve while continue gathering the subsequent rhs in a temporary buffer. When the first solve step is done, it calls DSS solve again with the rhs gathered by that time in the buffer and so on. This approach would require some thread balancing work, but might work reasonably well in some cases and it is guaranteed to work correctly unlike the first workaround.

In any case, it would be of value for us to know the details of the problem that results in asynchronous generation of rhs. If it really requires asynchronous rhs generation, a feature request can be generated to support this case.

Best Regards,

Ying

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Ying H (Intel) wrote:

Hi Pouya,

I get reply from MKL DSS manager.

PARDISO/DSS handle is not thread safe, so in general it cannot be used in the way. Moreover, copying handle will not help – it will point to the same internal data structures with the same result as using the same handle in different threads (or maybe even worse, some crashes may happen).

There is a potential workaround, however. The user might try to use PARDISO API without iterative refinement (DSS API always has 2 iterative refinement steps). This might work although there are some ‘bad’ cases (like ill-conditioned positive definite matrices), when iterative refinement will be turned on anyway and this approach will fail the same way as described on the forum. However, if it works, it would be working the way the user wants it to work.

Another workaround may consist of modifying the algorithm a bit. As soon as the code generates the first rhs, it calls DSS solve while continue gathering the subsequent rhs in a temporary buffer. When the first solve step is done, it calls DSS solve again with the rhs gathered by that time in the buffer and so on. This approach would require some thread balancing work, but might work reasonably well in some cases and it is guaranteed to work correctly unlike the first workaround.

In any case, it would be of value for us to know the details of the problem that results in asynchronous generation of rhs. If it really requires asynchronous rhs generation, a feature request can be generated to support this case.

Best Regards,

Ying

Dear Ying,

Thanks for your reply. In the meanwhile, I tried the solutions offered by DSS manager. I turned off Refinement but it did not solve the problem. Even with MKL_DSS_REFINEMENT_OFF, I am getting random results and sometimes it gives the same results in all threads for different rhs vectors.

For the second option, I am currently buffering the rhs vectors. But if I want to buffer the rhs vectors to solve them simultaneously, that requires considerable amount of effort to create an optimized code. So, this one is my last option.

FYI, I am currently creating three handles separately and in parallel, which may sound funny. The time consumed for LU factorization is not of great concern. However, the main challenge is that LU factorizations fill up L3 cache very quickly which results in the performance reduction. Considering this problem, a simple dss_handle_copy will not solve the problem since there will be redundent copies of handle in the cache. If there was a way to force dss_solve_real to save the results directly into solution vector (and not changing handle), that would definitely solve the problem.

Again thanks for your comments.

Pouya

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Hi Pouya,

I wonder what kind of problem you is trying to solve if it fits L3 cache of size of a few MB. ( As i undertand, you mentioned, L3 cache are filled up very quickly, do you mean computation is not bottleneck, but the memory wrote, right?). We are also interested in how the rhs is produced asynchronously. Could you please provide us a test case?

if the code is private, you can send me a message.

Best Regards,

Ying

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page