I need to solve a system of linear equations with three right-hand-side vectors. Initially, I was using the sequential version of MKL (linking against "libmkl_sequential.a") and solving for each rhs vector in turn:
dss_solve_real(DSS_handle, solOpt, rhs1, 1, x1);
dss_solve_real(DSS_handle, solOpt, rhs1 + numOfVars, 1, x1 + numOfVars);
dss_solve_real(DSS_handle, solOpt, rhs1 + 2*numOfVars, 1, x1 + 2*numOfVars);
where numOfVars is the number of variables.
Then I decided to have dss_solve_real solve for all rhs vectors in a single call, assuming this would give roughly a 3x improvement. So I linked the code against "libmkl_intel_thread.a" and used the following call:
dss_solve_real(DSS_handle, solOpt, rhs1, 3, x1);
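For reference, the single call assumes the three right-hand sides sit back to back in one buffer, so that vector k starts at offset k*numOfVars, which matches the pointer arithmetic in the three separate calls above. Here is a minimal sketch of that layout. It uses a hypothetical diagonal solver (solve_diag, not an MKL routine) as a stand-in, so it only illustrates the indexing and says nothing about DSS performance:

```c
#include <stddef.h>

/* Hypothetical stand-in for a multi-RHS solver (NOT the MKL DSS routine):
   solves D * x = b for a diagonal matrix D and nRhs right-hand sides
   stored one after another, so column k starts at offset k * n. */
static void solve_diag(const double *diag, size_t n,
                       const double *rhs, size_t nRhs, double *x)
{
    for (size_t k = 0; k < nRhs; ++k)
        for (size_t i = 0; i < n; ++i)
            x[k * n + i] = rhs[k * n + i] / diag[i];
}

/* One call per right-hand side, mirroring the three separate calls. */
static void solve_one_by_one(const double *diag, size_t n,
                             const double *rhs, size_t nRhs, double *x)
{
    for (size_t k = 0; k < nRhs; ++k)
        solve_diag(diag, n, rhs + k * n, 1, x + k * n);
}
```

With this layout, the batched call and the one-by-one calls fill the solution buffer identically, so any timing difference comes from how the library handles the batch, not from the data arrangement.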
To my surprise, the timing is very weird. The sequential version takes 0.548 sec, while solving for all rhs vectors at once takes 5.024 sec, which is almost 10 times slower than the sequential version.
I feel something is wrong here and that I may need to set some environment variables. Please let me know if you have had a similar experience.
Any help is appreciated.
What is the problem size? And did you measure only the dss_solve_real stage?
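One way to isolate the solve stage is to wrap only that call in a wall-clock timer. The sketch below is an assumption about how one might do it, not how the timings in this thread were taken. time_call and count_call are hypothetical names, and count_call stands in for the real solve call. It uses the C11 timespec_get rather than clock(), because on many systems clock() reports CPU time summed over all threads, which can make a threaded run look slower than it is.

```c
#include <time.h>

/* Wall-clock time in seconds, via the C11 timespec_get. */
static double wall_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical stand-in for the stage being timed: it only counts
   how many times it was invoked. Replace with the real solve call. */
static void count_call(void *ctx)
{
    int *calls = (int *)ctx;
    ++*calls;
}

/* Runs `work` `repeats` times and returns the shortest wall-clock
   duration in seconds, or -1.0 if repeats <= 0. */
static double time_call(void (*work)(void *), void *ctx, int repeats)
{
    double best = -1.0;
    for (int r = 0; r < repeats; ++r) {
        double t0 = wall_seconds();
        work(ctx);
        double dt = wall_seconds() - t0;
        if (best < 0.0 || dt < best)
            best = dt;
    }
    return best;
}
```

Taking the minimum over a few repeats reduces the influence of one-off effects such as thread start-up on the first call.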
The matrix is 3408 x 3408, and I am measuring only the execution time of dss_solve_real.
I set KMP_AFFINITY=verbose to monitor thread creation, and apparently MKL creates 4 threads. Here's the output:
OMP: Info #147: KMP_AFFINITY: Internal thread 4 bound to OS proc set {0,1,2,3,4,5,6,7}
OMP: Info #147: KMP_AFFINITY: Internal thread 6 bound to OS proc set {0,1,2,3,4,5,6,7}
OMP: Info #147: KMP_AFFINITY: Internal thread 5 bound to OS proc set {0,1,2,3,4,5,6,7}
OMP: Info #147: KMP_AFFINITY: Internal thread 7 bound to OS proc set {0,1,2,3,4,5,6,7}
Please note that I am calling a shared library from MATLAB, and threads 0...3 are created by MATLAB.
I tried binding the threads to specific CPUs, but that causes serious performance degradation.
Please let me know what I should do to fix the problem.
Thanks
As an update, I used export KMP_AFFINITY=verbose,granularity=fine,proclist=[0,1,2,3],explicit to bind each thread to a different processor. Here is the output:
OMP: Info #147: KMP_AFFINITY: Internal thread 4 bound to OS proc set {0,4}
OMP: Info #147: KMP_AFFINITY: Internal thread 5 bound to OS proc set {1,5}
OMP: Info #147: KMP_AFFINITY: Internal thread 6 bound to OS proc set {2,6}
OMP: Info #147: KMP_AFFINITY: Internal thread 7 bound to OS proc set {3,7}
The threaded version of the LU factorization takes 0.238 sec, while the sequential version takes 0.757 sec. So threading definitely works here. However, the threaded version of dss_solve_real takes 6 times longer than the sequential version, which is so weird.
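As a rough sanity check on the LU numbers, the speedup and parallel efficiency can be worked out as below. This assumes that the 4 MKL threads reported by the verbose output were all used for the factorization:

```latex
S_{\mathrm{LU}} = \frac{t_{\mathrm{seq}}}{t_{\mathrm{thr}}} = \frac{0.757}{0.238} \approx 3.2,
\qquad
E_{\mathrm{LU}} = \frac{S_{\mathrm{LU}}}{4} \approx 0.8
```

So the factorization scales reasonably under that assumption, which makes the slowdown in the solve stage stand out even more.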
