dss_solve_real takes more time to solve a linear system

Pouya_Z_ · ‎09-03-2013

I need to solve a system of linear equations with three right-hand-side (RHS) vectors. Initially, I was using the sequential version of MKL (linking against "libmkl_sequential.a") and solving for each RHS vector in turn:

dss_solve_real(DSS_handle, solOpt, rhs1, 1, x1);

dss_solve_real(DSS_handle, solOpt, rhs1 + numOfVars, 1, x1 + numOfVars);

dss_solve_real(DSS_handle, solOpt, rhs1 + 2*numOfVars, 1, x1 + 2*numOfVars);

where numOfVars is the number of variables.
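
The offsets in the three calls above imply that the right-hand sides sit back to back in one buffer, with vector k starting at rhs1 + k*numOfVars. A minimal sketch of that layout in plain C follows. pack_rhs and rhs_entry are hypothetical helper names, not part of MKL, and no MKL call is made here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not an MKL routine): copy nRhs separate vectors of
 * length n into one contiguous buffer so that vector k starts at
 * packed + k*n, matching the pointer offsets used in the calls above. */
void pack_rhs(double *packed, const double *const *vectors,
              size_t n, size_t nRhs)
{
    assert(packed != NULL && vectors != NULL);
    for (size_t k = 0; k < nRhs; ++k) {
        memcpy(packed + k * n, vectors[k], n * sizeof(double));
    }
}

/* Hypothetical helper: pointer to entry i of right-hand side k
 * inside the packed buffer. */
double *rhs_entry(double *packed, size_t n, size_t k, size_t i)
{
    assert(packed != NULL && i < n);
    return packed + k * n + i;
}
```

With a buffer packed this way, the same pointer can be passed either to three single-RHS calls with offsets or to one call that handles all three vectors.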

Then I decided to have dss_solve_real solve for all RHS vectors at once, assuming this would give roughly a 3x speedup. So I linked the code against "libmkl_intel_thread.a" and used the following call:

dss_solve_real(DSS_handle, solOpt, rhs1, 3, x1);

To my surprise, the timing is very weird. The sequential version takes 0.548 sec, while solving for all RHS vectors at once takes 5.024 sec, which is almost 10 times longer than the sequential version.

I feel something is wrong here, and I may need to set some environment variables. Please let me know if you have had a similar experience.

Any help is appreciated.

Gennady_F_Intel · ‎09-03-2013

What is the problem size? And did you measure only the dss_solve_real stage?

Pouya_Z_ · ‎09-03-2013

The matrix is 3408 x 3408, and I am measuring only the execution time of dss_solve_real.
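
One way to time only the solve stage is to read a wall-clock timer immediately before and after the call. The sketch below assumes a POSIX system that provides clock_gettime. wall_seconds is a hypothetical helper, not an MKL routine. Wall-clock time is used because a CPU-time counter such as clock() can add up the time spent in all threads on Linux, which may make a threaded run look slower than it really is.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L
#endif
#include <time.h>

/* Hypothetical helper: current monotonic wall-clock time in seconds.
 * Assumes a POSIX system where CLOCK_MONOTONIC is available. */
double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Intended use around the call being measured:
 *
 *     double t0 = wall_seconds();
 *     dss_solve_real(DSS_handle, solOpt, rhs1, 3, x1);
 *     double elapsed = wall_seconds() - t0;
 */
```

Repeating the measurement a few times and keeping the smallest value helps separate one-time costs, such as thread creation, from the steady-state solve time.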

I set KMP_AFFINITY=verbose to monitor thread creation, and apparently MKL creates 4 threads. Here is the output:

OMP: Info #147: KMP_AFFINITY: Internal thread 4 bound to OS proc set {0,1,2,3,4,5,6,7}
OMP: Info #147: KMP_AFFINITY: Internal thread 6 bound to OS proc set {0,1,2,3,4,5,6,7}
OMP: Info #147: KMP_AFFINITY: Internal thread 5 bound to OS proc set {0,1,2,3,4,5,6,7}
OMP: Info #147: KMP_AFFINITY: Internal thread 7 bound to OS proc set {0,1,2,3,4,5,6,7}

Please note that I am calling a shared library from MATLAB and threads 0...3 are created by MATLAB.

I tried binding the threads to specific CPUs, but that seriously degrades performance.

Please let me know what I should do to fix the problem.

Thanks

Pouya_Z_ · ‎09-03-2013

As an update, I used export KMP_AFFINITY=verbose,granularity=find,proclist=[0,1,2,3],explicit to bind each thread to a different processor. Here is the output:

OMP: Info #147: KMP_AFFINITY: Internal thread 4 bound to OS proc set {0,4}
OMP: Info #147: KMP_AFFINITY: Internal thread 5 bound to OS proc set {1,5}
OMP: Info #147: KMP_AFFINITY: Internal thread 6 bound to OS proc set {2,6}
OMP: Info #147: KMP_AFFINITY: Internal thread 7 bound to OS proc set {3,7}

The threaded LU factorization takes 0.238 sec, while the sequential version takes 0.757 sec, so threading definitely works here. However, the threaded dss_solve_real takes 6 times longer than the sequential version, which is very weird.