What performance should I expect from the following code?

Pouya_Z_ — Wed, 04 Sep 2013 22:35:35 GMT

Consider the following two pieces of code:

/* Perform LU factorization and store in DSS_handle */
for (k = 0; k < N; k++) {
    gettimeofday(&stTime, NULL);
    // DSS solver options
    MKL_INT solOpt = (MKL_DSS_DEFAULTS | MKL_DSS_REFINEMENT_OFF) | MKL_DSS_TRANSPOSE_SOLVE;
    MKL_INT nRhs = 3;
    dss_solve_real(DSS_handle, solOpt, bufferRHS, nRhs, bufferX3);
    dssSolCnt++;
    gettimeofday(&endTime, NULL);
    dssSolTime += (double)(endTime.tv_sec*1000000 + endTime.tv_usec - stTime.tv_sec*1000000 - stTime.tv_usec);
    /* Do some other things */
}

For this code, dssSolTime, which represents the time required to perform the forward and backward solves, is 19.87 sec for a 3408 x 3408 matrix.

Now, if I do the same calculations sequentially using the following code,

/* Perform LU factorization and store in DSS_handle */
for (k = 0; k < N; k++) {
    gettimeofday(&stTime, NULL);
    // DSS solver options
    MKL_INT solOpt = (MKL_DSS_DEFAULTS | MKL_DSS_REFINEMENT_OFF) | MKL_DSS_TRANSPOSE_SOLVE;
    MKL_INT nRhs = 1;
    dss_solve_real(DSS_handle, solOpt, bufferRHS, nRhs, bufferX3);
    dss_solve_real(DSS_handle, solOpt, bufferRHS+numOfEqs, nRhs, bufferX3+numOfEqs);
    dss_solve_real(DSS_handle, solOpt, bufferRHS+2*numOfEqs, nRhs, bufferX3+2*numOfEqs);
    dssSolCnt++;
    gettimeofday(&endTime, NULL);
    dssSolTime += (double)(endTime.tv_sec*1000000 + endTime.tv_usec - stTime.tv_sec*1000000 - stTime.tv_usec);
    /* Do some other things */
}

it completes the computations much faster, and dssSolTime is 2.04 sec for the same matrix (almost 10 times faster than when I ask dss_solve_real to solve for all right-hand-side vectors at once).

I assumed that dss_solve_real is smart enough to create three threads and solve for all three right-hand-side vectors simultaneously. I therefore expected the first code to be about three times faster than the second. The huge performance degradation implies that I may be missing something. I would appreciate it if you could let me know whether dss_solve_real can solve for three right-hand-side vectors in parallel. Please also let me know what I should logically expect from these two codes and which one should be faster.

Thanks

Performance of the test cases

SergeyKostrov — Fri, 06 Sep 2013 13:14:39 GMT

Performance of the test cases depends on the Intel instruction set selected during CPU dispatching (mkl_core.dll -> mkl_rt.dll -> some MKL CPU dispatching DLL). You have not provided any details about your OS and hardware.
