Solved: Comparison between MKL_lapack and LAPACK provided with mac ox

Sunny_Beast · ‎08-17-2011

Dear support provider

I tried MKL_LAPACK and LAPACK provided with MAC-OX 10.6

here is the code:

#include
#include
#include
using namespace std;

/* Complex datatype */
struct _dcomplex { double re, im; };
typedef struct _dcomplex dcomplex;

/* ZGELSS prototype */
extern "C" { // LAPACK routines
void zgelss_(int*m, int* n, int* nrhs, dcomplex* a, int* lda, dcomplex* b, int* ldb, double* s, double* rcond, int* rank, dcomplex* work, int* lwork, double* rwork, int* info );}
/* Auxiliary routines prototypes */
extern void print_matrix( char* desc, int m, int n, dcomplex* a, int lda );
extern void print_double_vector( char* desc, int n, double* a );

/* Parameters */
#define N 4
#define NRHS 2
#define LDA N
#define LDB N

/* Main program */
int main() {
/* Locals */
int m = N, n = N, nrhs = NRHS, lda = LDA, ldb = 4, info, lwork = -1, rank;
double rcond = 0, rwork[20];
/* Local arrays */
double s[4];
dcomplex work[264];
dcomplex a[LDA*N] = {
{ 1.23, -5.50}, {-2.14, -1.12}, {-4.30, -7.10}, { 1.27, 7.29},
{ 7.91, -5.38}, {-9.92, -0.79}, {-6.47, 2.52}, { 8.90, 6.92},
{-9.80, -4.86}, {-9.18, -1.12}, {-6.51, -2.67}, {-8.82, 1.25},
{-7.32, 7.57}, { 1.37, 0.43}, {-5.86, 7.38}, { 5.41, 5.37}
};
dcomplex b[LDB*NRHS] = {
{ 8.33, -7.32}, {-6.18, -4.80}, {-5.71, -2.80}, {-1.60, 3.08},
{-6.11, -3.81}, { 0.14, -7.71}, { 1.41, 3.40}, { 8.54, -4.05}
};
/* Executable statements */
printf( " ZGELSS Example Program Results\\n" );
/* Solve the equations A*X = B */

for (int loop = 0 ; loop < 1000000; loop++)
zgelss_( &m, &n, &nrhs, a, &lda, b, &ldb, s, &rcond, &rank, work, &lwork, rwork, &info );

/* Check for the exact singularity */
if( info != 0 ) {
printf( "The diagonal element of the triangular factor of A,\\n" );
printf( "U(%i,%i) is zero, so that A is singular;\\n", info, info );
printf( "the solution could not be computed.\\n" );
exit( 1 );
}
/* Print solution */
print_matrix( "Solution", n, nrhs, b, ldb );
/* Print details of sv decomposition */
print_matrix( "Details of LU factorization", n, n, a, lda );
/* Print singular values */
print_double_vector( "Pivot indices", n, s );

cout << "work = " < exit( 0 );
} /* End of ZGESV Example */

/* Auxiliary routine: printing a matrix */
void print_matrix( char* desc, int m, int n, dcomplex* a, int lda ) {
int i, j;
printf( "\\n %s\\n", desc );
for( i = 0; i < m; i++ ) {
for( j = 0; j < n; j++ )
printf( " (%6.2f,%6.2f)", a[i+j*lda].re, a[i+j*lda].im );
printf( "\\n" );
}
}

/* Auxiliary routine: printing a vector of integers */
void print_double_vector( char* desc, int n, double* a ) {
int j;
printf( "\\n %s\\n", desc );
for( j = 0; j < n; j++ ) printf( " %6.2f", a );
printf( "\\n" );
}

I tried following compiler statements
1-->icpc main.cpp -o result_t -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -openmp -lpthread -DMKL_ILP64

RESULTS

ZGELSS Example Program Results

Solution
( 0.43, -0.90) ( -2.22, 0.84)
( -0.51, 0.18) ( 1.13, 1.21)
( 1.25, -0.43) ( -1.88, -0.43)
( 0.73, -0.33) ( -0.14, 0.17)

Details of LU factorization
( -1.00, -0.00) ( -0.00, -0.00) ( -0.00, -0.00) ( -0.00, -0.00)
( -0.00, -0.00) ( -1.00, -0.00) ( -0.00, 0.00) ( -0.00, 0.00)
( -0.00, -0.00) ( -0.00, -0.00) ( -1.00, -0.00) ( -0.00, 0.00)
( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 1.00, 0.00)

s
1.00 1.00 1.00 1.00

work = 1032
5.343u 0.009s 0:05.35 99.8% 0+0k 0+0io 0pf+0w

2.-->icpc main.cpp -o result_t -llapack

RESULTS
ZGELSS Example Program Results

Solution
( 1.03, -0.34) ( -1.89, 0.63)
( -0.13, -0.25) ( -0.31, 0.07)
( 0.25, -0.45) ( -1.48, 0.35)
( -1.39, -0.49) ( 0.43, 2.36)

Details of LU factorization
( 1.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00)
( -0.00, 0.00) ( -1.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00)
( -0.00, 0.00) ( -0.00, 0.00) ( -1.00, 0.00) ( 0.00, 0.00)
( -0.00, 0.00) ( -0.00, 0.00) ( -0.00, 0.00) ( -1.00, 0.00)

Pivot indices
1.00 1.00 1.00 1.00
work = 264
3.009u 0.001s 0:03.01 99.6% 0+0k 0+0io 0pf+0w

3.-->g++ main.cpp -o result_t -llapack

RESULTS

ZGELSS Example Program Results

Solution
( 1.03, -0.34) ( -1.89, 0.63)
( -0.13, -0.25) ( -0.31, 0.07)
( 0.25, -0.45) ( -1.48, 0.35)
( -1.39, -0.49) ( 0.43, 2.36)

Details of LU factorization
( 1.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00)
( -0.00, 0.00) ( -1.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00)
( -0.00, 0.00) ( -0.00, 0.00) ( -1.00, 0.00) ( 0.00, 0.00)
( -0.00, 0.00) ( -0.00, 0.00) ( -0.00, 0.00) ( -1.00, 0.00)

Pivot indices
1.00 1.00 1.00 1.00
work = 264
2.991u 0.000s 0:02.99 100.0% 0+0k 0+0io 0pf+0w

I found LAPACK with macox ~2 time faster

-->>Any how my querry is possible to link MACOX LAPACK and other MKL_libs, "specially mkl_vml" and develop a multithreaded application.

if so please guide me through Linking statement for doing so.

Konstantin_A_Intel · ‎08-18-2011

Hi Sunny,

As it has already been mentioned, 4x4 size of the matrix is too small to gain from internal LAPACK parallelization. I suggest you to parallelize the loop over LAPACK routine. Ichanged your example appropriately, just to demonstrated the way how it could be done in the real program (see attach).

Ihope the performance will be better for you (I used 2x6 cores system):

bash-4.1$ ./run.sh
---------------------sequential test
ZGELSS Example Program Results
Running on 1 threads.
time = 0.65653
---------------------parallel test
ZGELSS Example Program Results
Running on 12 threads.
time = 0.0881447
---------------------

bash-4.1$ cat run.sh
#!/bin/sh

export MKLROOT=WRITE_YOUR_PATH_TO_MKL_HERE/__release_lnx/mkl
export LD_LIBRARY_PATH=$MKLROOT/lib/intel64:$LD_LIBRARY_PATH

icpc -I$MKLROOT/include test.cpp -o test.exe -L$MKLROOT/lib/intel64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -openmp -lpthread

echo "---------------------sequential test"
export OMP_NUM_THREADS=1
./test.exe

echo "---------------------parallel test"
export OMP_NUM_THREADS=12
./test.exe
echo "---------------------"

bash-4.1$ cat test.cpp

#include
#include
#include
#include
#include

using namespace std;

/* Complex datatype */
struct _dcomplex { double re, im; };
typedef struct _dcomplex dcomplex;

/* Parameters */
#define N 4
#define NRHS 2
#define LDA N
#define LDB N
#define MAX_NTHR 12

/* Main program */
int main() {
/* Locals */
int m = N, n = N, nrhs = NRHS, lda = LDA, ldb = 4, info, lwork = -1, rank[MAX_NTHR];
double rcond = 0, rwork[20], sec;
/* Local arrays */
double s[4*MAX_NTHR];
dcomplex work[264*MAX_NTHR];
dcomplex a[LDA*N*MAX_NTHR] = {
{ 1.23, -5.50}, {-2.14, -1.12}, {-4.30, -7.10}, { 1.27, 7.29},
{ 7.91, -5.38}, {-9.92, -0.79}, {-6.47, 2.52}, { 8.90, 6.92},
{-9.80, -4.86}, {-9.18, -1.12}, {-6.51, -2.67}, {-8.82, 1.25},
{-7.32, 7.57}, { 1.37, 0.43}, {-5.86, 7.38}, { 5.41, 5.37}
};
dcomplex b[LDB*NRHS*MAX_NTHR] = {
{ 8.33, -7.32}, {-6.18, -4.80}, {-5.71, -2.80}, {-1.60, 3.08},
{-6.11, -3.81}, { 0.14, -7.71}, { 1.41, 3.40}, { 8.54, -4.05}
};
/* Executable statements */
printf( " ZGELSS Example Program Results\n" );
/* Solve the equations A*X = B */

for(int i=1;i for(int j=0;j a[i*LDA*N+j].re=a.re;
a[i*LDA*N+j].im=a.im;
}
for(int j=0;j b[i*LDB*NRHS+j].re=b.re;
b[i*LDB*NRHS+j].im=b.im;
}
}
sec = dsecnd();
sec = dsecnd();

int nthr = omp_get_max_threads();
if (nthr > MAX_NTHR) nthr=MAX_NTHR;
cout << "Running on " << nthr << " threads." << endl;
#pragma omp parallel num_threads(nthr)
{
#pragma omp for
for (int loop = 0 ; loop < 1000000; loop++)
{
int id=loop%nthr;
zgelss_( &m, &n, &nrhs, (MKL_Complex16 *)&(a[id*LDA*N]), &lda, (MKL_Complex16 *)&(b[id*LDB*N]), &ldb, &(s[id*4]), &rcond, &rank[id], (MKL_Complex16 *)&(work[id*264]), &lwork, rwork, &info );
}
}
sec = dsecnd()-sec;
cout << "time = " << sec << endl;
exit( 0 );
} /* End of ZGESV Example */

Regards,
Konstantin

View solution in original post

Gennady_F_Intel · ‎08-17-2011

Sunny,

the task size is extremely small.Please compare the performance with the real task size, let say then n >= 1000.

--Gennady

Sunny_Beast · ‎08-17-2011

Dear Gennady

The Application I am working on uses this task size only.
I belive MKL_LAPACK would be faster for bigger task sizes, I haven't tried it though;

I asked for VML specially because I have to use elemental math for LAGRE arrays for sizes up 2^48;
Also I find MKL_DFTI routines easier to use compared to fftw.

the code I posted is posted here will used as a part of larger application part of larger application.
but it has to calculate LSS for small task sizes only though upto 2^48 times.

So I think for my present scenario best performance can be obtained using

MKL_DFTI
MKL_VML
MKL_CBLAS

and MAC-OX LAPACK
and
ICPC {for vectorization of loops and parallelization}

any how my query still remains

Is posible to link these libraries to make a multitheded application.
and if so please guide through the linking statement

Sincerely

Sunny

Konstantin_A_Intel · ‎08-18-2011

Hi,

Once you've linked your application with MKL - result_t -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -openmp -lpthread - you can use any MKL functions from any domains like DFT, VML etc. I only recommend you to include "mkl.h"header into your program. And multithreading is also enabled by default as you linked with -lmkl_intel_thread.

One note: please do not use -DMKL_ILP64 flag since it assumes your integer data passed to MKL is 8-byte long (e.g. long long int). I see that your program uses plain int type so there's not need to specify -DMKL_ILP64.

Regards,
Konstantin

Victor_K_Intel1 · ‎08-18-2011

Sunny,

The code you provided is somewhat misleading. You call zgelss which is destined to solve a system of linear equations with rectangular matrix of coefficients and under assumption that the matrix may be rank deficient (please, check the manual). I am asking because prints after call have titles like "Details of LU...", "pivot indices", etc - I conclude you probably wanted to use something like zgesv.

Besides, the function was called without restoring the matrix. From description of the zgelss you can read that after cal the matrix is replaced with its right singular vectors. The vectors form an unitary matrix and further calls of the same function deal with unitary matrices. Moreover, as we actualy deal with sime kind of iterations it looks that those matrices converged to the unit matrix - see you print under title "Details of LU factorization" (situation may be more complicated - like you have not unit matrix but diagonal matrix with +-1 on the diagonal but this does not matter). This specificities may result in simplification of some computations. So, it may be the timing is not quite correct.

BTW, you printed singualr values of the last unitary matrix under the title "Pivot indices". That's right, singular values of a unitary matrix must be equal to 1.

Thanks,
Victor

Sunny_Beast · ‎08-18-2011

Hi Victor

I modified the example version of zgesv (link: http://software.intel.com/sites/products/documentation/hpc/mkl/lapack/mkl_lapack_examples/index.htm ) to calculate zgelss. And I apologize that I didn't changed the printing subroutines.

Any way--> I need to calculate the inverses of rank deficient matrices -- as at times , in my algorithm all the rows may not be filled and will have zero values making it singular or rank deficienet which I beleive can be taken care by routines which work on SVD. At present my program has the "bottel neck" at place where it has to calculate LSS. Also the part of program can't be auto parallelized as Algorithm has "Dependencies" which can't be taken care of.

I also tried the code on my original program "I can't post here"

Using icpc and MKL I was able to iprove the performances of vector Math calculations and DFTI; and found them esaier to use too.
for all those subroutines time improved by factor of three i.e from 10s to 3s;

But when I included the zgelss routine to it the perfomance actually decreased. ie. 35s(from gcc) to 52s(from icc -MKL) and I am compileng both GCC and ICC for 8byte integers (needed at few places in algorithm).
And Again--> task size is small but matrix array is quiet Long.

can you suggest some solution for this..

Sincerely
Thanks

Sunny

Sunny_Beast · ‎08-18-2011

dear Konstantin Arturov

even after removing -DMKL_ILP64 form compiler statement restuls are same..

time ./result_t
ZGELSS Example Program Results

Solution
( 0.43, -0.90) ( -2.22, 0.84)
( -0.51, 0.18) ( 1.13, 1.21)
( 1.25, -0.43) ( -1.88, -0.43)
( 0.73, -0.33) ( -0.14, 0.17)

Details of LU factorization
( -1.00, -0.00) ( -0.00, -0.00) ( -0.00, -0.00) ( -0.00, -0.00)
( -0.00, -0.00) ( -1.00, -0.00) ( -0.00, 0.00) ( -0.00, 0.00)
( -0.00, -0.00) ( -0.00, -0.00) ( -1.00, -0.00) ( -0.00, 0.00)
( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 1.00, 0.00)

Pivot indices
1.00 1.00 1.00 1.00
work = 1032
5.228u 0.009s 0:05.23 99.8% 0+0k 0+0io 0pf+0w

Sunny_Beast · ‎08-18-2011

FYI:

lwork = -1 in here
I run this code for calculating work size;

and replaced lwork to (int)work[0].re before getting the results..

Konstantin_A_Intel · ‎08-18-2011

Hi Sunny,

As it has already been mentioned, 4x4 size of the matrix is too small to gain from internal LAPACK parallelization. I suggest you to parallelize the loop over LAPACK routine. Ichanged your example appropriately, just to demonstrated the way how it could be done in the real program (see attach).

Ihope the performance will be better for you (I used 2x6 cores system):

bash-4.1$ ./run.sh
---------------------sequential test
ZGELSS Example Program Results
Running on 1 threads.
time = 0.65653
---------------------parallel test
ZGELSS Example Program Results
Running on 12 threads.
time = 0.0881447
---------------------

bash-4.1$ cat run.sh
#!/bin/sh

export MKLROOT=WRITE_YOUR_PATH_TO_MKL_HERE/__release_lnx/mkl
export LD_LIBRARY_PATH=$MKLROOT/lib/intel64:$LD_LIBRARY_PATH

icpc -I$MKLROOT/include test.cpp -o test.exe -L$MKLROOT/lib/intel64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -openmp -lpthread

echo "---------------------sequential test"
export OMP_NUM_THREADS=1
./test.exe

echo "---------------------parallel test"
export OMP_NUM_THREADS=12
./test.exe
echo "---------------------"

bash-4.1$ cat test.cpp

#include
#include
#include
#include
#include

using namespace std;

/* Complex datatype */
struct _dcomplex { double re, im; };
typedef struct _dcomplex dcomplex;

/* Parameters */
#define N 4
#define NRHS 2
#define LDA N
#define LDB N
#define MAX_NTHR 12

/* Main program */
int main() {
/* Locals */
int m = N, n = N, nrhs = NRHS, lda = LDA, ldb = 4, info, lwork = -1, rank[MAX_NTHR];
double rcond = 0, rwork[20], sec;
/* Local arrays */
double s[4*MAX_NTHR];
dcomplex work[264*MAX_NTHR];
dcomplex a[LDA*N*MAX_NTHR] = {
{ 1.23, -5.50}, {-2.14, -1.12}, {-4.30, -7.10}, { 1.27, 7.29},
{ 7.91, -5.38}, {-9.92, -0.79}, {-6.47, 2.52}, { 8.90, 6.92},
{-9.80, -4.86}, {-9.18, -1.12}, {-6.51, -2.67}, {-8.82, 1.25},
{-7.32, 7.57}, { 1.37, 0.43}, {-5.86, 7.38}, { 5.41, 5.37}
};
dcomplex b[LDB*NRHS*MAX_NTHR] = {
{ 8.33, -7.32}, {-6.18, -4.80}, {-5.71, -2.80}, {-1.60, 3.08},
{-6.11, -3.81}, { 0.14, -7.71}, { 1.41, 3.40}, { 8.54, -4.05}
};
/* Executable statements */
printf( " ZGELSS Example Program Results\n" );
/* Solve the equations A*X = B */

for(int i=1;i for(int j=0;j a[i*LDA*N+j].re=a.re;
a[i*LDA*N+j].im=a.im;
}
for(int j=0;j b[i*LDB*NRHS+j].re=b.re;
b[i*LDB*NRHS+j].im=b.im;
}
}
sec = dsecnd();
sec = dsecnd();

int nthr = omp_get_max_threads();
if (nthr > MAX_NTHR) nthr=MAX_NTHR;
cout << "Running on " << nthr << " threads." << endl;
#pragma omp parallel num_threads(nthr)
{
#pragma omp for
for (int loop = 0 ; loop < 1000000; loop++)
{
int id=loop%nthr;
zgelss_( &m, &n, &nrhs, (MKL_Complex16 *)&(a[id*LDA*N]), &lda, (MKL_Complex16 *)&(b[id*LDB*N]), &ldb, &(s[id*4]), &rcond, &rank[id], (MKL_Complex16 *)&(work[id*264]), &lwork, rwork, &info );
}
}
sec = dsecnd()-sec;
cout << "time = " << sec << endl;
exit( 0 );
} /* End of ZGESV Example */

Regards,
Konstantin

Sunny_Beast · ‎08-18-2011

Dear Konstantin Arturov

Thankyou

Sunny

Sunny_Beast · ‎08-19-2011

Dear Konstantin

Thanks a lot I think I will be able to impliment the change in my code and even would be able paralleize few more subrotines ..

though i want to ask are there example available for explicit parallelization using compiller hedears and libraries.

Konstantin_A_Intel · ‎08-19-2011

Sunny, I'm glad to help and hope it'll improve the performance of your program.

Re "explicit parallelization using compiller hedears and libraries" - could you please clarify what precisely do you mean? If you mean libraries like MKL, than all the functions insideit are well optimized, but require tasks of larger size in order to start scaling.

If you mean an example where compiler might parallelize the code itself - this example doesn't fit to this conditions as there's a function call inside the loop and compiler cannot realize that the iterations could be run in parallel. It seems just impossible for the compiler.

To parallelize the code I used OpenMP technology supported by Intel compilers (included "omp.h" header, used #pragma omp directives and linked the application with -openmp flag). You may know more about OpenMP here:
http://www.openmp.org/mp-documents/spec30.pdf

Regards,
Konstantin

Sunny_Beast · ‎08-20-2011

Dear Support Provider

I tried to execute the above code as directed and it runs fine and give quite good scaling;
BUt when the wrok space query is removed i.e lwork is to 1032 teh code genrates the following result

MKL ERROR: Parameter 7 was incorrect on entry to ZGELSSMKL ERROR: Parameter 7 was incorrect on entry to ZGELSSMKL ERROR: Parameter 7 was incorrect on entry to ZGELSSMKL ERROR: Parameter 7 was incorrect on entry to ZGELSSMKL ERROR: Parameter 7 was incorrect on entry to ZGELSSMKL ERROR: Parameter 7 was incorrect on entry to ZGELSSMKL ERROR: Parameter 7 was incorrect on entry to ZGELSSMKL ERROR: Parameter 7 was incorrect on entry to ZGELSS

FYI:
I used the following statement

lwork = 1032;

zgelss_( &m, &n, &nrhs, (MKL_Complex16 *)&(a[id*LDA*N]), &lda, (MKL_Complex16 *)&(b[id*LDB*N]), &ldb, &(s[id*4]), &rcond, &rank[id], (MKL_Complex16 *)&(work[id*1032]), &lwork, &rwork[id*20], &info[id] );

Konstantin_A_Intel · ‎08-21-2011

Hi Sunny,

there's an error in the following statement: b[id*LDB*N]

In fact, it should be b[id*LDB*NRHS]:

zgelss_( &m, &n, &nrhs, (MKL_Complex16 *)&(a[id*LDA*N]), &lda, (MKL_Complex16 *)&(b[id*LDB*NRHS]), &ldb, &(s[id*4]), &rcond, &rank[id], (MKL_Complex16 *)&(work[id*1032]), &lwork, &rwork[id*20], &info[id] );

I hope this should work for you.

Regards,
Konstantin