Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

dgetri performance issues

hpc-matt
Beginner
I have been using the dgetri and dgetrf functions on my machine, but the performance I have been getting on my random matrices has been extremely poor (5 GFLOPS running on all 4 cores).

My question is: is there something I have set up wrong by relying on the default settings of my ifort and icc installations, or am I perhaps calling the routines incorrectly?

Specs:
Intel Q6600 Core 2 Quad
4 GB DDR2 RAM
Ubuntu 9.04 x86_64

Code (C):

for(x=0;x {
//Variables Needed to be reset for
M=j;
N=j;
LDA=M;
LWORK=N;
INFO=0;
createMatrix(&M, &N, &A);
IPIV=(MKL_INT *)malloc(M*sizeof(int));
WORK=(double *)malloc(M*sizeof(double));




DGETRF( &M, &N, A, &LDA, IPIV, &INFO );
gettimeofday(&time_s, NULL);
DGETRI( &N, A, &LDA, IPIV, WORK, &LWORK, &INFO );
gettimeofday(&time_e, NULL);


cpuTime=0;
CPU_gflops=0;
temp=0;
cpuTime=1e3*(time_e.tv_sec -time_s.tv_sec) + (time_e.tv_usec
-time_s.tv_usec)*1e-3;
//Found in lawn41 lapack manual for greatest term in O(n) notation, p121
temp = (1.0f*M*N*N);//O(2mn^2)

CPU_gflops = (temp/cpuTime) * 1e-6;
avg_flops=CPU_gflops;

free(A);
free(IPIV);
free(WORK);

}

Makefile:

FC = ifort
CC = icc
FCFLAGS = -O3 -cm -w
CCFLAGS = -O3
CXXDIR = /opt/intel/Compiler/11.1/038
LIBDIR:= $(CXXDIR)/mkl/lib/em64t
LIBS:= $(LIBDIR)/libmkl_intel_lp64.a
LIBS += -Wl,--start-group -L$(LIBDIR) $(LIBDIR)/libmkl_intel_thread.a $(LIBDIR)/libmkl_core.a -Wl,--end-group -L$(LIBDIR) -liomp5 -lpthread

OBJECTS = makematrix.o \
MatrixMath.o




DGETRI : $(OBJECTS) DGETRIDriver.o
$(CC) -o $@ $(OBJECTS) DGETRIDriver.o -L$(LIBDIR) $(LIBS)


This is my first time working with MKL, so any help is appreciated, Thanks!

Matt
1 Solution
Alexander_K_Intel3
Matt,

You did not allocate enough workspace for DGETRI to achieve high performance. You should use LWORK=N*NB, where NB=64 in particular. You could also request the optimal workspace size from DGETRI itself:

int MONE=-1;
double LWKOPT;
DGETRI( &N, A, &LDA, IPIV, &LWKOPT, &MONE, &INFO );
LWORK=(int)LWKOPT;

Please also note that in your example, instead of
WORK=(double *)malloc(M*sizeof(double));
it should be:
WORK=(double *)malloc(LWORK*sizeof(double));

--Alexander
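
The two-call pattern described above (query with LWORK=-1, then allocate and call again) could be wrapped up roughly as follows. This is only a sketch under stated assumptions: `getri_fn`, `invert_with_queried_workspace`, `fake_getri` and the `-100` error code are hypothetical names introduced for illustration, and the function pointer stands in for the real routine so that the sketch compiles without MKL. It also assumes the LP64 interface, where the integer arguments are plain `int`.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical function-pointer type mirroring the Fortran-style DGETRI
   argument list used in this thread (LP64 assumed: integers are int). */
typedef void (*getri_fn)(int *n, double *a, int *lda, int *ipiv,
                         double *work, int *lwork, int *info);

/* Inverts an LU-factored n-by-n matrix in place using two calls:
   1) lwork = -1 asks the routine for its preferred workspace size,
   2) the workspace is allocated and the real call is made.
   Returns INFO from the routine, or -100 (an arbitrary value chosen
   for this sketch) if the allocation fails. */
int invert_with_queried_workspace(getri_fn getri, int n, double *a,
                                  int lda, int *ipiv)
{
    assert(n >= 0 && lda >= (n > 1 ? n : 1));

    int info = 0;
    int query = -1;
    double wkopt = 0.0;

    /* Workspace query: only wkopt and info are written. */
    getri(&n, a, &lda, ipiv, &wkopt, &query, &info);
    if (info != 0)
        return info;

    /* Never go below the documented minimum of max(1, n). */
    int lwork = (int)wkopt;
    int min_lwork = n > 1 ? n : 1;
    if (lwork < min_lwork)
        lwork = min_lwork;

    double *work = malloc((size_t)lwork * sizeof *work);
    if (work == NULL)
        return -100;

    getri(&n, a, &lda, ipiv, work, &lwork, &info);
    free(work);
    return info;
}

/* Stand-in for DGETRI so the sketch can be exercised without MKL.
   It answers the query with n*64 (mirroring the N*NB suggestion above)
   and records the lwork it receives on the real call. */
int last_lwork_seen = 0;

void fake_getri(int *n, double *a, int *lda, int *ipiv,
                double *work, int *lwork, int *info)
{
    (void)a; (void)lda; (void)ipiv;
    *info = 0;
    if (*lwork == -1) {
        work[0] = (double)(*n) * 64.0;
        return;
    }
    last_lwork_seen = *lwork;
}
```

With MKL linked in, the idea would be to pass the real DGETRI in place of `fake_getri`, after DGETRF has filled `a` and `ipiv`. Depending on the MKL version, the header prototype may differ (for example in const qualifiers), in which case a thin wrapper with the signature above may be needed.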


5 Replies
Gennady_F_Intel
Moderator
Matt,
It depends on the size of the problem you are running on these 4 cores.
Intel Math Kernel Library (Intel MKL) offers highly optimized routines for medium and large input sizes.
For your reference, please see
http://software.intel.com/sites/products/collateral/hpc/mkl/mkl_indepth.pdf
where you can find some performance data for MKL's dgetrf versus ATLAS.
--Gennady
hpc-matt
Beginner
I am using matrices of dimension 2k to 12k. I have been benchmarking my machine, and the dgetrf routine performs about the same as the standard benchmarks; however, the DGETRI function is underperforming substantially. I realize the runtime complexity is on the order of O(n*m^2), but still, if I can get 30+ GFLOPS for dgetrf, I should be able to get half of that using dgetri. I am currently getting around 3 GFLOPS, with decreasing performance as size increases. It also does not matter whether I am using Fortran or C. Thanks!
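
For anyone reproducing these numbers, the GFLOPS figure can be derived from the two gettimeofday() samples along the following lines. This is a minimal sketch, not the code used above: the helper names are hypothetical, and the flop count is an assumption that keeps only the leading term (4/3)n^3, which is my reading of the DGETRI operation count in LAWN 41. The original snippet uses M*N*N instead, so substitute whichever count you prefer.

```c
#include <assert.h>
#include <sys/time.h>

/* Elapsed wall-clock time in seconds between two gettimeofday() samples. */
double elapsed_seconds(struct timeval start, struct timeval end)
{
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_usec - start.tv_usec) * 1e-6;
}

/* Approximate flop count for inverting an n-by-n matrix from its LU
   factors. ASSUMPTION: leading term (4/3)n^3 only; lower-order terms
   are dropped, which matters little at n in the thousands. */
double getri_flops(int n)
{
    double dn = (double)n;
    return 4.0 / 3.0 * dn * dn * dn;
}

/* Converts a flop count and a duration in seconds to GFLOPS. */
double gflops(double flops, double seconds)
{
    assert(seconds > 0.0);
    return flops / seconds * 1e-9;
}
```

In the benchmark loop this would be used as `gflops(getri_flops(N), elapsed_seconds(time_s, time_e))`, with `time_s` and `time_e` sampled immediately before and after the DGETRI call.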
TimP
Honored Contributor III
In case correcting your assignment of lwork doesn't help:
It looks as if you are hitting a cache capacity limit. Did you check cache events? Once you find which function is taking up the time, it may be interesting to compile that one from source so that you can analyze it with VTune or PTU.
hpc-matt
Beginner
Thanks, that did improve my results substantially. I am working on getting VTune set up and working now. Thanks all for your help.

Matt