Solved: Re: dgetri performance issues

hpc-matt · ‎11-02-2009

I have been using the dgetri and dgetrf functions on my machine, but the performance I have been getting on my random matrices has been extremely poor (5 GFLOPS running on all 4 cores).

My question is: is there something I have set up wrong by relying on the default settings of my ifort and icc installations, or am I perhaps calling the routines incorrectly?

Specs:
Intel Q6600 Core 2 Quad
4 GB DDR2 RAM
Ubuntu 9.04 x86_64

Code (C):

for(x=0;x {
//Variables Needed to be reset for
M=j;
N=j;
LDA=M;
LWORK=N;
INFO=0;
createMatrix(&M, &N, &A);
IPIV=(MKL_INT *)malloc(M*sizeof(int));
WORK=(double *)malloc(M*sizeof(double));

DGETRF( &M, &N, A, &LDA, IPIV, &INFO );
gettimeofday(&time_s, NULL);
DGETRI( &N, A, &LDA, IPIV, WORK, &LWORK, &INFO );
gettimeofday(&time_e, NULL);

cpuTime=0;
CPU_gflops=0;
temp=0;
cpuTime=1e3*(time_e.tv_sec -time_s.tv_sec) + (time_e.tv_usec
-time_s.tv_usec)*1e-3;
//Found in lawn41 lapack manual for greatest term in O(n) notation, p121
temp = (1.0f*M*N*N);//O(2mn^2)

CPU_gflops = (temp/cpuTime) * 1e-6;
avg_flops=CPU_gflops;

free(A);
free(IPIV);
free(WORK);

}

Makefile:

FC = ifort
CC = icc
FCFLAGS = -O3 -cm -w
CCFLAGS = -O3
CXXDIR = /opt/intel/Compiler/11.1/038
LIBDIR:= $(CXXDIR)/mkl/lib/em64t
LIBS:= $(LIBDIR)/libmkl_intel_lp64.a
LIBS += -Wl,--start-group -L$(LIBDIR) $(LIBDIR)/libmkl_intel_thread.a $(LIBDIR)/libmkl_core.a -Wl,--end-group -L$(LIBDIR) -liomp5 -lpthread

OBJECTS = makematrix.o \
MatrixMath.o

DGETRI : $(OBJECTS) DGETRIDriver.o
$(CC) -o $@ $(OBJECTS) DGETRIDriver.o -L$(LIBDIR) $(LIBS)

This is my first time working with MKL, so any help is appreciated, Thanks!

Matt

Gennady_F_Intel · ‎11-02-2009

Matt,
It will depend on the size of the task you are running on these 4 cores.
Intel Math Kernel Library (Intel MKL) offers highly optimized routines for medium and large input sizes.
For your reference, please see
http://software.intel.com/sites/products/collateral/hpc/mkl/mkl_indepth.pdf
where you can find some performance data for dgetrf in MKL vs. ATLAS.
--Gennady

hpc-matt · ‎11-03-2009

I am using matrices of dimension 2k to 12k. I have been benchmarking my machine, and the dgetrf routine performs about the same as the standard benchmarks; however, the DGETRI function is underperforming substantially. I realize the runtime complexity is on the order of O(n*m^2), but still, if I can get 30+ GFLOPS for dgetrf, I should be able to get half of that using dgetri. I am currently getting around 3 GFLOPS, with decreasing performance as size increases. It also does not matter whether I am using Fortran or C. Thanks!
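
As a rough way to sanity-check figures like these, the GFLOPS rate can be computed from the two gettimeofday timestamps and an operation count. The sketch below assumes the leading-order counts for a square n-by-n matrix, roughly (2/3)n^3 for dgetrf and (4/3)n^3 for dgetri; the helper names are hypothetical and are not taken from the code posted above.

```c
#include <assert.h>
#include <sys/time.h>

/* Elapsed wall-clock time in seconds between two gettimeofday() samples. */
double elapsed_seconds(const struct timeval *start, const struct timeval *end)
{
    return (double)(end->tv_sec - start->tv_sec)
         + (double)(end->tv_usec - start->tv_usec) * 1e-6;
}

/* Assumed leading-order flop count for inverting an n-by-n matrix
   from its LU factors (the dgetri step only): (4/3) * n^3. */
double getri_flops(int n)
{
    assert(n >= 0);
    return (4.0 / 3.0) * (double)n * (double)n * (double)n;
}

/* Convert a flop count and a duration in seconds to GFLOPS. */
double gflops(double flops, double seconds)
{
    assert(seconds > 0.0);
    return flops / seconds * 1e-9;
}
```

For example, by this count a 4000-by-4000 inversion is about 8.5e10 flops, so a DGETRI call that takes 10 seconds would correspond to roughly 8.5 GFLOPS.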

TimP · ‎11-04-2009

In case correcting your assignment of lwork doesn't help:
It looks as if you are hitting a cache capacity limit. Did you check cache events? It may be interesting, once you find which function is taking up the time, to compile that one from source so as to analyze it with VTune or PTU.

Alexander_K_Intel3 · ‎11-04-2009

Matt,

There is not enough workspace allocated for DGETRI to achieve high performance. You should use LWORK=N*NB, where in particular NB=64. You could also request the optimal workspace size from DGETRI itself:

int MONE=-1;
double LWKOPT;
DGETRI( &N, A, &LDA, IPIV, &LWKOPT, &MONE, &INFO );
LWORK=(int)LWKOPT;

Please also note that in your example, instead of
WORK=(double *)malloc(M*sizeof(double));
it should be:
WORK=(double *)malloc(LWORK*sizeof(double));

--Alexander
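
To make the sizing arithmetic above concrete, here is a minimal sketch of how LWORK and the WORK allocation size could be derived. The helper names are hypothetical; the only assumptions beyond the post are that LWORK must be at least max(1, N), as the LAPACK documentation for DGETRI states, and that the query result arrives as a double in the first element of WORK.

```c
#include <assert.h>
#include <stddef.h>

/* Blocked workspace size in doubles: LWORK = N * NB (NB = 64 in the post). */
int getri_lwork_blocked(int n, int nb)
{
    assert(n >= 0 && nb >= 1);
    return n * nb;
}

/* Convert the value returned by a workspace query (LWORK = -1) into an
   integer LWORK, never going below the minimum of max(1, N). */
int getri_lwork_from_query(int n, double lwkopt)
{
    int min_lwork = n > 1 ? n : 1;
    int lwork = (int)lwkopt;
    return lwork > min_lwork ? lwork : min_lwork;
}

/* Number of bytes to pass to malloc() for a WORK array of lwork doubles. */
size_t getri_work_bytes(int lwork)
{
    assert(lwork >= 1);
    return (size_t)lwork * sizeof(double);
}
```

With N=12000 and NB=64 this gives 768000 doubles, about 6.1 MB with 8-byte doubles, compared with the 12000 doubles (96 KB) that LWORK=N provides.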

hpc-matt · ‎11-10-2009

Thanks, that did improve my results substantially. I am working on getting VTune set up and working now. Thanks all for your help.

Matt