topic Re: dgetri performance issues in Intel® oneAPI Math Kernel Library

dgetri performance issues

hpc-matt — Tue, 03 Nov 2009 01:18:45 GMT

I have been using the dgetri and dgetrf functions on my machine, but the performance I have been getting on my random matrices has been extremely poor (5 GFLOPS running on all 4 cores).

My question is: is there something I have set up wrong by relying on the default settings of my ifort and icc installations, or am I perhaps calling the routines incorrectly?

Specs:
Intel Q6600 Core 2 Quad
4 GB DDR2 RAM
Ubuntu 9.04 x86_64

Code (C):

for(x=0;x {
//Variables Needed to be reset for
M=j;
N=j;
LDA=M;
LWORK=N;
INFO=0;
createMatrix(&M, &N, &A);
IPIV=(MKL_INT *)malloc(M*sizeof(int));
WORK=(double *)malloc(M*sizeof(double));

DGETRF( &M, &N, A, &LDA, IPIV, &INFO );
gettimeofday(&time_s, NULL);
DGETRI( &N, A, &LDA, IPIV, WORK, &LWORK, &INFO );
gettimeofday(&time_e, NULL);

cpuTime=0;
CPU_gflops=0;
temp=0;
cpuTime=1e3*(time_e.tv_sec -time_s.tv_sec) + (time_e.tv_usec
-time_s.tv_usec)*1e-3;
//Found in lawn41 lapack manual for greatest term in O(n) notation, p121
temp = (1.0f*M*N*N);//O(2mn^2)

CPU_gflops = (temp/cpuTime) * 1e-6;
avg_flops=CPU_gflops;

free(A);
free(IPIV);
free(WORK);

}

Makefile:

FC = ifort
CC = icc
FCFLAGS = -O3 -cm -w
CCFLAGS = -O3
CXXDIR = /opt/intel/Compiler/11.1/038
LIBDIR:= $(CXXDIR)/mkl/lib/em64t
LIBS:= $(LIBDIR)/libmkl_intel_lp64.a
LIBS += -Wl,--start-group -L$(LIBDIR) $(LIBDIR)/libmkl_intel_thread.a $(LIBDIR)/libmkl_core.a -Wl,--end-group -L$(LIBDIR) -liomp5 -lpthread

OBJECTS = makematrix.o \
MatrixMath.o

DGETRI : $(OBJECTS) DGETRIDriver.o
$(CC) -o $@ $(OBJECTS) DGETRIDriver.o -L$(LIBDIR) $(LIBS)

This is my first time working with MKL, so any help is appreciated. Thanks!

Matt

Re: dgetri performance issues

Gennady_F_Intel — Tue, 03 Nov 2009 04:49:26 GMT

Matt,
it depends on the size of the problem you are running on these 4 cores.
Intel Math Kernel Library (Intel MKL) offers highly optimized routines for medium and large input sizes.
For your reference, please see
http://software.intel.com/sites/products/collateral/hpc/mkl/mkl_indepth.pdf
where you can find some performance data for MKL's dgetrf versus ATLAS.
--Gennady

Re: dgetri performance issues

hpc-matt — Tue, 03 Nov 2009 17:32:45 GMT

Quoting - Gennady Fedorov (Intel)

I am using matrices of dimension 2k to 12k. I have been benchmarking my machine, and the dgetrf routine performs about the same as the standard benchmarks; however, the DGETRI function is underperforming substantially. I realize the runtime complexity is on the order of O(n*m^2), but still, if I can get 30+ GFLOPS for dgetrf, I should be able to get half of that using dgetri. I am currently getting around 3 GFLOPS, with decreasing performance as size increases. It also does not matter whether I am using Fortran or C. Thanks!
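
For reference, the rate arithmetic in the posted loop can be written as two small helpers. This is only a sketch: `elapsed_seconds` and `gflops` are hypothetical names, and the flop count is passed in by the caller (the posted code uses M*N*N as its estimate) rather than derived here.

```c
#include <assert.h>
#include <sys/time.h>

/* Wall-clock seconds between two gettimeofday() samples. */
double elapsed_seconds(struct timeval start, struct timeval end)
{
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_usec - start.tv_usec) * 1e-6;
}

/* Convert a flop count and an elapsed time in seconds to GFLOPS.
 * The flop count is the caller's estimate; returns 0 if no time elapsed. */
double gflops(double flops, double seconds)
{
    assert(flops >= 0.0);
    if (seconds <= 0.0)
        return 0.0;
    return flops / seconds * 1e-9;
}
```

With the variables from the original loop, a call would look like gflops(1.0*M*N*N, elapsed_seconds(time_s, time_e)). Working in seconds avoids the milliseconds-and-1e-6 bookkeeping of the posted version but gives the same number.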

Re: dgetri performance issues

TimP — Wed, 04 Nov 2009 13:57:40 GMT

In case correcting your assignment of lwork doesn't help:
It looks as if you are hitting a cache capacity limit. Did you check cache events? Once you find which function is taking up the time, it may be interesting to compile that one from source so that you can analyze it with VTune or PTU.

Re: dgetri performance issues

Alexander_K_Intel3 — Thu, 05 Nov 2009 06:33:53 GMT

Matt,

You did not allocate enough workspace for DGETRI to achieve high performance. You should use LWORK=N*NB, where NB=64 in particular. You can also request the optimal workspace size from DGETRI itself:

int MONE=-1;
double LWKOPT;
DGETRI( &N, A, &LDA, IPIV, &LWKOPT, &MONE, &INFO );
LWORK=(int)LWKOPT;

Please also note that in your example, instead of
WORK=(double *)malloc(M*sizeof(double));
it should be:
WORK=(double *)malloc(LWORK*sizeof(double));

--Alexander
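
As a minimal sketch of the two steps described above (query the size, then allocate LWORK doubles), the C fragment below puts the pattern in one helper. So that it builds without MKL, it calls `stub_getri`, a hypothetical stand-in that mimics only the LAPACK workspace-query convention and does no inversion. `alloc_getri_workspace` and `NB_ASSUMED` are likewise names made up for this illustration.

```c
#include <assert.h>
#include <stdlib.h>

#define NB_ASSUMED 64  /* block size quoted in the reply; an assumption here */

/* Hypothetical stand-in for DGETRI so the sketch builds without MKL.
 * It follows only the workspace-query convention: when *lwork == -1 it
 * stores the preferred workspace size in work[0] and returns.
 * It does NOT invert anything. */
void stub_getri(const int *n, double *a, const int *lda, const int *ipiv,
                double *work, const int *lwork, int *info)
{
    (void)a; (void)lda; (void)ipiv;
    *info = 0;
    if (*lwork == -1) {
        work[0] = (double)(*n) * NB_ASSUMED;
        return;
    }
    if (*lwork < (*n > 1 ? *n : 1))
        *info = -6;  /* LAPACK convention: argument 6 (LWORK) is invalid */
}

/* Step 1: ask for the workspace size with LWORK = -1.
 * Step 2: allocate that many doubles (not M or N doubles).
 * Returns the buffer (caller frees) and stores its length in *lwork_out;
 * returns NULL if the query or the allocation fails. */
double *alloc_getri_workspace(int n, double *a, int lda, const int *ipiv,
                              int *lwork_out)
{
    int query = -1, info = 0;
    double wkopt = 0.0;

    assert(n >= 0 && lwork_out != NULL);

    /* In real code this call would be DGETRI from MKL. */
    stub_getri(&n, a, &lda, ipiv, &wkopt, &query, &info);
    if (info != 0)
        return NULL;

    *lwork_out = (int)wkopt;
    return malloc((size_t)*lwork_out * sizeof(double));
}
```

In the original loop this would replace both the LWORK=N assignment and the WORK malloc: WORK=alloc_getri_workspace(N, A, LDA, IPIV, &LWORK); followed by the real DGETRI call with that WORK and LWORK. For N=4000 the stub reports 256000 doubles, about 2 MB, compared with the 4000 doubles allocated before.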

Re: dgetri performance issues

hpc-matt — Tue, 10 Nov 2009 23:39:53 GMT

Quoting - Alexander Kobotov (Intel)

Thanks, that did improve my results substantially. I am working on getting VTune set up and working now. Thanks all for your help.

Matt