Solved: Intel VML slow

sdgkgp · ‎02-14-2011

Hi,

I am a newbie so please bear with me if I provide irrelevant details.

I am trying to achieve the speeds reported in:
http://software.intel.com/sites/products/documentation/hpc/mkl/vml/functions/_performanceall.html

for the log function vsLn()

My simple C script contains just one call to vsLn().

I compile it on Windows using:

g++ -I"C:/PROGRA~1/R/R-212~1.1/include" -I"C:/Progra~1/Intel/ComposerXE-2011/mkl/include" -O2 -Wall -c MKLvml_main.cc -o MKLvml_main.o

g++ -shared -s -static-libgcc -o MKLvml.dll tmp.def MKLvml_main.o C:/Progra~1/Intel/ComposerXE-2011/mkl/lib/ia32/mkl_intel_c_dll.lib C:/Progra~1/Intel/ComposerXE-2011/mkl/lib/ia32/mkl_sequential_dll.lib C:/Progra~1/Intel/ComposerXE-2011/mkl/lib/ia32/mkl_core_dll.lib C:/Progra~1/Intel/ComposerXE-2011/mkl/lib/ia32/mkl_rt.lib -LC:/PROGRA~1/R/R-212~1.1/bin/i386 -lR

As you can see, I am using the sequential library. I also tried the parallel one, and the results are the same.

Can someone please suggest what I can do to improve the speed?

Currently 10^8 log operations (in a loop of 10^3 iterations, each computing the log of a vector of length 10^5) take around 6 s. The expected time is less than 0.5 s.

(The results I am getting are just a 2x improvement over the default log calculation. FYI, I am working inside R.)

Thanks.

Andrey_G_Intel2 · ‎02-14-2011

Hi sdgkgp!

Could you provide a few more details?
1) Your sample program would be helpful for us.
2) Which CPU are you running your sample on?

Andrey

Gennady_F_Intel · ‎02-14-2011

We also need to know the exact version of MKL you are using. Could you please let us know the Package ID?

You can find it in the mklsupport.txt file ( \Documentation\ )

--Gennady

sdgkgp · ‎02-15-2011

Hi Andrey,

1) My C code is as follows:

#include
#include "R.h"
#include "Rmath.h"
#include "mkl_vml.h"
#include "mkl.h"

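extern "C" {

void get_mkl_log(float *fB, int *Blen, float *fA, int *Alen) {
    vmlSetMode(VML_EP);         // enhanced-performance (lowest accuracy) mode
    MKL_INT vec_len = Alen[0];  // vector length passed in from R
    vsLn(vec_len, fA, fB);      // fB[i] = ln(fA[i])

    return;
}

}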

As you can see, there are some R header files, which enable R to talk to C++.

2) I am using Intel Core 2 Quad CPU Q9400 @ 2.66GHz

Please let me know what more details I can provide.

sdgkgp · ‎02-15-2011

Hello Gennady,

It is

Package ID: w_mkl_10.3.2.154 w_ccompxe_2011.2.154 w_fcompxe_2011.2.154

Thanks again for looking into this. Looking forward to your reply.

Andrey_G_Intel2 · ‎02-15-2011

sdgkgp,

Could you provide a full example? It will help us give an exact and quick answer. We also need to know how you fill the input vector, how you do the performance measurements, and so on.

Andrey

sdgkgp · ‎02-15-2011

As I mentioned, this is done inside R:

dyn.load("C:/RPackages/MKLvml/src/MKLvml.dll")
N = 1e3
in_vec = as.single( runif(N) ) # generates random uniform numbers between 0 and 1
out_vec = as.single( vector("numeric",N) ) # allocates memory for out_vec

system.time( # for performance measurement (time taken)
  for (i in 1:1e5)
  {
    t <- .C("get_mkl_log", dB = out_vec, Blen = as.integer(N), dA = in_vec, Alen = as.integer(N) ) # actual call
  }
)

The output I get is:

   user  system elapsed
   4.53    0.00    4.54

which means 4.54s were taken by the core process.

sdgkgp · ‎02-15-2011

The example above shows that the computation is running at an effective rate of:

1e8 * 3.01 / 4.54 cycles per second = 0.066 GHz

while my CPU runs at 2.66 GHz.

(1e8 log operations, each consuming 3.01 cycles, as given in the performance docs for vsLn in EP mode.)

Andrey_G_Intel2 · ‎02-16-2011

We will try to reproduce your situation. But I can already say that you did not measure vsLn performance alone. Your measurement also includes the overhead of loading the MKL DLLs, the call to vmlSetMode, and possibly some other overheads.

Andrey

barragan_villanueva_ · ‎02-16-2011

Hi,

Your linking line:

g++ -shared -s -static-libgcc -o MKLvml.dll tmp.def MKLvml_main.o C:/Progra~1/Intel/ComposerXE-2011/mkl/lib/ia32/mkl_intel_c_dll.lib C:/Progra~1/Intel/ComposerXE-2011/mkl/lib/ia32/mkl_sequential_dll.lib C:/Progra~1/Intel/ComposerXE-2011/mkl/lib/ia32/mkl_core_dll.lib C:/Progra~1/Intel/ComposerXE-2011/mkl/lib/ia32/mkl_rt.lib -LC:/PROGRA~1/R/R-212~1.1/bin/i386 -lR

uses the sequential library together with mkl_rt :( You'd better use one linking model.
Please try the MKL Link Line Advisor.

To eliminate the overhead of loading dynamic libraries, please use only static libraries if possible.

sdgkgp · ‎02-16-2011

Thank you everyone for your answers.

I was making a mistake in my performance evaluation.

It turns out R has a lot of overhead when communicating data to C and that is why it is so slow.

When I compute the timing from inside C, the numbers match those reported in the performance docs.
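
A minimal sketch of what timing from inside the compiled code can look like is shown below. It is not the code used in this thread: log_vector is a plain std::log loop standing in for vsLn so that the example builds without MKL, and both function names are hypothetical. One untimed warm-up call keeps one-time costs (library loading, mode setup) out of the measurement.

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Stand-in for vsLn(n, a, r): element-wise natural log of a into r.
// Assumption: replace the loop with the real MKL call when linking against MKL.
void log_vector(std::size_t n, const float* a, float* r) {
    for (std::size_t i = 0; i < n; ++i) r[i] = std::log(a[i]);
}

// Runs `reps` timed calls on a vector of length n and returns the
// average time per element in nanoseconds.
double ns_per_element(std::size_t n, int reps) {
    std::vector<float> in(n), out(n);
    for (std::size_t i = 0; i < n; ++i)
        in[i] = (i + 1.0f) / (n + 1.0f);  // inputs strictly between 0 and 1

    log_vector(n, in.data(), out.data());  // warm-up call, not timed

    volatile float sink = 0.0f;
    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) {
        log_vector(n, in.data(), out.data());
        sink = sink + out[r % n];  // read a result so the work is not optimized away
    }
    auto t1 = std::chrono::steady_clock::now();

    return std::chrono::duration<double, std::nano>(t1 - t0).count()
           / (static_cast<double>(n) * reps);
}
```

Calling ns_per_element(1000, 100000) mirrors the sizes used in the R loop. Multiplying the result by the clock rate in GHz gives cycles per element, which is the unit used in the performance docs.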

Sergey_M_Intel2 · ‎02-16-2011

Hi sdgkgp,

It still makes sense to understand why the overhead of R calling a third-party DLL is so big. We will experiment on our side and report back. If you also have some interesting findings on your side, we will be happy if you let us know about them.

Many thanks for your interest,
Sergey

Andrey_N_Intel · ‎02-24-2011

Hello Sdgkgp,

You seem to be using the .C function in your R application:
t <- .C("get_mkl_log", dB = out_vec, Blen = as.integer(N), dA = in_vec, Alen = as.integer(N) ) # actual call

According to Section 5.2 of the document "Writing R extensions", available at http://cran.r-project.org/doc/manuals/R-exts.html, the .C function can introduce additional per-argument overhead: "Unless formal argument NAOK is true, all the other arguments are checked for missing values NA and for the IEEE special values NaN, Inf and -Inf, and the presence of any of these generates an error."
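
To give a feel for what such a check costs, here is a rough C++ sketch of a scan for non-finite values. It is an illustration only, not R's actual implementation, and the function name is hypothetical. The point is that the check is a full pass over every argument on every call, so its cost grows with the vector length.

```cpp
#include <cmath>
#include <cstddef>

// Returns true if every element is finite, i.e. none is NaN, Inf or -Inf.
// R's numeric NA is itself a NaN value, so this kind of test catches it too.
bool all_finite(const double* x, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        if (!std::isfinite(x[i])) return false;
    }
    return true;
}
```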

You might want to try the .External or .Call functions as an alternative. Hope this helps.

Thanks,
Andrey