Improving chemistry calculation performance (exponential)

benoit_leveugle · ‎07-06-2010

Hi,

I am trying to improve the performance of our calculation code.
One of the most CPU-intensive subroutines is the chemistry calculation (Arrhenius law). It is a point-by-point exponential calculation, so each point is fully independent and a simple generic test can show what performance we could expect.

I tried the MKL library, but its performance is clearly worse than that of the standard math function.

Here is the source code (in Fortran 90, but someone used to C/C++ can easily understand it):

program test

implicit none

integer, parameter :: Nx1=100000000
real(8), dimension(1:Nx1) :: x,y
integer :: n
real(8) :: time1,time2,totaltime

y(:) = 0.0d0
do n=1,Nx1
x(n) = dcos(n*3.14d0/13.89567d0)
end do

call cpu_time(time1)
CALL vdExp(Nx1,x,y) !! INTEL MKL Subroutine
call cpu_time(time2)
totaltime=time2-time1
print *,"test 1 :",totaltime,sum(y)

y(:) = 0.0d0
do n=1,Nx1
x(n) = dcos(n*3.14d0/13.89567d0)
end do

call cpu_time(time1)
do n=1,Nx1
y(n) = dexp(x(n)) !! Standard Exponential function
end do
call cpu_time(time2)
totaltime=time2-time1
print *,"test 2 :",totaltime,sum(y)

end program

In fact, it computes Nx1 = 100000000 pseudo-random exponentials.

I compiled with the following arguments (on a Xeon Nehalem):
ifort -O4 -xsse4.2 -mkl test.f90

And the results are the following :
test 1 : 1.35000000000000 126606589.225275
test 2 : 0.420000000000000 126606589.225275

It is clear that the non-MKL calculation is far better (3 times faster).

Did I make a mistake?

Ben

Gennady_F_Intel · ‎07-06-2010

Hi Ben,

It's not clear how you linked MKL.

--Gennady

Gennady_F_Intel · ‎07-06-2010

Hi Ben,

Please forget my first message. :-) I missed one important point: the input vector is huge, and hence the behavior you observe is expected.

Please look at the VML performance chart here.

See the middle chart, "Performance vs Vector length", Exp function, for Intel Core i7 (LLC size 8 MB). When the vector length is around 10^6, the CPE stops decreasing and begins to increase. This is because the vector size (Nx1 * sizeof(double)) exceeds the size of the last-level cache (8 MB in this case).
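
As a rough check of the sizes involved (a back-of-the-envelope sketch, assuming 8-byte doubles and counting a single array only):

```latex
\begin{align*}
10^{6} \times 8\ \text{bytes} &= 8 \times 10^{6}\ \text{bytes} \approx 8\ \text{MB}
  && \text{(about the LLC size)} \\
10^{8} \times 8\ \text{bytes} &= 8 \times 10^{8}\ \text{bytes} = 800\ \text{MB}
  && \text{(Nx1 in the test program)}
\end{align*}
```

Since the test program reads x and writes y, both of that length, the data has to stream from main memory.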

--Gennady

TimP · ‎07-07-2010

This is one of the reasons for preferring compiler auto-vectorization (such as ifort does, using the SVML library). When you split loops explicitly and don't allow your intermediate results to stay in L1 cache, it is quite likely that any gain from an optimized library will be lost in cache misses. If you have a reason for not allowing compiler optimization and prefer the VML calls, you could try blocking your loops for cache locality.
If you require only 3 digits precision, why are you using double precision?
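
The blocking suggestion above could look roughly like the following. This is only a sketch under stated assumptions: the total length n and the block length nb are hypothetical values, and the intrinsic exp on an array section stands in for the vdExp call so that the program builds without MKL.

```fortran
program blocked_exp
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n  = 1000000   ! hypothetical total vector length
  integer, parameter :: nb = 1024      ! hypothetical block length (8 KB of doubles)
  real(dp), allocatable :: x(:)
  real(dp) :: e(nb)                    ! exponentials of the current block
  real(dp) :: total
  integer :: i, i0, m

  allocate(x(n))
  do i = 1, n
     x(i) = cos(i*3.14d0/13.89567d0)
  end do

  total = 0.0d0
  do i0 = 1, n, nb
     m = min(nb, n - i0 + 1)           ! the last block may be shorter
     ! Step 1: exponential of one block.
     ! With MKL linked this line would become: call vdexp(m, x(i0:i0+m-1), e)
     e(1:m) = exp(x(i0:i0+m-1))
     ! Step 2: use the block right away, while it is still in cache.
     total = total + sum(e(1:m))
  end do

  print *, "blocked sum:", total
end program blocked_exp
```

The point of the design is that the intermediate result e never grows beyond one block, so the second step reads it from cache instead of from main memory.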

benoit_leveugle · ‎07-08-2010

Thank you for your answers.

I have found that the poor performance was coming from the compilation. I had to use -mkl:sequential instead of plain -mkl. I am using the serial version of the code. Do I need to do the same when I run the MPI code (or should I use -mkl:parallel)?
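
One possible explanation, which nobody in this thread confirms, is the timer: cpu_time reports CPU time for the whole process, so with a threaded library the time of all threads is added together and the call can look slower than it really is. A minimal sketch that records wall-clock time with system_clock next to cpu_time is given here; the size n is a hypothetical value, and the intrinsic exp stands in for the MKL call so that the program builds without MKL.

```fortran
program wallclock_timing
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1000000    ! hypothetical size, smaller than in the test above
  real(dp), allocatable :: x(:), y(:)
  integer(kind=8) :: c0, c1, rate
  real(dp) :: t0, t1
  integer :: i

  allocate(x(n), y(n))
  do i = 1, n
     x(i) = cos(i*3.14d0/13.89567d0)
  end do

  call cpu_time(t0)                    ! CPU time of the whole process
  call system_clock(c0, rate)          ! wall-clock ticks and ticks per second
  do i = 1, n
     y(i) = exp(x(i))                  ! with MKL linked: call vdexp(n, x, y)
  end do
  call system_clock(c1)
  call cpu_time(t1)

  print *, "CPU time  (s):", t1 - t0
  print *, "wall time (s):", real(c1 - c0, dp) / real(rate, dp)
  print *, "checksum     :", sum(y)
end program wallclock_timing
```

If the two timings differ a lot when the threaded library is linked, the comparison in the first post was measuring summed thread time rather than elapsed time.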

Now, the LA test is 34% faster than the normal exponential, for a cumulative error of 5e-10.

Gennady
Regarding the size of the vector: in our calculations, the size per processor (MPI calculations) is something like 20 000 to 100 000 points to be computed. That is clearly far too many according to the performance charts. I think I will try to split the calculation into chunks, to stay with vectors of 1000-2000 points.
See below.

tim18
What do you mean by "If you have a reason for not allowing compiler optimization"?
In our code, the calculation is done as follows (this exponential operation costs a lot):

For each species (from 2 to ~250) :

do j=1,Nx2
do i=1,Nx1
W(i,j) = (A(i,j)**n) * (B(i,j)**m) * ... * dexp(T(i,j)*Cste)
end do
end do

So I was thinking about splitting it like this:

call vdexp(Nx1*Nx2,T(:,:)*Cste,W(:,:)) !! length is the total number of points, not i*j

do j=1,Nx2
do i=1,Nx1
W(i,j) = (A(i,j)**n) * (B(i,j)**m) * ... * w(i,j)
end do
end do

Gennady
And splitting to comply with the performance charts (but I am not sure about that):
If we consider that Nx1 (the range of i) is between 128 and 512, I think it's better.

do j=1,Nx2
call vdexp(Nx1,T(:,j)*Cste,W(:,j)) !! length is the column length Nx1, not the loop index i
do i=1,Nx1
W(i,j) = (A(i,j)**n) * (B(i,j)**m) * ... * w(i,j)
end do
end do
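
The column-by-column split can be checked against the fused loop with a small test like the one below. It is a sketch only: the grid sizes, the constant Cste, the single factor A**2 and the helper exp_vector are all hypothetical, and exp_vector stands in for vdExp(n, a, y) so that the program builds without MKL. It passes the column length Nx1 explicitly and stores T(:,j)*Cste in a reusable work vector instead of passing the expression directly, which would make the compiler create a temporary array on every call.

```fortran
program column_split_exp
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: Nx1 = 256, Nx2 = 64    ! hypothetical grid sizes
  real(dp), parameter :: Cste = -0.5d0         ! hypothetical constant
  real(dp) :: A(Nx1,Nx2), T(Nx1,Nx2), W(Nx1,Nx2), Wref(Nx1,Nx2)
  real(dp) :: buf(Nx1)                         ! reusable work vector, one column long
  integer :: i, j

  call random_number(A)
  call random_number(T)

  ! Reference: the fused loop (only one power factor kept for brevity).
  do j = 1, Nx2
     do i = 1, Nx1
        Wref(i,j) = A(i,j)**2 * exp(T(i,j)*Cste)
     end do
  end do

  ! Split version: one vector exponential per column, then the product.
  do j = 1, Nx2
     buf(:) = T(:,j) * Cste
     call exp_vector(Nx1, buf, W(:,j))         ! with MKL: call vdexp(Nx1, buf, W(:,j))
     do i = 1, Nx1
        W(i,j) = A(i,j)**2 * W(i,j)
     end do
  end do

  print *, "max abs difference:", maxval(abs(W - Wref))

contains

  ! Stand-in for MKL's vdExp(n, a, y); hypothetical helper for this sketch.
  subroutine exp_vector(n, a, y)
    integer, intent(in) :: n
    real(dp), intent(in) :: a(n)
    real(dp), intent(out) :: y(n)
    y = exp(a)
  end subroutine exp_vector

end program column_split_exp
```

With the real vdExp, the printed difference also shows how much accuracy the chosen VML mode costs compared with the standard exponential.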

tim18
In fact, I need something like 1e-10 precision to prevent the CFD code from blowing up, which is why I use double precision. I did not deliberately use inaccurate values in the test. :)

TimP · ‎07-08-2010

The mkl_sequential library is likely to be appropriate under MPI, if you don't see a benefit from MKL threading when running on a single node. This decision depends on many factors which haven't been discussed here.
I wished to point out that what you call "normal exponential" might well be optimized into SVML vector calls by ifort, giving performance competitive with VML, without incurring the cache locality problem.

benoit_leveugle · ‎07-09-2010

I am sorry, but I cannot figure out how to use SVML :(

Do I need to install an additional library (like I did for MKL), or do I need to add something specific at compile time?
I have found the page concerning C/C++:
http://software.intel.com/en-us/articles/how-to-implement-the-short-vector-math-library/

I am really interested in SVML if the prototype of the function is the same as for the "normal exponential" function, but I cannot find anything useful concerning Fortran, except the name of the library: libsvml.a

Gennady_F_Intel · ‎07-09-2010

SVML (Short Vector Math Library) is not part of MKL; it is part of the Intel compiler.

Look into ia32intrin.h, where you can find the whole API. You can also find more details in the compiler documentation.

As an example, instead of vdExp() it would be _mm_exp_pd(__m128d v1), the packed double-precision exponential (_mm_cexp_ps is the single-precision complex variant).

But, as Tim18 already said, if you use the Intel compiler you don't need to care about the SVML routines, because in your particular case the compiler will use them automatically. You can check this by looking at the generated asm code.

--Gennady

benoit_leveugle · ‎07-12-2010

Sorry for the delay, I could not get on the Internet this weekend.

OK, I understand how it works now: it just replaces the standard exponential routine. But even with SVML, I still get better performance with the vdexp subroutine, so I think I will use that one.

I still have one question: when should I use -mkl:sequential, -mkl:cluster and -mkl:parallel? I could not find out how to choose the best one...