Intel Community › Software › Software Development SDKs and Libraries › Intel® oneAPI Math Kernel Library › Increase chemistry calculation performances, EXPONENTIAL

benoit_leveugle
Beginner
07-06-2010 10:51 AM
56 Views

Increase chemistry calculation performances, EXPONENTIAL

I am trying to improve the performance of our calculation code.

One of the most CPU-costly subroutines is the chemistry calculation (Arrhenius law). It is a point-by-point exponential calculation, so each point is fully independent, and a simple generic test can show what performance we could expect.

I tried the MKL library, but its performance is clearly worse than the simple mathematical function.

Here is the source code (in Fortran 90, but anyone used to C/C++ can easily understand it):

```fortran
program test
   implicit none
   integer, parameter :: Nx1 = 100000000
   real(8), dimension(1:Nx1) :: x, y
   integer :: n
   real(8) :: time1, time2, totaltime

   y(:) = 0.0d0
   do n = 1, Nx1
      x(n) = dcos(n*3.14d0/13.89567d0)
   end do

   call cpu_time(time1)
   call vdExp(Nx1, x, y)       !! Intel MKL subroutine
   call cpu_time(time2)
   totaltime = time2 - time1
   print *, "test 1 :", totaltime, sum(y)

   y(:) = 0.0d0
   do n = 1, Nx1
      x(n) = dcos(n*3.14d0/13.89567d0)
   end do

   call cpu_time(time1)
   do n = 1, Nx1
      y(n) = dexp(x(n))        !! standard exponential function
   end do
   call cpu_time(time2)
   totaltime = time2 - time1
   print *, "test 2 :", totaltime, sum(y)
end program test
```

In fact, each test computes Nx1 = 100,000,000 exponentials.

I compiled with the following arguments (on a Xeon Nehalem):

ifort -O4 -xsse4.2 -mkl test.f90

And the results are the following :

test 1 : 1.35000000000000 126606589.225275

test 2 : 0.420000000000000 126606589.225275

It is clear that the non-MKL calculation is far faster (about 3 times in this test).

Did I make a mistake?

Ben


8 Replies

Gennady_F_Intel
Moderator
07-06-2010 09:43 PM

Hi Ben,

It's not clear how you linked MKL.

--Gennady

Gennady_F_Intel
Moderator
07-06-2010 11:23 PM

Hi Ben,

Please forget my first message :-) I missed one important point: the input vector size is very large, and hence the behavior you observe is expected.

Please look at the VML performance chart here.

See the middle chart, "Performance vs Vector Length", for the Exp function on an Intel Core i7 (LLC size 8 MB). When the vector length is around 10^6, the CPE (clocks per element) ceases to decrease and begins to increase. This is because the vector size (Nx1 * sizeof(double) = 10^8 × 8 bytes = 800 MB per array in your test) far exceeds the size of the last-level cache (8 MB in this case).

--Gennady

TimP
Black Belt
07-07-2010 08:03 AM

If you require only 3 digits of precision, why are you using double precision?

benoit_leveugle
Beginner
07-08-2010 04:27 AM

I have found that the poor performance was coming from the compilation: I had to use -mkl:sequential instead of -mkl alone. I am using the serial version of the code. Do I need to do the same when I run the MPI code (or use -mkl:parallel)?

Now, the LA (low-accuracy VML mode) test is 34% faster than the normal exponential, for a cumulated error of 5e-10.

Regarding the size of the vector: in our calculations, the size per processor (MPI calculations) is something like 20,000 to 100,000 points to be computed. That is clearly far too much according to the performance graphs, so I think I will cut the calculation into chunks of 1000-2000 points.

See below.

What do you mean by "If you have a reason for not allowing compiler optimization"?

In our code, the calculation is made as follows (this exponential operation costs a lot):

```fortran
do j = 1, Nx2                  ! for each species (from 2 to ~250)
   do i = 1, Nx1
      W(i,j) = (A(i,j)**n) * (B(i,j)**m) * ... * dexp(T(i,j)*Cste)
   end do
end do
```

So I was thinking about splitting it like this, with the exponentials precomputed into w beforehand:

```fortran
do j = 1, Nx2
   do i = 1, Nx1
      W(i,j) = (A(i,j)**n) * (B(i,j)**m) * ... * w(i,j)
   end do
end do
```

If we consider that the size of i is between 128 and 512, I think it's better.

In fact, I need something like 1e-10 precision to prevent the CFD code from blowing up; that is why I use double precision. I did not deliberately use inaccurate values in the test. :)
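[Editor's note] Ben's chunking idea can be sketched as a self-contained program. This is a hypothetical illustration, not code from the thread: it processes the array in blocks of 1024 points so each block's working set stays in cache, and it uses the intrinsic exp in place of vdExp so it builds without MKL (with MKL linked, the marked line would become `call vdExp(nlen, x(i0), y(i0))`).

```fortran
program chunked_exp
   implicit none
   integer, parameter :: N = 100000   ! assumed per-rank problem size
   integer, parameter :: CHUNK = 1024 ! block length kept cache-resident
   real(8) :: x(N), y(N), yref(N)
   integer :: n, i0, nlen

   ! same synthetic input as the original test program
   do n = 1, N
      x(n) = dcos(n*3.14d0/13.89567d0)
   end do

   ! reference: direct point-by-point exponential
   do n = 1, N
      yref(n) = dexp(x(n))
   end do

   ! chunked evaluation: CHUNK points at a time
   do i0 = 1, N, CHUNK
      nlen = min(CHUNK, N - i0 + 1)
      ! with MKL this line would be:  call vdExp(nlen, x(i0), y(i0))
      y(i0:i0+nlen-1) = exp(x(i0:i0+nlen-1))
   end do

   ! chunked and direct results must agree exactly here
   if (maxval(abs(y - yref)) > 1.0d-14) then
      print *, 'FAIL'
   else
      print *, 'PASS'
   end if
end program chunked_exp
```

The chunk length is a tuning knob: the VML chart suggests peak throughput somewhere around 10^3 elements, below the point where the arrays spill out of the last-level cache.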

TimP

Black Belt

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

07-08-2010
08:20 AM

56 Views

I wanted to point out that what you call the "normal exponential" might well be optimized into SVML vector calls by ifort, giving performance competitive with VML without incurring the cache-locality problem.
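[Editor's note] One way to check whether ifort replaced the dexp loop with SVML calls is to inspect the generated assembly; the exact symbol names vary by compiler version, but vectorized math calls conventionally carry an `svml` prefix:

```shell
# compile to assembly only (the dexp loop needs no MKL linking)
ifort -O3 -xsse4.2 -S test.f90 -o test.s

# vectorized exponentials typically appear as calls such as
# __svml_exp2 or similar; no matches means the loop stayed scalar
grep -i svml test.s
```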

benoit_leveugle
Beginner
07-09-2010 12:37 AM

Do I need to install a complementary library (as I did for MKL), or do I need to add something specific at compilation?

I have found the page concerning C/C++:

http://software.intel.com/en-us/articles/how-to-implement-the-short-vector-math-library/

I am really interested in this SVML if the prototype of the function is the same as the normal exponential function, but I cannot find anything valuable concerning Fortran, except the name of the library: libsvml.a

Gennady_F_Intel
Moderator
07-09-2010 06:11 AM

SVML (the short vector math library) is not part of MKL, but part of the Intel compiler.

Look into ia32intrin.h; you can find the whole API there. You can also find more details in the compiler documentation.

As an example, instead of vdExp() it would be _mm_cexp_ps(__m128 v1).

But, as Tim18 already said, if you use the Intel compiler you don't need to care about the SVML routines, because in your particular case the Intel compiler will use the SVML routines automatically. You can check it by looking at the asm code.

--Gennady

benoit_leveugle
Beginner
07-12-2010 04:49 AM

OK, I understand how it works now: it just replaces the standard exponential routine. But even with SVML, I still get better performance with the vdExp subroutine, so I think I will use that one.

I still have one question: when should I use -mkl:sequential, -mkl:cluster, or -mkl:parallel? I didn't find how to choose which one is better...
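[Editor's note] The thread ends without an answer, so for reference, this is the conventional meaning of the three linking variants; the exact libraries pulled in depend on the compiler and MKL version, and Intel's MKL Link Line Advisor is the authoritative source for a given setup:

```shell
# single-threaded MKL: the usual choice when the application supplies
# its own parallelism, e.g. one MPI rank per core
ifort -mkl:sequential app.f90

# OpenMP-threaded MKL (the default when plain -mkl is given); MKL
# threads can then oversubscribe cores if the app is also parallel
ifort -mkl:parallel app.f90

# adds the cluster components (ScaLAPACK, cluster FFT) for MPI codes
# that call MKL's distributed routines
ifort -mkl:cluster app.f90
```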


For more complete information about compiler optimizations, see our Optimization Notice.