optimization?

hcgeorg · 07-01-2007

Ok, so I'm doing some tests with BLAS routines. I wrote test programs in Fortran 95 that perform 2 million dot products in three ways: using the BLAS routine 'sdot', using the intrinsic Fortran function 'dot_product', and using a function of my own called 'dot'. The code is below:

--------------------------------------
program teste_blas

implicit none

integer i
real v21(3),v23(3),r212,r232,r21,r23,time,sdot

external sdot

call random_number(v21)
call random_number(v23)

do i=1,1000000
r212=sdot(3,v21,1,v21,1)
r232=sdot(3,v23,1,v23,1)
end do

call cpu_time(time)

write(*,*) 'Products = ',r212,r232
write(*,*) 'CPU time = ',time

end
-------------------------------------------
program teste

implicit none

integer i
real v21(3),v23(3),r212,r232,r21,r23,time

call random_number(v21)
call random_number(v23)

do i=1,1000000
r212=dot_product(v21,v21)
r232=dot_product(v23,v23)
end do

call cpu_time(time)

write(*,*) 'Products = ',r212,r232
write(*,*) 'CPU time = ',time

end
---------------------------------------------------
program teste

implicit none

integer i
real v21(3),v23(3),r212,r232,r21,r23,time,dot

call random_number(v21)
call random_number(v23)

do i=1,1000000
r212=dot(v21,v21)
r232=dot(v23,v23)
end do

call cpu_time(time)

write(*,*) 'Products = ',r212,r232
write(*,*) 'CPU time = ',time

end

real function dot(v1,v2)

implicit none

real v1(3),v2(3)

dot = v1(1)*v2(1) + v1(2)*v2(2) + v1(3)*v2(3)

end function
-----------------------------------------------

The first one is compiled with
ifort teste-blas.f90 -L/opt/intel/mkl/8.1/lib/em64t -lmkl -lguide -lpthread

while the last two are compiled simply with
ifort teste.f90

The CPU times are intriguing. With the BLAS function 'sdot' the program takes 0.48 s, while with the intrinsic 'dot_product' it takes only 0.004 s. Just as intriguing, my own function 'dot' takes the same 0.004 s, which suggests the intrinsic Fortran function does much the same thing under the hood.

The question is: why does the BLAS routine do such a bad job when it is supposed to improve performance over ordinary routines?
Am I doing something wrong when compiling?

The system I'm running these tests on is a dual-core AMD64, and I'm using Intel Fortran 9.1.036 and MKL 8.1.

Thanks in advance,
Herbert

TimP · 07-01-2007

I'm sure it is written somewhere that you can prove anything with a benchmark. In your benchmark, in the latter two cases, the loop body does not depend on the loop index, so the compiler can see that only a single loop iteration is required. The call to the external sdot cannot be eliminated that way, and BLAS was never advertised as a way to optimize a short dot product. It generally takes several thousand operations per call to a BLAS function to make it competitive with in-line code.
In case it matters, cpu_time is generally recognized as not having a precise start time (consult one of the Metcalf et al. "Fortran nn/nn Explained" texts), and it should time in increments of 0.010 second. In your fast cases, you should find the same result from cpu_time if you move the call ahead of your pseudo-loop.
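
To illustrate both points, here is a minimal sketch of a timing harness, not the original poster's code. It calls cpu_time before and after the loop and reports the difference, and it changes one vector element on every iteration so the dot product depends on the loop index. The names time_dot, reps, acc, t0 and t1 are made up for this example, and an aggressive optimizer may still transform the loop, so treat the result as indicative only.

```fortran
program time_dot
  implicit none
  integer, parameter :: n = 3, reps = 1000000
  real :: v(n), t0, t1
  double precision :: acc
  integer :: i

  call random_number(v)
  acc = 0.0d0

  ! Take a reading before the loop so only the loop itself is timed.
  call cpu_time(t0)
  do i = 1, reps
     ! Make the input depend on i so the body is not loop-invariant.
     v(1) = real(i) / real(reps)
     ! Accumulate every result so no iteration can be discarded as dead code.
     acc = acc + dot_product(v, v)
  end do
  call cpu_time(t1)

  ! Printing the checksum keeps the accumulated value live.
  write(*,*) 'Checksum = ', acc
  write(*,*) 'CPU time = ', t1 - t0
end program time_dot
```

Given the coarse resolution of cpu_time, the repetition count would probably need to be raised until the measured interval is well above one timer increment.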

hcgeorg · 07-02-2007

Thanks for the tips, Tim.

I have run some more rigorous tests, averaging the times over 5 runs, for vectors of size 1 to 300.
It seems that, in my case, the BLAS routine is not advisable for vectors of up to 40 elements. Additionally, when I generalized the dot product in my own routine to any size by putting a loop in it, it performed worse than the intrinsic dot_product for all vector sizes.
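
The generalized routine was not posted, so the following is only a guess at what a loop-based version might look like. The names dot_n, m and k and the test size of 40 are assumptions for illustration. The program checks the loop result against the intrinsic dot_product and does no timing.

```fortran
program dot_general
  implicit none
  integer, parameter :: n = 40
  real :: a(n), b(n), r_loop, r_intr

  call random_number(a)
  call random_number(b)

  r_loop = dot_n(n, a, b)
  r_intr = dot_product(a, b)

  write(*,*) 'loop      = ', r_loop
  write(*,*) 'intrinsic = ', r_intr

  ! The two results may differ slightly because of summation order.
  if (abs(r_loop - r_intr) > 1.0e-3) stop 'mismatch'

contains

  ! Dot product of two vectors of length m using an explicit loop.
  real function dot_n(m, v1, v2)
    integer, intent(in) :: m
    real, intent(in) :: v1(m), v2(m)
    integer :: k

    dot_n = 0.0
    do k = 1, m
       dot_n = dot_n + v1(k) * v2(k)
    end do
  end function dot_n

end program dot_general
```

Making the function an internal procedure gives the compiler an explicit interface and a chance to inline it, which may or may not match how the routine in the tests above was built.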

Intel_C_Intel · 07-02-2007

Tim has answered your questions quite well. Your observation about a break-even point is well taken. I would also add that for simple functions like the dot product, compilers should produce highly optimized code. In fact, MKL uses the compiler on some of the level 1 and level 2 BLAS.

Bruce

hcgeorg · 07-02-2007

Thanks MAD. I'll keep this in mind.