Ok, so I'm doing some tests with BLAS routines. I wrote a test program in Fortran 95 that performs 2 million dot products in three ways: using the BLAS routine 'sdot', using the intrinsic Fortran function 'dot_product', and using a function of my own called 'dot'. The code is below:
--------------------------------------
program teste_blas
implicit none
integer i
real v21(3),v23(3),r212,r232,r21,r23,time,sdot
external sdot
call random_number(v21)
call random_number(v23)
do i=1,1000000
r212=sdot(3,v21,1,v21,1)
r232=sdot(3,v23,1,v23,1)
end do
call cpu_time(time)
write(*,*) 'Products = ',r212,r232
write(*,*) 'CPU time = ',time
end
-------------------------------------------
program teste
implicit none
integer i
real v21(3),v23(3),r212,r232,r21,r23,time
call random_number(v21)
call random_number(v23)
do i=1,1000000
r212=dot_product(v21,v21)
r232=dot_product(v23,v23)
end do
call cpu_time(time)
write(*,*) 'Products = ',r212,r232
write(*,*) 'CPU time = ',time
end
---------------------------------------------------
program teste
implicit none
integer i
real v21(3),v23(3),r212,r232,r21,r23,time,dot
call random_number(v21)
call random_number(v23)
do i=1,1000000
r212=dot(v21,v21)
r232=dot(v23,v23)
end do
call cpu_time(time)
write(*,*) 'Products = ',r212,r232
write(*,*) 'CPU time = ',time
end
real function dot(v1,v2)
implicit none
real v1(3),v2(3)
dot = v1(1)*v2(1) + v1(2)*v2(2) + v1(3)*v2(3)
end function
-----------------------------------------------
The first one is compiled with
ifort teste-blas.f90 -L/opt/intel/mkl/8.1/lib/em64t -lmkl -lguide -lpthread
while the last two are compiled simply with
ifort teste.f90
The CPU times are intriguing. With the BLAS function 'sdot' the program takes 0.48 s, while with the intrinsic 'dot_product' it takes only 0.004 s. Also intriguing: with my own function 'dot' it takes the same 0.004 s, which suggests that the intrinsic Fortran function does much the same thing under the hood.
Question is: why does the BLAS routine do such a bad job when it is supposed to improve performance over normal routines?
Am I doing something wrong during the compilation?
The system I'm running these tests on is an AMD64 dual core, and I'm using Intel Fortran 9.1.036 and MKL 8.1.
Thanks in advance,
Herbert
4 Replies
I'm sure it is written somewhere that you can prove anything with a benchmark. In your benchmark, in the latter two cases, the compiler can see that only a single loop iteration is required, while BLAS was never advertised as a way to optimize a short dot product. It generally takes several thousand operations per call to a BLAS function to make it competitive with in-line code.
In case it matters, cpu_time is generally recognized as not having a precise start time (consult one of the Metcalf et al "fortran nn/nn explained" texts), and it should time in increments of 0.010 second. In your fast cases, you should find the same result from cpu_time if you move it ahead of your pseudo-loop.
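Both points can be folded into the test program. The following is only a sketch, not a definitive benchmark: the program name, the variable names and the way the input is perturbed are illustrative assumptions, not part of the original post. It reads cpu_time before and after the loop and reports the difference, and it changes the input and accumulates the result on every pass so the dot product is no longer loop-invariant.

```fortran
program time_dot
  implicit none
  integer, parameter :: n = 3, nrep = 1000000
  real :: v(n), acc, t0, t1
  integer :: i

  call random_number(v)
  acc = 0.0

  call cpu_time(t0)                 ! reading taken just before the loop
  do i = 1, nrep
     v(1) = real(i) * 1.0e-6        ! input differs on every pass
     acc = acc + dot_product(v, v)  ! every pass contributes to the result
  end do
  call cpu_time(t1)                 ! reading taken just after the loop

  write(*,*) 'Checksum = ', acc     ! printing acc keeps the loop results in use
  write(*,*) 'CPU time = ', t1 - t0 ! elapsed CPU time for the loop only
end program time_dot
```

The same skeleton can be reused for 'sdot' or a hand-written function by swapping the call inside the loop. With timer increments as coarse as described above, the repeat count may need raising until the measured interval is well above the timer resolution.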
Thanks for the tips, Tim.
I have done some more rigorous tests, averaging out the times over 5 runs, and run for vectors of size 1 to 300.
It seems in my case that, for vectors with up to 40 elements, it is not advisable to use the blas routine. Additionally, when I generalized the vector product inside my own routine for any size, by placing a loop in it, it did a worse job than the intrinsic dot_product routine for all vector sizes.
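For readers who want to try a similar sweep, here is a rough sketch under stated assumptions; it is not the code used for the measurements above. The names 'dot_loop' and 'sweep_dot' and the repeat count are hypothetical, and the BLAS call is left out so the program builds without MKL. It times a loop-based dot product against the intrinsic for vector lengths 1 to 300.

```fortran
program sweep_dot
  implicit none
  integer, parameter :: nmax = 300, nrep = 20000
  real :: a(nmax), b(nmax), acc_loop, acc_intr, t0, t1, t2
  integer :: n, i

  call random_number(a)
  call random_number(b)

  do n = 1, nmax
     acc_loop = 0.0
     acc_intr = 0.0

     call cpu_time(t0)
     do i = 1, nrep
        a(1) = real(i) * 1.0e-6     ! change the input on every pass
        acc_loop = acc_loop + dot_loop(n, a, b)
     end do
     call cpu_time(t1)
     do i = 1, nrep
        a(1) = real(i) * 1.0e-6
        acc_intr = acc_intr + dot_product(a(1:n), b(1:n))
     end do
     call cpu_time(t2)

     ! columns: length, loop time, intrinsic time, and the two checksums
     write(*,'(i4,2es12.4,2es14.6)') n, t1 - t0, t2 - t1, acc_loop, acc_intr
  end do

contains

  ! Generalized dot product written as an explicit loop over n elements.
  real function dot_loop(n, x, y)
    integer, intent(in) :: n
    real, intent(in) :: x(n), y(n)
    integer :: k
    dot_loop = 0.0
    do k = 1, n
       dot_loop = dot_loop + x(k) * y(k)
    end do
  end function dot_loop

end program sweep_dot
```

Averaging over several runs, as described above, still applies. To include BLAS in the comparison, a third timed loop calling 'sdot(n,a,1,b,1)' could be added and the program linked against MKL as in the original compile line.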
Tim has answered your questions quite well. Your observation about a break-even point is well taken. I would also add that for simple functions like the dot product, compilers should produce highly optimized code. In fact, MKL uses the compiler on some of the level 1 and level 2 BLAS.
Bruce
Thanks MAD. I'll keep this in mind.