Hi all,
I have some questions about recursive function performance in the compiler.
The code is below; the performance table was obtained with compiler version 14. The first column is pass by reference (the default), the second column is pass by value.
ifort is slower than gfortran (4.8.2). Is passing by reference creating more copies in the recursive case?
I don't know if performance is better with an older compiler version.
Thanks.
            by reference   by value
ifort       3.51 s         1.151 s
gfortran    2.027 s        0.452 s
[fortran]
program ackermann
  implicit none
  write(*,*) ack(3, 12)
contains
  recursive function ack(m, n) result(a)
    integer, intent(in) :: m, n
    integer :: a
    if (m == 0) then
      a = n + 1
    else if (n == 0) then
      a = ack(m-1, 1)
    else
      a = ack(m-1, ack(m, n-1))
    end if
  end function ack
end program ackermann
[/fortran]
Internal procedures are semantically distinct from module procedures, and BIND(C) procedures may handle the VALUE attribute differently as well. Have you measured the effect of making function ack a module procedure and of giving it the BIND(C) attribute (8 tests in all)?
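For reference, one of those variants might look like the minimal sketch below (ack moved into a module; the module name ack_mod is only illustrative). The BIND(C) and VALUE variants would change only the function suffix and the dummy declarations:
[fortran]
! Variant: ack as a module procedure instead of an internal procedure.
module ack_mod
  implicit none
contains
  recursive function ack(m, n) result(a)
    integer, intent(in) :: m, n
    integer :: a
    if (m == 0) then
      a = n + 1
    else if (n == 0) then
      a = ack(m-1, 1)
    else
      a = ack(m-1, ack(m, n-1))
    end if
  end function ack
end module ack_mod

program ackermann
  use ack_mod
  implicit none
  write(*,*) ack(3, 12)
end program ackermann
[/fortran]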
Putting the function in a module, with or without bind(C), doesn't change the results.
The performance gap between value and no value, and between ifort and gfortran, is still there.
I tried this with gfortran 4.8.2, and ifort 14.0.2 was faster by about 10%. But this is hardly representative of a real application. Is this just a question of academic interest?
Can you please provide your options? In which case was ifort faster, with or without value?
I'm running on an Intel(R) Core(TM) i7-3520M; I compile with -O3 and -fast.
No, it's out of pure curiosity and interest in the language, and as I'm used to seeing ifort being faster, I was surprised. Also, I'd like to know why passing by value is faster in this case, so maybe I can use it in my real codes.
Thanks.
I just used -O2 on a Core i7 965 (Nehalem). I didn't try adding VALUE. You didn't show how you changed the code to pass and receive by value - I would not normally expect that to make any noticeable difference. Perhaps something else is going on.
Were you building 32-bit or 64-bit? If 64-bit, I could understand VALUE being faster due to the arguments being passed in registers rather than on the stack.
I am on 64-bit. Here is how I passed by value:
integer, intent(in), value :: m, n
The timings:
                          -O2       -O3
gfortran, without value   3.598 s   2.022 s
gfortran, with value      0.767 s   0.458 s
ifort, without value      3.579 s   3.572 s
ifort, with value         0.964 s   0.972 s
Pat.
With the VALUE attribute, the m-1 and n-1 in "a=ack(m-1, ack(m, n-1))" need not be written to temporary stack variables in the caller (to be passed by reference to the callee). This should eliminate two writes and two reads per call (assuming fastcall is being used; x64 uses fastcall by default, 32-bit depends on options).
Jim Dempsey
Timing a function like this example is bad practice, since it leads one to bad assumptions (generalities) about optimizations in total. If you are really concerned about performance for this function, then you would make it an iterative process as opposed to a recursive process. Doing so would make it much more efficient (execution-wise).
The following is a C++ example illustrating this (I will let you convert this to Fortran):
[cpp]
// Serial recursive method to calculate Fibonacci series
long SerialFib( long n ) {
    if( n<2 )
        return n;
    else
        return SerialFib(n-1)+SerialFib(n-2);
}

// Serial non-recursive method to calculate Fibonacci series
// *** Note, several orders of magnitude faster than all
// *** recursive techniques.
long SerialFib2( long n ) {
    if( n<2 ) return n;
    long fib0 = 1;  // most recent
    long fib1 = 1;  // 1 prior to most recent
    long fib2 = 0;  // 2 prior to most recent
    long i;
    for(i=2; i<n; ++i) {
        fib2 = fib1;
        fib1 = fib0;
        fib0 = fib1 + fib2;
    }
    return fib0;
}
[/cpp]
Jim Dempsey
That's not enough to pass by value in standard Fortran. ifort has a bug where it DOES cause pass/receive by value; I didn't think gfortran had the same bug. We do it the right way if you say -standard-semantics, which is to pass an anonymous copy by reference. You'd need to add BIND(C) to have VALUE mean pass-by-value. Looking at gfortran's generated code with VALUE, I do see it putting the argument directly into the registers rather than into a temporary. Maybe it figures that it can see the whole program, so it's safe to do so.
That said, I do see that adding VALUE does significantly improve the performance if -standard-semantics is not used.
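For illustration, a minimal sketch of the BIND(C) plus VALUE combination (so that VALUE requests genuine pass-by-value); the module name and the iso_c_binding kinds here are assumptions, not from the original post:
[fortran]
! Sketch: BIND(C) + VALUE so the dummies are received by value.
module ack_c_mod
  use, intrinsic :: iso_c_binding, only: c_int
  implicit none
contains
  recursive function ack(m, n) result(a) bind(C, name="ack")
    integer(c_int), value :: m, n
    integer(c_int) :: a
    if (m == 0) then
      a = n + 1
    else if (n == 0) then
      a = ack(m-1, 1)
    else
      a = ack(m-1, ack(m, n-1))
    end if
  end function ack
end module ack_c_mod
[/fortran]
The same driver as before (write(*,*) ack(3, 12)) can be used to call it.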
Anyway, here are the timings I get:
gfortran with and without VALUE: 5.700 6.055
ifort with and without VALUE: 1.438 4.714
How did you do your timings? I added code to the program using SYSTEM_CLOCK like this:
[fortran]
program ackermann
  implicit none
  integer(8) start, end, rate
  write (*,*) "Start"
  call system_clock(start)
  write(*,*) ack(3, 12)
  call system_clock(end, rate)
  write (*,*) "Time: ", real(end-start)/real(rate)
contains
  recursive function ack(m, n) result(a)
    integer, intent(in) :: m, n
    integer :: a
    if (m == 0) then
      a = n + 1
    else if (n == 0) then
      a = ack(m-1, 1)
    else
      a = ack(m-1, ack(m, n-1))
    end if
  end function ack
end program ackermann
[/fortran]
Thanks for the explanation. Actually, I'm not sure how easy the Ackermann function is to serialize. Also, I understand changing to a faster algorithm, but that's not the point here.
I use my own time shell function, but here are the results I got with your code:
gfortran -O3 test.f90
./a.out
Start
32765
Time: 2.02059364
ifort -O3 test.f90
./a.out
Start
32765
Time: 3.741407
With value :
gfortran -O3 test.f90
./a.out
Start
32765
Time: 0.458827287
ifort -O3 test.f90
./a.out
Start
32765
Time: 1.066044
gfortran -v
Using built-in specs.
COLLECT_GCC=gfortran
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-unknown-linux-gnu/4.8.2/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: /build/gcc/src/gcc-4.8-20140206/configure --prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=https://bugs.archlinux.org/ --enable-languages=c,c++,ada,fortran,go,lto,objc,obj-c++ --enable-shared --enable-threads=posix --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-clocale=gnu --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object --enable-linker-build-id --enable-cloog-backend=isl --disable-cloog-version-check --enable-lto --enable-plugin --enable-install-libiberty --with-linker-hash-style=gnu --disable-multilib --disable-werror --enable-checking=release
Thread model: posix
gcc version 4.8.2 20140206 (prerelease) (GCC)
ifort -v
ifort version 14.0.1
Wrote up a Fortran variation:
[fortran]
! Fibonacci.f90
module Fibs
contains
  recursive integer(8) function SerialFib( n )
    implicit none
    integer :: n
    if( n .lt. 2) then
      SerialFib = n
    else
      SerialFib = SerialFib(n-1) + SerialFib(n-2)
    endif
  end function SerialFib

  integer(8) function SerialFib2( n )
    implicit none
    integer :: n, i
    integer(8) :: fib0, fib1, fib2
    if(n .lt. 2) then
      SerialFib2 = n
    else
      fib0 = 1
      fib1 = 1
      fib2 = 0
      do i=2,n-1
        fib2 = fib1
        fib1 = fib0
        fib0 = fib1 + fib2
      enddo
      SerialFib2 = fib0
    endif
  end function SerialFib2
end module Fibs

program Fibonacci
  use omp_lib   ! omp_get_wtime()
  use Fibs
  implicit none
  real(8) :: t1, t2, t3
  integer(8) :: fib, fib2
  integer :: in
  in = 45
  t1 = omp_get_wtime()
  fib = SerialFib(in)
  t1 = omp_get_wtime() - t1
  print *, in, fib, t1
  t2 = omp_get_wtime()
  fib2 = SerialFib2(in)
  t2 = omp_get_wtime() - t2
  print *, in, fib2, t2
  in = in * 2
  t3 = omp_get_wtime()
  fib2 = SerialFib2(in)
  t3 = omp_get_wtime() - t3
  print *, in, fib2, t3
end program Fibonacci
[/fortran]
45 1134903170          9.69680809543934
45 1134903170          0.000000000000000E+000
90 2880067194370816120 3.001186996698380E-007
Press any key to continue . . .
Running the recursive version with an input of 45 took 9.7 seconds; the iterative version was too short to measure.
An input of 90 would take too long to run recursively and would likely exhaust the stack; the iterative method is barely measurable using omp_get_wtime().
This example is computationally equivalent to the Ackermann example provided.
Jim Dempsey
pat97269 wrote:
Actually, I'm not sure how easy the Ackermann function is to serialize.
I remember my sophomore CS instructor giving us an assignment to write a serial version of the Ackermann function. Drove me crazy. I never forgave him when he later told us that it was not possible. See http://en.wikipedia.org/wiki/Ackermann_function
I clearly understand that some algorithms are faster than others. But that doesn't remove the point that in this case, on my processor, the performance is surprising, and as I understand it, passing by value speeds up the computation a little bit here. It doesn't explain the difference in performance from Steve's results.
I will try to run it on another processor and see.
Thanks
For what it is worth, here is an assembler version of the ack(m,n) function, callable from C. Other than depending on the recursion relations to compute the function, it contains 16 instructions, of which only three access memory and consume stack. On a C2D E8400 3 GHz CPU OpenSuse system, the program computed ack(3,12) in 1.4 seconds.
[bash]
# int ack(int m,int n), 64-bit GCC calling convention
#
.text
.globl ack
ack: # m in edi, n in esi
test %edi,%edi # m==0?
jz l1
test %esi,%esi # n==0?
jz l2
dec %esi # n--
push %rdi # save m; no need to save n
call ack # ack(m,n-1)
pop %rdi # restore m
mov %eax, %esi # ack(m,n-1)
dec %edi # m-1
jmp ack # ack(m-1,ack(m,n-1))
l1:
leal 1(%esi),%eax # m=0: return n+1
ret
l2:
dec %edi # n=0; return ack(m-1,1)
inc %esi
jmp ack
#
#END
[/bash]
Compiler options are irrelevant, since the driver (see below) consumes very little time (parsing the command arguments, calling the function, and printing the result).
[cpp]
#include <stdio.h>
#include <stdlib.h>

extern int ack(int m, int n);

int main(int argc, char *argv[]) {
    int m, n;
    if (argc != 3) {
        fprintf(stderr, "Usage: ackd <m> <n>\n");
        exit(1);
    }
    m = atoi(argv[1]);
    n = atoi(argv[2]);
    printf("%2d %2d %d\n", m, n, ack(m, n));
    return 0;
}
[/cpp]
mecej4,
Great post. Looking at the assembler code, I think one can convert it into an arguably iterative process. Here is my thought (not tested in code):
On entry to ack:, copy %esp to %ecx (or %rcx when coding x64).
After that, add the label ack_loop:.
Replace the "call ack" with "jmp ack_loop".
Move the code following the former "call ack" to just before the "ret".
Following the moved code (before the ret), add "pop %rdi; cmp %ecx,%esp; jnz ack_loop".
Effectively, what you are doing is replacing the call stack with a data stack. The function code is iterative; however, it uses a stack-like data structure (which resides on the stack for storage).
So does this then exemplify an iterative process that implements the Ackermann function?
--- Edit ---
Untested code:
[bash]
# int ack(int m,int n), 64-bit GCC calling convention
#
.text
.globl ack
ack: # m in edi, n in esi
mov %esp,%ecx
ack_loop:
test %edi,%edi # m==0?
jz l1
test %esi,%esi # n==0?
jz l2
dec %esi # n--
push %rdi # save m; no need to save n
jmp ack_loop # ack(m,n-1)
l1:
leal 1(%esi),%eax # m=0: return n+1
pop %rdi # restore m
mov %eax, %esi # ack(m,n-1)
dec %edi # m-1
cmp %ecx,%esp
jne ack_loop # ack(m-1,ack(m,n-1))
ret
l2:
dec %edi # n=0; return ack(m-1,1)
inc %esi
jmp ack_loop
#
#END
[/bash]
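For comparison, here is a rough Fortran rendition of the same explicit-stack idea (a sketch only, untested like the assembler above; the fixed stack size of 100000 and the names ack_iter_mod/ack_iter are my own assumptions, and the size happens to cover ack(3,12)):
[fortran]
! Iterative Ackermann: an explicit array replaces the call stack.
module ack_iter_mod
  implicit none
contains
  function ack_iter(m0, n0) result(a)
    integer, intent(in) :: m0, n0
    integer :: a
    integer, parameter :: max_depth = 100000  ! assumed large enough for ack(3,12)
    integer :: stack(max_depth)
    integer :: top, m, n
    top = 0
    m = m0
    n = n0
    do
      if (m == 0) then
        a = n + 1
        if (top == 0) return      ! data stack empty: done
        m = stack(top) - 1        ! pop pending m; continue as ack(m-1, a)
        top = top - 1
        n = a
      else if (n == 0) then
        m = m - 1                 ! ack(m, 0) = ack(m-1, 1)
        n = 1
      else
        top = top + 1             ! remember m, evaluate ack(m, n-1) first
        stack(top) = m
        n = n - 1
      end if
    end do
  end function ack_iter
end module ack_iter_mod

program test_ack_iter
  use ack_iter_mod
  implicit none
  write(*,*) ack_iter(3, 12)      ! expected: 32765
end program test_ack_iter
[/fortran]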
Jim Dempsey
Looking at the code, you could optimize it to:
[bash]
# int ack(int m,int n), 64-bit GCC calling convention
#
.text
.globl ack
ack: # m in edi, n in esi
mov %esp,%ecx
ack_loop:
test %edi,%edi # m==0?
jz l1
test %esi,%esi # n==0?
jz l2
dec %esi # n--
push %rdi # save m; no need to save n
jmp ack_loop # ack(m,n-1)
l1:
inc %esi # m=0: return n+1
pop %rdi # restore m
# ack(m,n-1)
dec %edi # m-1
cmp %ecx,%esp
jne ack_loop # ack(m-1,ack(m,n-1))
ret
l2:
dec %edi # n=0; return ack(m-1,1)
inc %esi
jmp ack_loop
#
#END
[/bash]
Jim Dempsey
Jim,
Thanks for your comments. Indeed, as you suggest, a register such as RCX can be used to save the value of RSP at function entry, with the RET instruction executed only if RSP and RCX match just before the RET.
Alternatively, the recursion depth can be held in a register (such as ECX/RCX, again) and the only remaining RET instruction can be preceded by a LOOP instruction. For users' convenience, I include this version below. There is no need for any CALL instruction, but the program still assumes the availability of a stack of sufficient size.
[bash]
# int ack(int m,int n), 64-bit GCC calling convention
#
.text
.globl ack
ack: # m in edi, n in esi
movl $1,%ecx # recursion depth + 1
ack0:
test %edi,%edi # m==0?
jz l1
test %esi,%esi # n==0?
jz l2
dec %esi # n--
push %rdi # save m; no need to save n
inc %ecx # recursion depth
jmp ack0 # ack(m,n-1)
res:
pop %rdi # restore m
mov %eax, %esi # ack(m,n-1)
dec %edi # m-1
jmp ack0 # ack(m-1,ack(m,n-1))
l1:
leal 1(%esi),%eax # m=0: return n+1
loop res # if recursion depth is not zero
ret
l2:
dec %edi # n=0; return ack(m-1,1)
inc %esi
jmp ack0
#
#END
[/bash]
If a stack is used, it's still recursive even without an explicit call.
>>If a stack is used, it's still recursive even without an explicit call.
That is why I called it arguably iterative.
I will say that in an abstract sense the formal definition of the Ackermann function is recursive; however, I would also argue that the above code is iterative. IMHO.
Jim Dempsey
I tried that argument on my CS professor - didn't fly.
