Intel® Fortran Compiler

Recursive function and pass by VALUE optimizations

pat97269
Beginner
5,486 Views

Hi all,

I have some questions about the performance of recursive functions in the compiler.

The code is below; the timing table was obtained with compiler version 14. The first column is the normal (pass-by-reference) case, the second column is pass by value.

ifort is slower than gfortran (4.8.2). Is passing by reference creating more copies in the recursive case?

I don't know whether performance is better with an older compiler version.

Thanks.

                   by reference   by value
ifort (14.0)          3.51 s      1.151 s
gfortran (4.8.2)      2.027 s     0.452 s

program ackermann
   implicit none
   write(*,*) ack(3, 12)
contains
   recursive function ack(m, n) result(a)
      integer, intent(in) :: m, n
      integer :: a

      if (m == 0) then
         a = n + 1
      else if (n == 0) then
         a = ack(m-1, 1)
      else
         a = ack(m-1, ack(m, n-1))
      end if
   end function ack
end program ackermann

 

22 Replies
JVanB
Valued Contributor II
4,474 Views

Internal procedures are semantically distinct from module procedures, and BIND(C) procedures may handle the VALUE attribute differently as well. Have you measured the effect of making function ack a module procedure and of giving it the BIND(C) attribute (8 tests in all)?
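
For anyone wanting to run those tests, here is a minimal sketch of the module-procedure variant as a starting skeleton (illustrative only; the VALUE and BIND(C) combinations are obtained by adding those attributes to the function):

module ack_mod
   implicit none
contains
   ! Add BIND(C) to the function statement and/or VALUE to the dummies
   ! to produce the other variants to be timed.
   recursive function ack(m, n) result(a)
      integer, intent(in) :: m, n
      integer :: a
      if (m == 0) then
         a = n + 1
      else if (n == 0) then
         a = ack(m-1, 1)
      else
         a = ack(m-1, ack(m, n-1))
      end if
   end function ack
end module ack_mod

program ackermann
   use ack_mod
   implicit none
   write(*,*) ack(3, 12)
end program ackermann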

pat97269
Beginner
4,474 Views

 

Putting the function in a module with or without bind(C) doesn't change the results.

The performance gap between VALUE and no VALUE, and between ifort and gfortran, is still there.

 

 

 

Steven_L_Intel1
Employee
4,474 Views

I tried this with gfortran 4.8.2, and ifort 14.0.2 was faster by about 10%. But this is hardly representative of a real application. Is this just a question of academic interest?

pat97269
Beginner
4,474 Views

 

Can you please provide your options? In which case was ifort faster, with or without VALUE?

I'm running on an Intel(R) Core(TM) i7-3520M; I compile with -O3 and -fast.

No, it's out of pure curiosity and interest in the language; as I'm used to seeing ifort being faster, I was surprised. Also, I'd like to know why passing by value is faster in this case, so maybe I can use it in my real codes.

Thanks.

Steven_L_Intel1
Employee
4,474 Views

I just used -O2 on a Core i7 965 (Nehalem). I didn't try adding VALUE. You didn't show how you changed the code to pass and receive by value - I would not normally expect that to make any noticeable difference. Perhaps something else is going on.

Were you building 32-bit or 64-bit? If 64-bit, I could understand VALUE being faster due to the arguments being passed in registers rather than on the stack.

pat97269
Beginner
4,474 Views

 

I am on 64-bit. Here is how I pass by value:

integer, intent(in),value :: m,n

Without VALUE, gfortran -O2 gives 3.598 s and -O3 gives 2.022 s.

With VALUE, gfortran -O2 gives 0.767 s and -O3 gives 0.458 s.

Without VALUE, ifort -O2 gives 3.579 s and -O3 gives 3.572 s.

With VALUE, ifort -O2 gives 0.964 s and -O3 gives 0.972 s.

Pat.

jimdempseyatthecove
Honored Contributor III
4,474 Views

With the VALUE attribute, the m-1 and n-1 in "a=ack(m-1, ack(m, n-1))" need not be written to temporary stack variables in the caller (to be passed by reference to the callee). That should eliminate two writes and two reads per call (assuming fast call is used; x64 uses fast call by default, 32-bit depends on options).

Jim Dempsey

jimdempseyatthecove
Honored Contributor III
4,474 Views

Timing a function like this example is bad practice, since it leads one to bad assumptions (generalities) about optimizations overall. If you are really concerned about performance, for this function, then you would make it an iterative process as opposed to a recursive process. Doing so would make it much more efficient (execution-wise).

The following is a C++ example illustrating this (I will let you convert this to Fortran):


// Serial recursive method to calculate Fibonacci series
long SerialFib( long n )
{
 if( n<2 )
  return n;
 else
  return SerialFib(n-1)+SerialFib(n-2);
}

// Serial non-recursive method to calculate Fibonacci series
// *** Note, several orders of magnitude faster than all
// *** recursive techniques.
long SerialFib2( long n )
{
 if( n<2 )
  return n;
 long fib0 = 1; // most recent
 long fib1 = 1; // 1 prior to most recent
 long fib2 = 0; // 2 prior to most recent
 long i;
 for(i=2; i<n; ++i)
 {
  fib2 = fib1;
  fib1 = fib0;
  fib0 = fib1 + fib2;
 }
 return fib0;
}

Jim Dempsey

Steven_L_Intel1
Employee
4,474 Views

That's not enough to pass by value in standard Fortran. ifort has a bug where it DOES cause pass/receive by value; I didn't think gfortran had the same bug. We do it the right way if you say -standard-semantics, which is to pass an anonymous copy by reference. You'd need to add BIND(C) to have VALUE mean pass-by-value. Looking at gfortran's generated code, with VALUE, I do see it putting the argument directly into the registers rather than into a temporary. Maybe it figures that it can see the whole program, so it's safe to do so.

That said, I do see that adding VALUE does significantly improve the performance if -standard-semantics is not used.
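
As a concrete illustration of that combination (a sketch, not code from this thread): with BIND(C) added, the VALUE attribute requests true pass-by-value, so a standard-conforming version of what is being measured looks roughly like the module below; drop it in place of the internal procedure and call it from the program as before.

module ack_c_mod
   use iso_c_binding, only: c_int
   implicit none
contains
   ! With BIND(C), VALUE means genuine pass-by-value
   ! (typically in registers on x86-64).
   recursive function ack(m, n) result(a) bind(c)
      integer(c_int), value :: m, n
      integer(c_int) :: a
      if (m == 0) then
         a = n + 1
      else if (n == 0) then
         a = ack(m - 1, 1_c_int)
      else
         a = ack(m - 1, ack(m, n - 1))
      end if
   end function ack
end module ack_c_mod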

Anyway, here are the timings I get:

gfortran with and without VALUE: 5.700 6.055

ifort with and without VALUE: 1.438 4.714

How did you do your timings? I added code to the program using SYSTEM_CLOCK like this:

program ackermann
   implicit none
   integer(8) start, end, rate
   write (*,*) "Start"
   call system_clock(start)
   write (*,*) ack(3, 12) ! 12)
   call system_clock(end, rate)
   write (*,*) "Time: ", real(end-start)/real(rate)
contains
   recursive function ack(m, n) result(a)
      integer, intent(in) :: m, n
      integer :: a

      if (m == 0) then
         a = n + 1
      else if (n == 0) then
         a = ack(m-1, 1)
      else
         a = ack(m-1, ack(m, n-1))
      end if
   end function ack
end program ackermann

 

pat97269
Beginner
4,474 Views

 

Thanks for the explanation. Actually I'm not sure how easy the ackermann function is to serialize. Also, I understand switching to a faster algorithm, but that's not the point here.

I used my shell's time function, but here are the results I got with your code:

gfortran -O3 test.f90
./a.out
 Start
       32765
 Time:    2.02059364    
ifort -O3 test.f90
 ./a.out
 Start
       32765
 Time:    3.741407    

With value :

gfortran -O3 test.f90
 ./a.out
 Start
       32765
 Time:   0.458827287    
 ifort -O3 test.f90
 ./a.out
 Start
       32765
 Time:    1.066044    


 gfortran -v
Using built-in specs.
COLLECT_GCC=gfortran
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-unknown-linux-gnu/4.8.2/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: /build/gcc/src/gcc-4.8-20140206/configure --prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=https://bugs.archlinux.org/ --enable-languages=c,c++,ada,fortran,go,lto,objc,obj-c++ --enable-shared --enable-threads=posix --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-clocale=gnu --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object --enable-linker-build-id --enable-cloog-backend=isl --disable-cloog-version-check --enable-lto --enable-plugin --enable-install-libiberty --with-linker-hash-style=gnu --disable-multilib --disable-werror --enable-checking=release
Thread model: posix
gcc version 4.8.2 20140206 (prerelease) (GCC)

ifort -v
ifort version 14.0.1

 

jimdempseyatthecove
Honored Contributor III
4,474 Views

Wrote up a Fortran variation:

!  Fibonacci.f90 
module Fibs
contains
recursive integer(8) function SerialFib( n )
    implicit none
    integer :: n
    if( n .lt. 2) then
        SerialFib = n
    else
        SerialFib = SerialFib(n-1) + SerialFib(n-2)
    endif
end function SerialFib

integer(8) function SerialFib2( n )
    implicit none
    integer :: n, i
    integer(8) :: fib0, fib1, fib2
    if(n .lt. 2) then
        SerialFib2 = n
    else
        fib0 = 1
        fib1 = 1
        fib2 = 0
        do i=2,n-1
            fib2 = fib1
            fib1 = fib0
            fib0 = fib1 + fib2
        enddo
        SerialFib2 = fib0
    endif
    
end function SerialFib2
end module Fibs

program Fibonacci
    use omp_lib ! omp_get_wtime()
    use Fibs
    implicit none

    real(8) :: t1, t2, t3
    integer(8) :: fib, fib2
    integer :: in
    in = 45
    t1 = omp_get_wtime()
    fib = SerialFib(in)
    t1 = omp_get_wtime() - t1
    print *, in, fib, t1

    t2 = omp_get_wtime()
    fib2 = SerialFib2(in)
    t2 = omp_get_wtime() - t2
    print *, in, fib2, t2

    in = in * 2
    t3 = omp_get_wtime()
    fib2 = SerialFib2(in)
    t3 = omp_get_wtime() - t3
    print *, in, fib2, t3

    end program Fibonacci

          45            1134903170   9.69680809543934
          45            1134903170  0.000000000000000E+000
          90   2880067194370816120  3.001186996698380E-007
Press any key to continue . . .

 

Running the recursive version with an input of 45 took 9.7 seconds; the iterative version was too short to measure.
An input of 90 would take too long to run recursively and would likely exhaust the stack; the iterative method is barely measurable using omp_get_wtime().

This example is computationally equivalent to the Ackermann example provided above.

Jim Dempsey

Steven_L_Intel1
Employee
4,474 Views

pat97269 wrote:
Actually I'm not sure how easy the ackermann function is to serialize.

I remember my sophomore CS instructor giving us an assignment to write a serial version of the Ackermann function. Drove me crazy. I never forgave him when he later told us that it was not possible. See http://en.wikipedia.org/wiki/Ackermann_function

pat97269
Beginner
4,474 Views

 

I clearly understand that some algorithms are faster than others. But that doesn't change the point that in this case, on my processor, the performance is surprising, and as I understand it, passing by value here speeds up the computation a little. It doesn't explain the difference from Steve's results.

I will try to run it on another processor and see.

 

Thanks

mecej4
Honored Contributor III
4,474 Views

For what it is worth, here is an assembler version of the ack(m,n) function, callable from C. Other than depending on the recursion relations to compute the function, it contains 16 instructions, of which only three access memory and consume stack. On a C2D E8400 3 GHz CPU OpenSuse system, the program computed ack(3,12) in 1.4 seconds.

[bash]
# int ack(int m,int n), 64-bit GCC calling convention
#
   .text
   .globl ack
ack:                  #   m in edi, n in esi
   test %edi,%edi     # m==0?
   jz l1
   test %esi,%esi     # n==0?
   jz l2
   dec %esi           # n--
   push %rdi          # save m; no need to save n
   call ack           # ack(m,n-1)
   pop %rdi           # restore m
   mov %eax, %esi     # ack(m,n-1)
   dec %edi           # m-1
   jmp ack            # ack(m-1,ack(m,n-1))
l1:
   leal 1(%esi),%eax  # m=0: return n+1
   ret
l2:
   dec %edi           # n=0; return ack(m-1,1)
   inc %esi
   jmp ack
#
#END
[/bash]

Compiler options are irrelevant since the driver (see below) consumes very little time (parsing the command arguments, calling the function, and printing the result).

[cpp]
#include <stdio.h>
#include <stdlib.h>

extern int ack(int m, int n);

int main(int argc, char *argv[])
{
    int m, n;
    if (argc != 3) {
        fprintf(stderr, "Usage: ackd <m> <n>\n");
        exit(1);
    }
    m = atoi(argv[1]);
    n = atoi(argv[2]);
    printf("%2d %2d %d\n", m, n, ack(m, n));
    return 0;
}
[/cpp]

jimdempseyatthecove
Honored Contributor III
4,474 Views

mecej4,

Great post. Looking at the assembler code, I think one can convert it into an arguably iterative process. Here is my thought (not tested in code):

On entry to ack, copy %esp to %ecx (or %rcx when coding x64).
After that, add the label ack_loop:.
Replace the "call ack" with "jmp ack_loop".
Move the code following the former "call ack" to just before the "ret".
Following the moved code (before the ret), add "pop %rdi; cmp %ecx,%esp; jnz ack_loop".

Effectively, you are replacing the call stack with a data stack. The function code is iterative; however, it uses a stack-like data structure (which happens to reside on the hardware stack for storage).

So does this then exemplify an iterative process that implements Ackermann's function?

--- Edit ---

Untested code:

# int ack(int m,int n), 64-bit GCC calling convention
#
   .text
   .globl ack
ack:                  #   m in edi, n in esi
   mov %esp,%ecx
ack_loop:
   test %edi,%edi     # m==0?
   jz l1
   test %esi,%esi     # n==0?
   jz l2
   dec %esi           # n--
   push %rdi          # save m; no need to save n
   jmp ack_loop       # ack(m,n-1)
l1:
   leal 1(%esi),%eax  # m=0: return n+1
   pop %rdi           # restore m
   mov %eax, %esi     # ack(m,n-1)
   dec %edi           # m-1
   cmp %ecx,%esp
   jne ack_loop       # ack(m-1,ack(m,n-1))
   ret
l2:
   dec %edi           # n=0; return ack(m-1,1)
   inc %esi
   jmp ack_loop
#
#END

Jim Dempsey


 

jimdempseyatthecove
Honored Contributor III
4,474 Views

Looking at the code, you could optimize it to:

# int ack(int m,int n), 64-bit GCC calling convention
#
   .text
   .globl ack
ack:                  #   m in edi, n in esi
   mov %esp,%ecx
ack_loop:
   test %edi,%edi     # m==0?
   jz l1
   test %esi,%esi     # n==0?
   jz l2
   dec %esi           # n--
   push %rdi          # save m; no need to save n
   jmp ack_loop       # ack(m,n-1)
l1:
   inc %esi           # m=0: return n+1
   pop %rdi           # restore m
                      # ack(m,n-1)
   dec %edi           # m-1
   cmp %ecx,%esp
   jne ack_loop       # ack(m-1,ack(m,n-1))
   ret
l2:
   dec %edi           # n=0; return ack(m-1,1)
   inc %esi
   jmp ack_loop
#
#END

Jim Dempsey

mecej4
Honored Contributor III
4,474 Views

Jim,

Thanks for your comments. Indeed, as you suggest, a register such as RCX can be used to save the value of RSP at function entry, with the RET instruction executed only if RSP and RCX match just before it.

Alternatively, the recursion depth can be held in a register (such as ECX/RCX, again) and the only remaining RET instruction can be preceded by a LOOP instruction. For users' convenience, I include this version below. There is no need for any CALL instruction, but the program still assumes the availability of a stack of sufficient size.

[bash]
# int ack(int m,int n), 64-bit GCC calling convention
#
   .text
   .globl ack
ack:                  #   m in edi, n in esi
   movl $1,%ecx       # recursion depth + 1
ack0:
   test %edi,%edi     # m==0?
   jz l1
   test %esi,%esi     # n==0?
   jz l2
   dec %esi           # n--
   push %rdi          # save m; no need to save n
   inc %ecx           # recursion depth
   jmp ack0           # ack(m,n-1)
res:
   pop %rdi           # restore m
   mov %eax, %esi     # ack(m,n-1)
   dec %edi           # m-1
   jmp ack0           # ack(m-1,ack(m,n-1))
l1:
   leal 1(%esi),%eax  # m=0: return n+1
   loop res           # if recursion depth is not zero
   ret
l2:
   dec %edi           # n=0; return ack(m-1,1)
   inc %esi
   jmp ack0
#
#END
[/bash]

 

Steven_L_Intel1
Employee
4,474 Views

If a stack is used, it's still recursive even without an explicit call.

jimdempseyatthecove
Honored Contributor III
4,474 Views

>>If a stack is used, it's still recursive even without an explicit call.

That is why I said "arguably".

I will say that, in an abstract sense, the formal description of the Ackermann function is recursive; however, I would also argue that the above code is iterative. IMHO
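
To make the distinction concrete in Fortran terms, here is a minimal sketch (an illustration only, not code posted in this thread) of the same idea the assembler uses: the recursion is replaced by a loop over an explicit, user-managed stack that holds the pending values of m. The fixed size of 100000 is an assumption that comfortably covers the depth required for ack(3, 12).

program ack_stack
   implicit none
   integer, parameter :: maxdepth = 100000  ! assumed deep enough for ack(3,12)
   integer :: stack(maxdepth)
   integer :: sp, m, n
   sp = 0
   m = 3
   n = 12
   do
      if (m == 0) then
         n = n + 1               ! ack(0,n) = n+1; n now holds a result
         if (sp == 0) exit       ! no pending outer call: n is the final answer
         m = stack(sp) - 1       ! resume the pending call as ack(m-1, result)
         sp = sp - 1
      else if (n == 0) then
         m = m - 1               ! ack(m,0) = ack(m-1,1)
         n = 1
      else
         sp = sp + 1             ! remember m; evaluate the inner ack(m,n-1) first
         stack(sp) = m
         n = n - 1
      end if
   end do
   write(*,*) n                  ! prints 32765 for ack(3,12)
end program ack_stack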

Jim Dempsey

Steven_L_Intel1
Employee
4,267 Views

I tried that argument on my CS professor - didn't fly.

Reply