Confusing run time of fortran code compared to c code

edward_s_3 · ‎07-13-2016

I took the example C code from chapter 3 of the book "Intel Xeon Phi Coprocessor High Performance Programming" and rewrote it in Fortran. I have run both versions on our Intel(R) Xeon(R) CPU E5-2620 v3, and in both cases I am timing the execution of the stencil9pt_base function/subroutine. At this point there is no parallelization of the code; I want to make sure the base version is working properly first. I found that while I get the same resulting array, the C code runs about twice as fast as my Fortran version when compiled with the same optimization level. I expected to get about the same execution time for both versions. Running stencil9pt_base 10 times takes 6.853 seconds in the C version and 14.364 seconds in the Fortran version.

My first thought was that perhaps the compiler was automatically using a double length of 16 bytes in Fortran and only 8 bytes in C, but that does not appear to be the case. I specified the default real length for the Fortran compiler and it made no difference.

The code is available here in full with a makefile:

git clone https://zorro_has_gangrene@bitbucket.org/zorro_has_gangrene/chapter3.git

and I will paste in the section that is getting timed here too:

C version

void
stencil9pt_base(REAL *finp, REAL *foutp, 
		int width, int height,
                REAL ctr, REAL next, REAL diag,int count)
{
  REAL *fin = finp;
  REAL *fout = foutp;
  int i,x,y;

  for (i=0; i<count; i++) {
    for (y=1; y < height-1; y++) {
       // starting center pt (avoid halo)
      int c = 1 + y*WIDTHP+1;
      // offsets from center pt.
      int n = c-WIDTHP;
      int s = c+WIDTHP;
      int e = c+1;
      int w = c-1;
      int nw = n-1;
      int ne = n+1;
      int sw = s-1;
      int se = s+1;

      for (x=1; x < width-1; x++) {
        fout[c] = diag * fin[nw] +
                  diag * fin[ne] +
                  diag * fin[sw] +
                  diag * fin[se] +
                  next * fin[w] +
                  next * fin[e] +
                  next * fin[n] +
                  next * fin[s] +
                  ctr  * fin[c];

        // increment to next location
        c++;n++;s++;e++;w++;nw++;ne++;sw++;se++;
      }
    }
    REAL *ftmp = fin;
    fin = fout;
    fout = ftmp;
  }
  return;
}

FORTRAN version

  subroutine stencil9pt_base(fin,fout,width,height,ctr,next,diag,intCount)
    integer, parameter   :: dp=kind(0.d0)  
    real(dp),pointer,intent(inout) :: fin(:)
    real(dp),pointer,intent(inout) :: fout(:)
    integer, intent(in)  :: width,height,intCount
    real(dp),intent(in)  :: ctr,next,diag
    real(dp),pointer     :: ftmp(:)
    integer :: i,x,y,c,n,s,e,w,nw,ne,sw,se
    print *, "Running stencil kernel:", intCount,height,width,WIDTHPCONSTANT
    allocate(ftmp(width*height))

    do i=1,intCount
       do y=2,height-1
          ! starting center pt (avoid halo)
          c=1+(y-1)*WIDTHPCONSTANT+1
          ! offsets from center pt.
          n=c-WIDTHPCONSTANT
          s=c+WIDTHPCONSTANT

          e=c+1
          w=c-1
          nw=n-1
          ne=n+1
          sw=s-1
          se=s+1
          do x=2,width-1
             fout(c)=diag*fin(nw)+ &
                  diag*fin(ne)+ &
                  diag*fin(sw)+ &
                  diag*fin(se)+ &
                  next*fin(w)+ &
                  next*fin(e)+ &
                  next*fin(n)+ &
                  next*fin(s) + &
                  ctr*fin(c)
             c=c+1
             n=n+1
             s=s+1
             e=e+1
             w=w+1
             nw=nw+1
             ne=ne+1
             sw=sw+1
             se=se+1
          enddo
       enddo
      ftmp=>fin
      fin=>fout
      fout=>ftmp
   enddo
 end subroutine stencil9pt_base

FortranFan · ‎07-13-2016

Have you done some code profiling? How big are width and height and do they change from call to call? Perhaps it is the memory allocation on line 10 which is the problem? Perhaps you can make your ftmp variable in Fortran an automatic array, declared simply as real(dp) :: ftmp(width*height)?

TimP · ‎07-13-2016

You might check the optimization report. For example, icc may deal better with the extreme number of post-incremented indices than ifort does. I'll fire up my ancient machine and try to grab your code.

As your optimization report indicates that the compiler assumes aliasing which prevents vectorization for both your C and Fortran examples, there is little point in comparing performance on MIC, if you were trying to compare performance of the inner loop. It does appear that the time is nearly entirely spent elsewhere.

Your Fortran pointer arrays are assumed to be possibly non-contiguous in memory unless they are declared contiguous. This alone may more than account for the performance difference, although I still don't see that you are solving the same problem in C and Fortran.

Xiaoping_D_Intel · ‎07-13-2016

Tried the code from git clone:

1. Both makefiles use the "-O0" option, which disables all optimizations. Normally at least "-O2" should be applied for performance testing. Using -O3, the results I got on my Xeon E5-2680 are C: 1.951s and Fortran: 4.251s.

2. In the Fortran code the subroutine is defined in a module source file and called from another source file, so enabling IPO with "-ipo" can help optimization. After adding "-ipo" to the makefile, the innermost loop of "stencil9pt_base" can be vectorized by the compiler. The result for the vectorized Fortran code is 3.311s.

3. Adding "CONTIGUOUS" to the declarations of "fin" and "fout", as Tim suggested, gives better vectorized code: 2.739s. Combining IPO with "CONTIGUOUS", the Fortran code gets to 1.509s. It is faster than the C code now.

4. The optimization report of the C code shows an assumed dependency between "fin" and "fout" in the innermost loop. It must be caused by the pointer exchange after the loop. Adding "#pragma ivdep" before the loop to remove this assumed dependency allows it to be vectorized. Its test result is 1.510s. Now the C and Fortran codes have quite similar performance.

Thanks,

Xiaoping Duan

Intel Customer Support

TimP · ‎07-14-2016

In the C code, adding a restrict qualifier for fout was sufficient to enable optimization. Many C compilers default to C99, but Intel requires -std=c99 to be set.
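As a minimal sketch of the restrict idea, assuming a row-major grid whose padded row length is passed in as `stride` (standing in for WIDTHP), a single sweep might look like the following. The function name `stencil9pt_sweep`, the `stride` parameter and the simplified centre index `y*stride + x` are illustrative assumptions, not the book's code, and the pointer swap between sweeps is left to the caller.

```c
#include <assert.h>
#include <stddef.h>

/* One sweep of the 9-point stencil over the interior points.
   restrict promises the compiler that fin and fout never overlap,
   which removes the assumed dependency between loads and stores.
   stride is a hypothetical parameter: the padded row length. */
void stencil9pt_sweep(const double *restrict fin, double *restrict fout,
                      int width, int height, int stride,
                      double ctr, double next, double diag)
{
    assert(stride >= width);
    for (int y = 1; y < height - 1; y++) {
        for (int x = 1; x < width - 1; x++) {
            size_t c = (size_t)y * (size_t)stride + (size_t)x;
            fout[c] = diag * fin[c - stride - 1] +
                      diag * fin[c - stride + 1] +
                      diag * fin[c + stride - 1] +
                      diag * fin[c + stride + 1] +
                      next * fin[c - 1] +
                      next * fin[c + 1] +
                      next * fin[c - stride] +
                      next * fin[c + stride] +
                      ctr  * fin[c];
        }
    }
}
```

The caller has to keep the promise: passing the same buffer as both `fin` and `fout` would be undefined behaviour. Swapping the two pointers between sweeps is fine, since they still refer to distinct buffers within each call.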

I suppose Xiaoping may have guessed right, that you don't care about testing the coprocessor.

jimdempseyatthecove · ‎07-14-2016

Also, in the event that WIDTHPCONSTANT is indeed a constant, IOW "const" in C/C++ or PARAMETER in Fortran, then you can make your displacements about c also constants. By doing this you will decrease the number of registers used, and simplify the traversal.

 subroutine stencil9pt_base(fin,fout,width,height,ctr,next,diag,intCount)
   integer, parameter   :: dp=kind(0.d0)  
   real(dp),pointer,intent(inout), CONTIGUOUS :: fin(:)
   real(dp),pointer,intent(inout), CONTIGUOUS :: fout(:)
   integer, intent(in)  :: width,height,intCount
   real(dp),intent(in)  :: ctr,next,diag
   real(dp),pointer     :: ftmp(:)
   ! *** assuming WIDTHPCONSTANT is indeed a constant PARAMETER defined elsewhere
   ! offsets from center pt created as parameters.
   integer, parameter :: n = -WIDTHPCONSTANT
   integer, parameter :: s =   WIDTHPCONSTANT
   integer, parameter :: e =  1
   integer, parameter :: w = -1
   integer, parameter :: nw = n-1
   integer, parameter :: ne = n+1
   integer, parameter :: sw = s-1
   integer, parameter :: se = s+1
   integer :: i,x,y,c
   print *, "Running stencil kernel:", intCount,height,width,WIDTHPCONSTANT
   allocate(ftmp(width*height))
   do i=1,intCount
      do y=2,height-1
         ! starting center pt (avoid halo)
         x=1+(y-1)*WIDTHPCONSTANT+1
         do c=x,x+width-3
            fout(c)=diag*fin(c+nw)+ &
                 diag*fin(c+ne)+ &
                 diag*fin(c+sw)+ &
                 diag*fin(c+se)+ &
                 next*fin(c+w)+ &
                 next*fin(c+e)+ &
                 next*fin(c+n)+ &
                 next*fin(c+s) + &
                 ctr*fin(c)
         enddo
      enddo
     ftmp=>fin
     fin=>fout
     fout=>ftmp
  enddo
end subroutine stencil9pt_base

*** above untested ***

The Intel64 instruction set has an addressing format called SIB (Scale, Index, Base), where the Base can be expressed either entirely by a register or by a register+Offset. Your former code was not utilizing the +Offset form. The above code (assuming WIDTHPCONSTANT is a parameter) will use the +Offset format (offsets will all be constant parameters), and this will eliminate eight of the index increments.

Furthermore, the entire loop now uses a single register for Index. This will make it easier for the compiler vectorization code to determine if the loop is vectorizable (this together with the CONTIGUOUS attribute as TimP suggests).
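The same constant-offset idea can be sketched in C. This is an illustration under assumptions rather than the book's code: `STRIDE` is a hypothetical compile-time row length standing in for WIDTHP, the `OFF_*` names are made up for the example, and the centre index is simplified to `y*STRIDE + x`.

```c
#include <assert.h>

/* Hypothetical padded row length, standing in for WIDTHP. */
enum { STRIDE = 8 };

/* Neighbour offsets relative to the centre index, fixed at compile time. */
enum {
    OFF_N  = -STRIDE,
    OFF_S  =  STRIDE,
    OFF_E  =  1,
    OFF_W  = -1,
    OFF_NW = OFF_N - 1,
    OFF_NE = OFF_N + 1,
    OFF_SW = OFF_S - 1,
    OFF_SE = OFF_S + 1
};

/* One sweep using a single running index c; every neighbour is
   addressed as c plus a constant, so no extra indices are incremented. */
void stencil9pt_const_offsets(const double *fin, double *fout,
                              int width, int height,
                              double ctr, double next, double diag)
{
    assert(width <= STRIDE);
    for (int y = 1; y < height - 1; y++) {
        int row = y * STRIDE;
        for (int c = row + 1; c < row + width - 1; c++) {
            fout[c] = diag * fin[c + OFF_NW] +
                      diag * fin[c + OFF_NE] +
                      diag * fin[c + OFF_SW] +
                      diag * fin[c + OFF_SE] +
                      next * fin[c + OFF_W] +
                      next * fin[c + OFF_E] +
                      next * fin[c + OFF_N] +
                      next * fin[c + OFF_S] +
                      ctr  * fin[c];
        }
    }
}
```

Whether the compiler actually emits register+offset addressing for each load depends on the compiler and flags, so the optimization report or the generated assembly is the place to confirm it.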

Jim Dempsey

edward_s_3 · ‎07-14-2016

Thanks so much for the excellent suggestions. I specified the pointers as contiguous with no other changes and it shaved about 4 seconds off of the FORTRAN code.

I wanted to follow the flow in the book, which takes the base unoptimized code, improves it by parallelizing and vectorizing, and then compares it to the results on the Phi card with all of the improvements. That is why I started by using the -O0 option when compiling. I used your other suggestions as I improved the code and am getting pretty impressive results both on the host and on the Phi card.