Bad performance with bitwise operations

Vladimir_F_1 · ‎03-25-2014

Hello, I noticed that the following code is noticeably slower with ifort 14.0.0 than with gfortran. Is there anything I can do to make the resulting executable run faster?

  use iso_fortran_env

  real(real32) :: x32
  integer(int64) :: t1, t2, trate

  call system_clock(count_rate=trate)
  call system_clock(count=t1)
  do i=1,int(2E9)
    x32 = swapb32(x32)
  end do
  call system_clock(count=t2)
  print *,x32,"time",real(t2-t1)/trate

contains
    function swapb32(x) result(res)
      real(real32) :: res
      real(real32),intent(in) :: x
      character(4) :: bytes
      integer(int32) :: t
      
      bytes = transfer(x,bytes) !equivalence very slightly faster, but problematic.

      t = ichar(bytes(4:4),int64)

      t = ior( ishftc(ichar(bytes(3:3),int32),8),  t )

      t = ior( ishftc(ichar(bytes(2:2),int32),16), t )

      t = ior( ishftc(ichar(bytes(1:1),int32),24), t )

      res = transfer(t, res)
        
    end function
end

> gfortran-4.9 -Ofast bitperf.f90 
> ./a.out 
   0.00000000     time   5.00823069    
> ifort -fast bitperf.f90 
ipo: remark #11001: performing single-file optimizations
ipo: remark #11006: generating object file /tmp/ipo_ifort0hhp0O.o
> ./a.out 
  0.0000000E+00 time   13.35413

pbkenned1 · ‎03-25-2014

I can shave a few seconds off ifort's run time by forcing the loop at line 8 to vectorize (!DIR$ SIMD), but we are still slow compared to gfortran. I will discuss with the developers. What hardware are you using?

Patrick

Vladimir_F_1 · ‎03-25-2014

Thank you for your suggestion. I am using Intel Core i7-3770.

Vladimir

jimdempseyatthecove · ‎03-25-2014

Given the occasional need to convert between big-endian and little-endian data after a read and before a write, you would think Fortran would have intrinsic functions SWAB and ISWAB. For now, it would be best to create a generic function that calls a C function to swap bytes, with 2-, 4- and 8-byte variations.

Jim Dempsey

jimdempseyatthecove · ‎03-25-2014

Patrick, the loop at line 8 cannot be vectorized. The variable x32 has temporal dependencies.

In actual use, the programmer may have an array of real32 and/or real64 values in reversed byte order. Swapping such an array could be vectorizable with a bit of work.

Jim Dempsey
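
As a rough illustration of the array case described above, the sketch below swaps each element independently, so no iteration depends on the previous one. The names swap_array_sketch, swap_i32 and swap_r32_array are made up for this example and do not appear in the thread. Whether ifort or gfortran actually vectorizes the loop is an assumption that should be checked against the compiler's optimization report.

```fortran
module swap_array_sketch
  use iso_fortran_env, only: int32, real32
  implicit none
contains

  ! Reverse the four bytes of a 32-bit integer using only shifts and masks.
  ! ISHFT is a logical shift, so vacated bits are filled with zeros.
  elemental function swap_i32(v) result(r)
    integer(int32), intent(in) :: v
    integer(int32) :: r
    r = ior(ior(ishft(v, 24), &
                ishft(iand(v, int(z'0000FF00', int32)), 8)), &
            ior(iand(ishft(v, -8), int(z'0000FF00', int32)), &
                iand(ishft(v, -24), int(z'000000FF', int32))))
  end function swap_i32

  ! Byte-swap every element of a real32 array in place.
  ! Each iteration touches only a(i), so iterations are independent.
  subroutine swap_r32_array(a)
    real(real32), intent(inout) :: a(:)
    integer :: i
    do i = 1, size(a)
      a(i) = transfer(swap_i32(transfer(a(i), 0_int32)), a(i))
    end do
  end subroutine swap_r32_array

end module swap_array_sketch
```

The sketch still uses TRANSFER for the real-to-integer reinterpretation, so on compilers where TRANSFER itself is the bottleneck it may not help. Timing it on a large array would be needed before drawing conclusions.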

pbkenned1 · ‎03-25-2014

Thanks for the feedback, Jim. Yes, I know there are FLOW and OUTPUT dependencies if you force vectorisation, but I was just curious to see what kind of speedup you could expect if those were removed (not much, it appears).

The performance laggard is the transfer intrinsic.

In particular, there is a lot of overhead coming from calls to 'optimized' memmove; these also inhibit vectorisation:

[hacman@starch7 ifort-14.0.2.144-g]$ ifort -O3 -xHost -opt-report U508330-orig.f90 -opt-report 2>&1 |grep memmove
-> memmove(EXTERN)
-> memmove(EXTERN)
U508330-orig.f90(10:11-10:11):VEC:MAIN__: vectorization support: call to function memmove cannot be vectorized
vectorization support: call to function memmove cannot be vectorized
U508330-orig.f90(10:11-10:11):CG:MAIN__: call to memmove implemented as a call to optimized library version
U508330-orig.f90(10:11-10:11):CG:MAIN__: call to memmove implemented as a call to optimized library version
[hacman@starch7 ifort-14.0.2.144-g]$

I've logged a performance issue with the developers, tracking ID #DPD200254834. I'll keep this thread updated with news.

Patrick

TimP · ‎03-25-2014

I would have thought the test data object in the outer scope would need to be a sizeable array in order for vectorization and related performance tests to be meaningful.

memmove() should use suitable instructions (wider than 32-bit) internally, but it has to check (as memcpy would) for alignment adjustments, besides checking for a requirement to reverse direction in case of overlap. So it must be slow when used on a scalar real.

I don't see how reversing the direction helps here; the !dir$ simd assertion may be OK on the basis that the memory reads are at least as large as the write-back and use the same group of bytes.

"optimized library version" should mean Intel's own version, so should not share the limitation of past glibc of using 64-bit moves to accommodate those CPUs which preferred those to wider moves (including Westmere CPUs in case of misalignment). But it would take a fairly big move to overcome all the initial checking overhead.

Presumably, !dir$ simd prevents a memmove, and then allows streaming store optimizations if they look appropriate (e.g. on an array of several thousand elements).

If x32 is declared as an aligned array, !dir$ vector nontemporal aligned might be tried.

jimdempseyatthecove · ‎03-26-2014

For performance in Fortran-only code, I would create a user-defined type with a UNION (untested code):

[fortran]
function Real4Swap(x)
real(4) :: x
type Real4union
    union
      map
        real(4) :: asReal4
      end map
      map
          integer(1) :: asInteger1(4)
      end map
    end union
end type Real4union
type(Real4union) :: in, out
in%asReal4 = x
out%asInteger1(1) = in%asInteger1(4)
out%asInteger1(2) = in%asInteger1(3)
out%asInteger1(3) = in%asInteger1(2)
out%asInteger1(4) = in%asInteger1(1)
Real4Swap = out%asReal4
end function Real4Swap
[/fortran]

Make a similar function for the other types, then make a generic wrapper (SWAB)

Vladimir, since you posted this, you may have a need. If you implement this, could you post the comparative test data using the above vs. TRANSFER?

Jim Dempsey
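
A generic wrapper of the kind mentioned above could look roughly like the following. This is a standard-Fortran stand-in built on the MVBITS intrinsic, not the UNION-based or C-backed version suggested in the thread, and it has not been benchmarked. The module name swab_generic_sketch and the specific names swab16, swab32 and swab64 are hypothetical.

```fortran
module swab_generic_sketch
  use iso_fortran_env, only: int16, int32, int64
  implicit none

  ! One generic name, resolved by the kind of the argument.
  interface swab
    module procedure swab16, swab32, swab64
  end interface swab

contains

  elemental function swab16(v) result(r)
    integer(int16), intent(in) :: v
    integer(int16) :: r
    integer :: k
    r = 0_int16
    do k = 0, 1
      call mvbits(v, 8*k, 8, r, 8*(1 - k))  ! byte k of v -> byte 1-k of r
    end do
  end function swab16

  elemental function swab32(v) result(r)
    integer(int32), intent(in) :: v
    integer(int32) :: r
    integer :: k
    r = 0_int32
    do k = 0, 3
      call mvbits(v, 8*k, 8, r, 8*(3 - k))  ! byte k of v -> byte 3-k of r
    end do
  end function swab32

  elemental function swab64(v) result(r)
    integer(int64), intent(in) :: v
    integer(int64) :: r
    integer :: k
    r = 0_int64
    do k = 0, 7
      call mvbits(v, 8*k, 8, r, 8*(7 - k))  ! byte k of v -> byte 7-k of r
    end do
  end function swab64

end module swab_generic_sketch
```

With this module in scope, swab(int(z'01020304', int32)) yields the bit pattern 04030201. Because the functions are elemental, the same call also works on whole integer arrays.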

FortranFan · ‎03-26-2014

Jim,

I wouldn't be surprised if Intel Fortran's implementations of TRANSFER (a standard Fortran feature) and UNION/MAP (non-standard Fortran: an Intel extension) have commonalities and similar performance. So Vladimir may be better off with a mixed-language solution.

jimdempseyatthecove · ‎03-26-2014

I personally would use a mixed-language solution and provide multiple interfaces whereby you could pass in scalars and arrays. It would make use of the _bswap and _bswap64 intrinsic functions.

Jim Dempsey

Vladimir_F_1 · ‎03-27-2014

The real4swap function is only slightly faster (11.1 s vs 13.3 s) and unacceptably non-standard (for me).

What actually works is equivalence:

    function swapb32eq(x) result(res)
      real(real32) :: res
      real(real32),intent(in) :: x
      character(4) :: bytes
      integer(int32) :: t, tmp
      real(real32) :: rtmp
      equivalence (tmp, rtmp, bytes)
      
      tmp = x !equivalence very slightly faster, but problematic.

      t = ichar(bytes(4:4),int64)

      t = ior( ishftc(ichar(bytes(3:3),int32),8),  t )

      t = ior( ishftc(ichar(bytes(2:2),int32),16), t )

      t = ior( ishftc(ichar(bytes(1:1),int32),24), t )

      res = rtmp
        
    end function

This takes 5.12 seconds on both gfortran and ifort. I wanted to avoid that as I think there is still some non-standardness there (gfortran complains about non-numeric type in equivalence when strict standard conformance is enabled).

TimP · ‎03-27-2014

This equivalence technically violates F77, but I don't see how TRANSFER could be safer.

Vladimir_F_1 · ‎03-27-2014

Sorry, this is obviously wrong. The correct version is

    function swapb32eq(x) result(res)
      real(real32) :: res
      real(real32),intent(in) :: x
      character(4) :: bytes
      integer(int32) :: t
      real(real32) :: tmp, rtmp
      equivalence (tmp, bytes)
      equivalence (t, rtmp)
      
      tmp = x !equivalence very slightly faster, but problematic.

      t = ichar(bytes(4:4),int64)

      t = ior( ishftc(ichar(bytes(3:3),int32),8),  t )

      t = ior( ishftc(ichar(bytes(2:2),int32),16), t )

      t = ior( ishftc(ichar(bytes(1:1),int32),24), t )

      res = rtmp
        
    end function

But it remains faster, at least for ifort. Strangely, it is slightly slower for gfortran 4.9 but not that much.

Steven_L_Intel1 · ‎03-27-2014

I'm a bit hesitant to post this, but if you have the 14.0 compiler, this works.

module swap_intrinsics
interface
function bswap (arg)bind(C,name="_bswap")
use, intrinsic :: iso_c_binding
!DEC$ ATTRIBUTES KNOWN_INTRINSIC :: bswap
integer(C_INT32_T) :: bswap
integer(C_INT32_T), VALUE :: arg
end function bswap
function bswap64 (arg)bind(C,name="_bswap64")
use, intrinsic :: iso_c_binding
!DEC$ ATTRIBUTES KNOWN_INTRINSIC :: bswap64
integer(C_INT64_T) :: bswap64
integer(C_INT64_T), VALUE :: arg
end function bswap64
end interface
end module swap_intrinsics

program test
use swap_intrinsics
integer a,b
integer(8) c,d
a = Z'01020304'
b = bswap(a)
print '(Z8.8)', b
c = Z'0102030405060708'
d = bswap64(c)
print '(Z16.16)', d
end

This is NOT a documented or supported feature, but it works for instruction intrinsics whose arguments and results are available Fortran types. It doesn't work for instructions that access MMX or SSE registers.

Another example:

program test_cpuid
    use ISO_C_BINDING
    implicit none

    interface
    subroutine cpuid (CPUInfo, InfoType) BIND(C,name="__cpuid")
    import
    !DEC$ ATTRIBUTES KNOWN_INTRINSIC :: cpuid
    integer(C_INT), dimension(4), intent(out) :: CPUInfo
    integer(C_INT), intent(in), value :: InfoType
    end subroutine cpuid
    end interface

    integer(C_INT), dimension(4) :: Info
    call cpuid(Info, 0)
    print '(3A4)', Info([2,4,3])
    end

The name in name= should be the C intrinsic name. This tape will self-destruct in five seconds.

jimdempseyatthecove · ‎03-27-2014

Thanks for sharing this.

Jim Dempsey

Steven_L_Intel1 · ‎03-27-2014

For fun, without running the CPUID example, can anyone guess what it will print when run on, say, a Core i5 processor?

TimP · ‎03-27-2014

I don't see what's specific about Core i5, but it gives me what I assume Steve intended with ifort and gfortran on Win8.1 x64. My personal preference is to read from /proc/cpuinfo (even on Windows/cygwin).

character(80) txtline

open(11,file='/proc/cpuinfo',action='READ',form='formatted',
&access='stream',iostat=ios)
if(ios==0)then
read(11,'(a)',iostat=ios)txtline
do while(ios == 0 .and. index(txtline,'model name') == 0)
read(11,'(a)',iostat=ios)txtline
enddo

....

else
write(*,*)'Failed to open /proc/cpuinfo'
endif

Steven_L_Intel1 · ‎03-27-2014

Well, I said Core i5 as the results would differ on, say, an Opteron.

jimdempseyatthecove · ‎03-28-2014

>> the results would differ on, say, an Opteron.

This is a trick question and is dependent on how the runtime system defines _bswap and _bswap64.

If the byte orders are truly to be swapped within the context of INTEGER(4) and INTEGER(8), then the results would be 04030201 and 0807060504030201.

However:

If the byte orders are defined to be swapped Little-Endian to Big-Endian (or Big-Endian to Little-Endian), then the results would be 04030201 and 0403020108070605 (each 32-bit/4-byte group swapped). Little-Endian/Big-Endian is dependent upon the memory "word" size, not the WORD size. This is typically the GP register width.

I think this is an implementation issue. In conversion of data from a binary file written in Big-Endian to Little-Endian you would want the second definition when converting REAL(8) (using bswap64).

Jim Dempsey

Steven_L_Intel1 · ‎03-28-2014

Jim, my "fun quiz" was specific to the CPUID example. I would not expect BSWAP to be different on the different processors. No trick involved.
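
On the REAL(8) question raised earlier in the thread, a small standard-Fortran check may help. The sketch assumes a little-endian host using IEEE 754 binary64 for real64; the module and function names are invented for this example. Under those assumptions, a big-endian 8-byte value is recovered by reversing all eight bytes, which corresponds to the 0807060504030201 result rather than a swap within each 4-byte half.

```fortran
module be_real64_sketch
  use iso_fortran_env, only: int8, real64
  implicit none
contains

  ! Interpret eight bytes stored in big-endian order as a real64 value.
  ! Assumes the host is little-endian, so the byte order is fully reversed.
  pure function be_bytes_to_real64(be) result(x)
    integer(int8), intent(in) :: be(8)
    real(real64) :: x
    x = transfer(be(8:1:-1), x)
  end function be_bytes_to_real64

end module be_real64_sketch
```

For instance, 1.0 in binary64 is 3F F0 00 00 00 00 00 00 in big-endian byte order. Passing [63_int8, -16_int8, 0_int8, 0_int8, 0_int8, 0_int8, 0_int8, 0_int8] (where -16_int8 is the bit pattern F0) returns 1.0 under the assumptions stated above.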
