Hello, I noticed that the byte-swapping code below is noticeably slower with ifort 14.0.0 than with gfortran. Is there anything I can do to make the ifort build run faster?
use iso_fortran_env

real(real32) :: x32
integer(int64) :: t1, t2, trate

call system_clock(count_rate=trate)
call system_clock(count=t1)
do i=1,int(2E9)
  x32 = swapb32(x32)
end do
call system_clock(count=t2)

print *,x32,"time",real(t2-t1)/trate

contains

function swapb32(x) result(res)
  real(real32) :: res
  real(real32),intent(in) :: x
  character(4) :: bytes
  integer(int32) :: t

  bytes = transfer(x,bytes) !equivalence very slightly faster, but problematic.

  t = ichar(bytes(4:4),int64)
  t = ior( ishftc(ichar(bytes(3:3),int32),8), t )
  t = ior( ishftc(ichar(bytes(2:2),int32),16), t )
  t = ior( ishftc(ichar(bytes(1:1),int32),24), t )

  res = transfer(t, res)
end function
end
> gfortran-4.9 -Ofast bitperf.f90
> ./a.out
   0.00000000     time   5.00823069
> ifort -fast bitperf.f90
ipo: remark #11001: performing single-file optimizations
ipo: remark #11006: generating object file /tmp/ipo_ifort0hhp0O.o
> ./a.out
  0.0000000E+00 time   13.35413
I can shave a few seconds off ifort's run time by forcing the loop at line 8 to vectorize (!DIR$ SIMD), but we are still slow compared to gfortran. I will discuss this with the developers. What hardware are you using?
Patrick
Thank you for your suggestion. I am using Intel Core i7-3770.
Vladimir
Given the occasional need to handle big-endian and little-endian data after a read and before a write, you would think Fortran would have intrinsic functions SWAB and ISWAB. For now, it would be best to create a generic function that calls a C function to swap bytes, with 2-, 4- and 8-byte variants.
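To illustrate the generic-function idea, here is a minimal sketch that is not from this thread. It swaps in pure Fortran bit intrinsics for the C call suggested above, so that it compiles on its own. The module name `byteswap_sketch` and the generic name `swab` are hypothetical, and no claim is made about how fast it is with any particular compiler.

```fortran
! Sketch only: one generic name, swab, covering 2-, 4- and 8-byte integers.
! Module and procedure names are hypothetical, not from the thread.
module byteswap_sketch
  use, intrinsic :: iso_fortran_env, only: int16, int32, int64
  implicit none
  private
  public :: swab

  interface swab
    module procedure swab16, swab32, swab64
  end interface swab

contains

  elemental function swab16(x) result(res)
    integer(int16), intent(in) :: x
    integer(int16) :: res
    ! Rotating a 16-bit value by 8 bits exchanges its two bytes.
    res = ishftc(x, 8)
  end function swab16

  elemental function swab32(x) result(res)
    integer(int32), intent(in) :: x
    integer(int32) :: res
    integer :: k
    res = 0_int32
    do k = 0, 3
      ! Move byte k of x to byte position 3-k of the result.
      res = ior(res, ishft(ibits(x, 8*k, 8), 8*(3 - k)))
    end do
  end function swab32

  elemental function swab64(x) result(res)
    integer(int64), intent(in) :: x
    integer(int64) :: res
    integer :: k
    res = 0_int64
    do k = 0, 7
      ! Move byte k of x to byte position 7-k of the result.
      res = ior(res, ishft(ibits(x, 8*k, 8), 8*(7 - k)))
    end do
  end function swab64

end module byteswap_sketch
```

Because the specific procedures are elemental, `swab` can be applied to a scalar or to a whole integer array; reals would first have to be reinterpreted as integers of the same size.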
Jim Dempsey
Patrick, the loop at line 8 cannot be vectorized. The variable x32 has temporal dependencies.
In actual use, the programmer may have an array of backwards byte order real32 and/or real64 variables. This could be vectorizable with a bit of work.
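The array case can be sketched as follows (hypothetical names, not code from this thread). Each iteration reads and writes only `x(i)`, so there is no dependence between iterations, unlike the benchmark loop where every call consumes the previous result. Whether a given compiler actually vectorizes this loop is not something the sketch guarantees.

```fortran
! Sketch only: in-place byte swap of every element of a real32 array.
! Module and subroutine names are hypothetical.
module array_swap_sketch
  use, intrinsic :: iso_fortran_env, only: int32, real32
  implicit none
contains

  subroutine swap_real32_array(x)
    real(real32), intent(inout) :: x(:)
    integer(int32) :: bits, swapped
    integer :: i, k
    do i = 1, size(x)
      ! Reinterpret the element as a 32-bit integer (no numeric conversion).
      bits = transfer(x(i), bits)
      swapped = 0_int32
      do k = 0, 3
        ! Move byte k to byte position 3-k.
        swapped = ior(swapped, ishft(ibits(bits, 8*k, 8), 8*(3 - k)))
      end do
      ! Store the reversed bit pattern back as a real.
      x(i) = transfer(swapped, x(i))
    end do
  end subroutine swap_real32_array

end module array_swap_sketch
```

Calling the routine twice on the same array restores the original bit patterns, which makes a simple sanity check.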
Jim Dempsey
Thanks for the feedback Jim. Yes, I know there are FLOW and OUTPUT dependencies if you force vectorisation, but I was just curious to see what kind of speedup you could expect if those were removed (not much, it appears).
The performance laggard is the transfer intrinsic.
In particular, there is a lot of overhead coming from calls to 'optimized' memmove; these also inhibit vectorisation:
[hacman@starch7 ifort-14.0.2.144-g]$ ifort -O3 -xHost -opt-report U508330-orig.f90 -opt-report 2>&1 |grep memmove
-> memmove(EXTERN)
-> memmove(EXTERN)
U508330-orig.f90(10:11-10:11):VEC:MAIN__: vectorization support: call to function memmove cannot be vectorized
vectorization support: call to function memmove cannot be vectorized
U508330-orig.f90(10:11-10:11):CG:MAIN__: call to memmove implemented as a call to optimized library version
U508330-orig.f90(10:11-10:11):CG:MAIN__: call to memmove implemented as a call to optimized library version
[hacman@starch7 ifort-14.0.2.144-g]$
I've logged a performance issue with the developers, tracking ID #DPD200254834. I'll keep this thread updated with news.
Patrick
I would have thought the test data object in the outer scope would need to be a sizeable array in order for vectorization and related performance tests to be meaningful.
memmove() should use suitable instructions (wider than 32-bit) internally, but it has to check (as memcpy would) for alignment adjustments, besides checking for a requirement to reverse direction in case of overlap. So it must be slow when used on a scalar real.
I don't see how reversing the direction helps here; the !dir$ simd assertion may be OK on the basis that the memory reads are at least as large as the write-back and use the same group of bytes.
"optimized library version" should mean Intel's own version, so should not share the limitation of past glibc of using 64-bit moves to accommodate those CPUs which preferred those to wider moves (including Westmere CPUs in case of misalignment). But it would take a fairly big move to overcome all the initial checking overhead.
Presumably, !dir$ simd prevents a memmove, and then allows streaming store optimizations if they look appropriate (e.g. on an array of several thousand elements).
If x32 is declared as an aligned array, !dir$ vector nontemporal aligned might be tried.
For performance in Fortran-only code, I would create a user-defined type with a UNION (untested code):
[fortran]
function Real4Swap(x)
  ! UNION/MAP is a non-standard extension, so this is not portable.
  real(4) :: Real4Swap
  real(4) :: x
  type Real4union
    union
      map
        real(4) :: asReal4
      end map
      map
        integer(1) :: asInteger1(4)
      end map
    end union
  end type Real4union
  type(Real4union) :: in, out
  in%asReal4 = x
  ! Copy the four bytes in reverse order.
  out%asInteger1(1) = in%asInteger1(4)
  out%asInteger1(2) = in%asInteger1(3)
  out%asInteger1(3) = in%asInteger1(2)
  out%asInteger1(4) = in%asInteger1(1)
  Real4Swap = out%asReal4
end function Real4Swap
[/fortran]
Make a similar function for the other types, then make a generic wrapper (SWAB).
Vladimir, since you posted this, you may have a need. If you implement this, could you post comparative timings for the above vs TRANSFER?
Jim Dempsey
Jim,
I wouldn't be surprised if Intel Fortran's implementations of TRANSFER (a standard Fortran feature) and UNION/MAP (non-standard Fortran: an Intel extension) have a lot in common and perform similarly. So Vladimir may be better off with a mixed-language solution.
I personally would use a mixed-language solution and provide multiple interfaces so that you could pass in scalars and arrays. It would make use of the _bswap and _bswap64 intrinsic functions.
Jim Dempsey
The real4swap function is only slightly faster (11.1 s vs 13.3 s) and unacceptably non-standard (for me).
What actually works is equivalence:
function swapb32eq(x) result(res)
  real(real32) :: res
  real(real32),intent(in) :: x
  character(4) :: bytes
  integer(int32) :: t, tmp
  real(real32) :: rtmp
  equivalence (tmp, rtmp, bytes)

  tmp = x !equivalence very slightly faster, but problematic.

  t = ichar(bytes(4:4),int64)
  t = ior( ishftc(ichar(bytes(3:3),int32),8), t )
  t = ior( ishftc(ichar(bytes(2:2),int32),16), t )
  t = ior( ishftc(ichar(bytes(1:1),int32),24), t )

  res = rtmp
end function
This takes 5.12 seconds on both gfortran and ifort. I wanted to avoid that as I think there is still some non-standardness there (gfortran complains about non-numeric type in equivalence when strict standard conformance is enabled).
Sorry, this is obviously wrong. The correct version is
function swapb32eq(x) result(res)
  real(real32) :: res
  real(real32),intent(in) :: x
  character(4) :: bytes
  integer(int32) :: t
  real(real32) :: tmp, rtmp
  equivalence (tmp, bytes)
  equivalence (t, rtmp)

  tmp = x !equivalence very slightly faster, but problematic.

  t = ichar(bytes(4:4),int64)
  t = ior( ishftc(ichar(bytes(3:3),int32),8), t )
  t = ior( ishftc(ichar(bytes(2:2),int32),16), t )
  t = ior( ishftc(ichar(bytes(1:1),int32),24), t )

  res = rtmp
end function
But it remains faster, at least for ifort. Strangely, it is slightly slower for gfortran 4.9 but not that much.
I'm a bit hesitant to post this, but if you have the 14.0 compiler, this works.
module swap_intrinsics
  interface
    function bswap (arg) bind(C,name="_bswap")
      use, intrinsic :: iso_c_binding
      !DEC$ ATTRIBUTES KNOWN_INTRINSIC :: bswap
      integer(C_INT32_T) :: bswap
      integer(C_INT32_T), VALUE :: arg
    end function bswap
    function bswap64 (arg) bind(C,name="_bswap64")
      use, intrinsic :: iso_c_binding
      !DEC$ ATTRIBUTES KNOWN_INTRINSIC :: bswap64
      integer(C_INT64_T) :: bswap64
      integer(C_INT64_T), VALUE :: arg
    end function bswap64
  end interface
end module swap_intrinsics

program test
  use swap_intrinsics
  integer a,b
  integer(8) c,d
  a = Z'01020304'
  b = bswap(a)
  print '(Z8.8)', b
  c = Z'0102030405060708'
  d = bswap64(c)
  print '(Z16.16)', d
end
This is NOT a documented or supported feature, but it works for instruction intrinsics whose arguments and results are available Fortran types. It doesn't work for instructions that access MMX or SSE registers.
Another example:
program test_cpuid
  use ISO_C_BINDING
  implicit none
  interface
    subroutine cpuid (CPUInfo, InfoType) BIND(C,name="__cpuid")
      import
      !DEC$ ATTRIBUTES KNOWN_INTRINSIC :: cpuid
      integer(C_INT), dimension(4), intent(out) :: CPUInfo
      integer(C_INT), intent(in), value :: InfoType
    end subroutine cpuid
  end interface
  integer(C_INT), dimension(4) :: Info
  call cpuid(Info, 0)
  print '(3A4)', Info([2,4,3])
end
The name in name= should be the C intrinsic name. This tape will self-destruct in five seconds.
Thanks for sharing this.
Jim Dempsey
For fun, without running the CPUID example, can anyone guess what it will print when run on, say, a Core i5 processor?
I don't see what's specific about the Core i5, but it gives me what I assume Steve intended, with both ifort and gfortran on Win 8.1 x64. My personal preference is to read from /proc/cpuinfo (even on Windows/Cygwin).
      character(80) txtline
      open(11,file='/proc/cpuinfo',action='READ',form='formatted',
     &access='stream',iostat=ios)
      if(ios==0)then
        read(11,'(a)',iostat=ios)txtline
        do while(ios == 0 .and. index(txtline,'model name') == 0)
          read(11,'(a)',iostat=ios)txtline
        enddo
        ....
      else
        write(*,*)'Failed to open /proc/cpuinfo'
      endif
Well, I said Core i5 as the results would differ on, say, an Opteron.
>> the results would differ on, say, an Opteron.
This is a trick question, and the answer depends on how the runtime system defines _bswap and _bswap64.
If the byte order is truly to be swapped within the context of INTEGER(4) and INTEGER(8), then the results are 04030201 and 0807060504030201.
However,
if the byte order is defined to be swapped Little-Endian to Big-Endian (or Big-Endian to Little-Endian), then the results are 04030201 and 0403020108070605 (each 32-bit/4-byte half swapped). Little-Endian/Big-Endian depends on the memory "word" size, not the WORD size. This is typically the GP register width.
I think this is an implementation issue. When converting data from a binary file written in Big-Endian to Little-Endian, you would want the second definition when converting REAL(8) (using bswap64).
Jim Dempsey
Jim, my "fun quiz" was specific to the CPUID example. I would not expect BSWAP to be different on the different processors. No trick involved.
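For readers puzzling over the quiz: CPUID leaf 0 returns a 12-character vendor string spread over EBX, EDX and ECX, which is why the example prints `Info([2,4,3])`. The sketch below (hypothetical names, not code from this thread) decodes three register values into that string without calling CPUID. The hex values in the usage note are the well-known leaf-0 register contents for Intel processors, supplied here as plain integers.

```fortran
! Sketch only: turn the three CPUID leaf-0 registers into the vendor string.
! Module and function names are hypothetical.
module cpuid_vendor_sketch
  use, intrinsic :: iso_fortran_env, only: int32
  implicit none
contains

  pure function reg_chars(reg) result(s)
    integer(int32), intent(in) :: reg
    character(len=4) :: s
    integer :: k
    do k = 0, 3
      ! The lowest byte of the register holds the first character.
      s(k+1:k+1) = achar(ibits(reg, 8*k, 8))
    end do
  end function reg_chars

  pure function vendor_string(ebx, edx, ecx) result(name)
    integer(int32), intent(in) :: ebx, edx, ecx
    character(len=12) :: name
    ! The string is stored in the order EBX, EDX, ECX.
    name = reg_chars(ebx) // reg_chars(edx) // reg_chars(ecx)
  end function vendor_string

end module cpuid_vendor_sketch
```

With EBX = Z'756E6547', EDX = Z'49656E69' and ECX = Z'6C65746E', `vendor_string` returns `GenuineIntel`; an AMD processor such as an Opteron reports register values that decode to `AuthenticAMD`, which is the difference the quiz was hinting at.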
