I have spent some time today trying to sort out a bottleneck in my code and have been using timef() from the portability library.
I have both linux (Ubuntu 16) and Windows (7) versions of the compiler - both 2017 - update 5. There is only one version of the code and the compiler switches for the two OSs are as good as the same.
It appears to me that the linux version is rounding to an integer. Here is a snippet of the debug output for linux:
kgsave= 2.000000 move to record= 0.000000 sizegrid= 1.000000 kgload= 1.000000
kgsave= 2.000000 move to record= 0.000000 sizegrid= 0.000000 kgload= 2.000000
kgsave= 1.000000 move to record= 0.000000 sizegrid= 1.000000 kgload= 1.000000
kgsave= 1.000000 move to record= 0.000000 sizegrid= 1.000000 kgload= 1.000000
And here is the same thing for Windows (a virtual machine on the linux box):
kgsave= 0.8750000 move to record= 0.000000 sizegrid= 0.4687500E-01 kgload= 0.7343750
kgsave= 0.7656250 move to record= 0.000000 sizegrid= 0.4687500E-01 kgload= 0.4062500
kgsave= 0.7968750 move to record= 0.000000 sizegrid= 0.3125000E-01 kgload= 0.5312500
kgsave= 0.8437500 move to record= 0.000000 sizegrid= 0.6250000E-01 kgload= 0.7343750
There are some efficiencies in the Windows libraries calling the Win API versus the Xorg/Motif calls in linux, which explain the faster times but not the precision differences.
The numbers were created simply with timer(1)=timef() before the call to the subroutine and timer(2)=timef() after it, then printing out timer(2)-timer(1). timer is declared as a selected_real_kind(15) array.
Am I doing something wrong or is this problem real?
I haven't looked into your problem, but I suggest that for a portable (Windows<->Linux) timer, use the OpenMP function omp_get_wtime() that returns a high precision timer as a double (as seconds).
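For illustration, a minimal sketch of that approach might look like the following. The program name and the dummy workload are made up for the example, and it assumes the code is built with OpenMP enabled (for instance -qopenmp with ifort or -fopenmp with gfortran) so that the omp_lib module is available.

```fortran
program wtime_demo
   use omp_lib, only: omp_get_wtime
   use iso_fortran_env, only: real64
   implicit none
   real(real64) :: t_start, t_end, total
   integer :: i

   ! Wall-clock time in seconds, returned as a double precision value
   t_start = omp_get_wtime()

   ! Placeholder workload (hypothetical); replace with the call being timed
   total = 0.0_real64
   do i = 1, 10000000
      total = total + sqrt(real(i, real64))
   end do

   t_end = omp_get_wtime()

   write(*,'(a,f12.6,a)') 'elapsed  = ', t_end - t_start, ' s'
   ! Print the result so the compiler cannot discard the loop
   write(*,'(a,es14.6)') 'checksum = ', total
end program wtime_demo
```

The difference of the two readings is already in seconds with a fractional part, so no conversion is needed.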
Sorry for the slow response, it's been a busy day.
As a workaround that may be a good solution, but it doesn't really answer the question of whether there is a problem with the linux (or maybe just the Ubuntu) version of timef() or whether I'm doing something wrong in using it.
system_clock() with INT64 arguments is excellent on linux. On ifort Windows, OpenMP or MPI timers may be better. There is also a (non-portable) timer function in MKL.
Thanks Steve and Tim
Lawson Wakefield had also suggested system_clock. My routines were calling his so I had been talking to him about the linux optimisation of his routines.
It looks like none of the old hands use TIMEF() and perhaps there is a reason. The downside of system_clock and omp_get_wtime for someone who isn't using timing regularly is that they don't leap out at you when you search the Intel help documentation; timef was the only time function I could find which reported time in mSec.
SYSTEM_CLOCK is a standard Fortran intrinsic subroutine. Unlike functions such as timef(), it will operate the same on all implementations. Many of the "portability library" routines vary in details across platforms.
I guess that redefines portable ;-) I would have thought the whole point of the portability library was to provide the same experience across all platforms or, if not, at least to acknowledge any variations in the help notes.
I did look at system_clock in help, but as all the outputs were integers and I was trying for a quick fix, it wasn't immediately obvious that it was able to count to mSec or even uSec, as my re-reading this morning suggests. Timef appeared to give me what I wanted straight out of the box.
The resolution of SYSTEM_CLOCK varies, but you can ask what it is (COUNT_RATE). Yes, it takes a bit of extra code to convert that into numbers with fractions. But, like timef, it still depends on how often the OS updates the system clock.
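As a rough sketch of that extra code, assuming 64-bit integer arguments (which typically give a finer count rate than default integers) and a made-up dummy workload, the conversion to fractional seconds could look like this:

```fortran
program sysclock_demo
   use iso_fortran_env, only: int64, real64
   implicit none
   integer(int64) :: count_start, count_end, count_rate
   real(real64) :: elapsed, total
   integer :: i

   ! Ask the implementation how many counts make up one second
   call system_clock(count_rate=count_rate)

   call system_clock(count=count_start)

   ! Placeholder workload (hypothetical); replace with the call being timed
   total = 0.0_real64
   do i = 1, 10000000
      total = total + sqrt(real(i, real64))
   end do

   call system_clock(count=count_end)

   ! Convert the integer tick difference into seconds with a fraction
   elapsed = real(count_end - count_start, real64) / real(count_rate, real64)

   write(*,'(a,i0,a)') 'count_rate = ', count_rate, ' counts per second'
   write(*,'(a,f12.6,a)') 'elapsed    = ', elapsed, ' s'
   ! Print the result so the compiler cannot discard the loop
   write(*,'(a,es14.6)') 'checksum   = ', total
end program sysclock_demo
```

Printing count_rate on each platform shows what resolution the clock claims to offer there, although the values actually observed are still limited by how often the underlying clock is updated.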
The documentation of timef tells you what it does in Intel's implementation. I have seen (maybe not in the case of timef) other routines have differing interfaces and meanings across implementations, which is why I always prefer the Fortran intrinsics.
I think you're fooling yourself if you believe you'll be getting microsecond resolution out of timef.
RDTSC outputs in units of clock cycles (actually bus cycles) so it has resolution of a couple of nanoseconds. It's portable across platforms because every processor has a Time Stamp Counter. However, the code required to set it up varies with OS and processor family. Here's what worked for me with gfortran on ubuntu, plagiarizing code from
With the help of web pages like
module rdtsc_mod
   use ISO_C_BINDING
   implicit none
! We will not export anything but the pointer to the rdtsc function
   private
! Interface for rdtsc function
   abstract interface
      function rdtsc_iface() bind(C)
         import
         implicit none
         integer(C_INT64_T) rdtsc_iface
      end function rdtsc_iface
   end interface
! Define pointer to rdtsc function and initialize to point
! at initialization function
   procedure(rdtsc_iface), pointer, public :: rdtsc => rdtsc_init
! Typedef for off_t
   integer, parameter :: POSIX_OFF_T = C_LONG
! Constants required for mmap and mprotect
! Values used by gcc/ubuntu
   integer(C_INT), parameter :: &
      PROT_READ = int(Z'01',C_INT), &
      PROT_WRITE = int(Z'02',C_INT), &
      PROT_EXEC = int(Z'04',C_INT), &
      MAP_PRIVATE = int(Z'0002',C_INT), &
      MAP_ANONYMOUS = int(Z'0020',C_INT)
   type(C_PTR), parameter :: MAP_FAILED = transfer(-1_C_INTPTR_T,C_NULL_PTR)
! Interfaces for mmap and mprotect
   interface
      function mmap(addr,length,prot,flags, &
         fd,offset) bind(C,name='mmap')
         import
         implicit none
         type(C_PTR) mmap
         type(C_PTR), value :: addr
         integer(C_SIZE_T), value :: length
         integer(C_INT), value :: prot
         integer(C_INT), value :: flags
         integer(C_INT), value :: fd
         integer(POSIX_OFF_T), value :: offset
      end function mmap
      function mprotect(addr,len,prot) bind(C,name='mprotect')
         import
         implicit none
         integer(C_INT) mprotect
         type(C_PTR), value :: addr
         integer(C_SIZE_T), value :: len
         integer(C_INT), value :: prot
      end function mprotect
   end interface
   contains
! Initialization procedure for rdtsc. It will be called on the
! first invocation of rdtsc and sets up our real rdtsc function
      function rdtsc_init() bind(C)
         integer(C_INT64_T) rdtsc_init
! Machine code for 32-bit function
         integer(C_INT8_T), target :: BAD_STUFF_32(3)
         data BAD_STUFF_32 / &
            Z'0F', Z'31', & ! rdtsc
            Z'C3' / ! ret
! Machine code for 64-bit function
         integer(C_INT8_T), target :: BAD_STUFF_64(10)
         data BAD_STUFF_64 / &
            Z'0F', Z'31', & ! rdtsc
            Z'48', Z'C1', Z'E2', Z'20', & ! shl rdx, 32
            Z'48', Z'09', Z'D0', & ! or rax, rdx
            Z'C3' / ! ret
! Pointer to machine code appropriate to address size
         integer(C_INT8_T), pointer :: code_ptr(:)
! Size of machine code
         integer(C_SIZE_T) code_size
! Address the OS allocates for our function via mmap
         type(C_PTR) rdtsc_address
! Fortran pointer to write our function to
         integer(C_INT8_T), pointer :: rdtsc_code(:)
! Error status from mprotect
         integer(C_INT) status

! Point machine code pointer at code appropriate to
! address size and get code size
         if(bit_size(0_C_INTPTR_T) == 32) then
            code_ptr => BAD_STUFF_32
         else
            code_ptr => BAD_STUFF_64
         end if
         code_size = size(code_ptr,KIND=C_SIZE_T)
! Get writable address from OS to put our function in
         rdtsc_address = mmap( &
            addr = C_NULL_PTR, &
            length = code_size, &
            prot = iany([PROT_READ,PROT_WRITE,PROT_EXEC]), &
            flags = iany([MAP_PRIVATE,MAP_ANONYMOUS]), &
            fd = -1, &
            offset = 0_POSIX_OFF_T)
! If something goes wrong, abort
         if(transfer(rdtsc_address,0_C_INTPTR_T) == &
            transfer(MAP_FAILED,0_C_INTPTR_T)) then
            write(*,'(*(g0))') &
               'rdtsc_init failed in mmap'
            stop
         end if
! Get Fortran pointer to allocated memory and poke our
! function into it. Then mark it as executable
         call C_F_POINTER(rdtsc_address,rdtsc_code,[code_size])
         rdtsc_code = code_ptr
         status = mprotect( &
            addr = rdtsc_address, &
            len = code_size, &
            prot = iany([PROT_READ,PROT_EXEC]))
! If something goes wrong, abort
         if(status == -1) then
            write(*,'(*(g0))') &
               'rdtsc_init failed in mprotect'
            stop
         end if
! Point the function pointer at the function we just poked into memory
         call C_F_PROCPOINTER(transfer(rdtsc_address,C_NULL_FUNPTR), &
            rdtsc)
! We still have to return the TSC value for transparency
         rdtsc_init = rdtsc()
      end function rdtsc_init
end module rdtsc_mod

program hello3
   use rdtsc_mod
   use ISO_C_BINDING, only: C_INT64_T
   implicit none
   integer(C_INT64_T) t0(-1:10), tf(-1:10)
   integer i
   integer array(100)
   integer partials(10)
   interface
      function get_sum(array,upper)
         implicit none
         integer get_sum
         integer array(*), upper
      end function get_sum
   end interface

   array = [(i,i=1,size(array))]
   t0(-1) = rdtsc()
   write(*,'(*(g0))') 'Hello, world'
   tf(-1) = rdtsc()
   t0(0) = rdtsc()
   tf(0) = rdtsc()
   do i = 1, 10
      t0(i) = rdtsc()
      partials(i) = get_sum(array,10*i)
      tf(i) = rdtsc()
   end do
   write(*,'(*(g0))') 'Time for hello = ',tf(-1)-t0(-1)
   write(*,'(*(g0))') 'Time for rdtsc = ',tf(0)-t0(0)
   do i = 1, 10
      write(*,'(*(g0))') 'Partials(',i,') = ',partials(i),', time = ',tf(i)-t0(i)
   end do
end program hello3

function get_sum(array,upper)
   implicit none
   integer get_sum
   integer array(*)
   integer upper
   get_sum = sum(array(1:upper))
end function get_sum
The output was:
Hello, world
Time for hello = 154656
Time for rdtsc = 72
Partials(1) = 55, time = 612
Partials(2) = 210, time = 288
Partials(3) = 465, time = 324
Partials(4) = 820, time = 324
Partials(5) = 1275, time = 378
Partials(6) = 1830, time = 432
Partials(7) = 2485, time = 504
Partials(8) = 3240, time = 540
Partials(9) = 4095, time = 594
Partials(10) = 5050, time = 648
So it seemed to work. Does ifort have a built-in function equivalent to RDTSC?