QueryPerformanceCounter and OpenMP

Dishaw__Jim · ‎02-24-2007

Based on Microsoft's documentation, I thought QueryPerformanceCounter should work in a multiprocessor environment. When I have OpenMP enabled, I can't get a consistent time, e.g. I compute a negative elapsed time.
Has anyone used QueryPerformanceCounter with OpenMP enabled? Any suggestions?

jimdempseyatthecove · ‎02-25-2007

James D,

I have used QueryPerformanceCounter with no problems.

QueryPerformanceCounter returns integer(8) values. If the function that computes elapsed time from a snapshot of the start time and a snapshot of the end time uses less precision than that, you may have a problem.

Bad coding would convert the counts to reals first and then compute the elapsed time. Good coding would compute the delta from the difference of the integer(8) counts, then convert that difference (the elapsed ticks) to REAL(8), convert the ticks per second to REAL(8), and produce the run time in REAL(8) as elapseTicks / TicksPerSecond.
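The ordering described above can be sketched as follows. This is only an illustration: it uses the standard SYSTEM_CLOCK intrinsic as a portable stand-in for QueryPerformanceCounter and QueryPerformanceFrequency, and the program and variable names are made up for the example.

```fortran
program elapsed_demo
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
  integer(int64) :: count_start, count_end, count_rate, delta_ticks
  real(real64)   :: elapsed_seconds, checksum
  integer        :: i

  ! SYSTEM_CLOCK stands in for QueryPerformanceCounter/Frequency here.
  call system_clock(count_start, count_rate)

  ! Some work to time.
  checksum = 0.0_real64
  do i = 1, 10000000
     checksum = checksum + sqrt(real(i, real64))
  end do

  call system_clock(count_end)

  ! Subtract while the counts are still 64-bit integers, then convert
  ! only the (small) difference and the tick rate to real.
  delta_ticks     = count_end - count_start
  elapsed_seconds = real(delta_ticks, real64) / real(count_rate, real64)

  print '(a, es12.4)', 'elapsed seconds = ', elapsed_seconds
  print '(a, es22.15)', 'checksum        = ', checksum
end program elapsed_demo
```

Converting each raw count to a real first is risky because a large count may not fit in the mantissa, so the low-order ticks that carry the elapsed time can be rounded away before the subtraction.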

If you still have problems then check the intermediary values using the debugger.

Also, consider using the OMP_GET_WTIME() function, which is platform independent and essentially does what you want. On Windows it calls QueryPerformanceCounter at some point. If you are timing very short intervals then consider using QueryPerformanceCounter, otherwise use the OpenMP library function.
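A minimal sketch of timing a parallel loop with OMP_GET_WTIME() might look like this. The loop body and names are placeholders, and the program must be built with the compiler's OpenMP option (for example /Qopenmp with Intel Fortran on Windows or -fopenmp with gfortran).

```fortran
program wtime_demo
  use omp_lib, only: omp_get_wtime, omp_get_wtick
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  real(real64) :: t_start, t_end, total
  integer      :: i

  t_start = omp_get_wtime()

  ! Placeholder workload spread over the OpenMP threads.
  total = 0.0_real64
  !$omp parallel do reduction(+:total)
  do i = 1, 50000000
     total = total + 1.0_real64 / real(i, real64)
  end do
  !$omp end parallel do

  t_end = omp_get_wtime()

  print '(a, es12.4)', 'wall-clock seconds = ', t_end - t_start
  print '(a, es12.4)', 'timer resolution   = ', omp_get_wtick()
  print '(a, f12.6)',  'sum                = ', total
end program wtime_demo
```

The timer is read outside the parallel region, so a single thread observes the clock and the result is the wall-clock time for the whole team.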

Jim Dempsey

(from one James D to another)

jimdempseyatthecove · ‎02-25-2007

I forgot to mention: on multi-processor platforms, attempts are made to keep the performance counters synchronized among the processors. The synchronization can drift depending on configuration issues, for example when the processor clock speed is altered for thermal reasons.

Consider using code to assign the threads to run on specific processors. This is called setting processor affinity. If the threads do not move around, the synchronization of the performance counters is not an issue. Note that if you are timing all threads together instead of each thread separately, only the timing thread needs to be locked so that it observes one performance counter. Also note that if a synchronization occurs and the timing processor is affected, the timing data is less accurate. Therefore make several runs, throw out the best and worst times, and average the rest.
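The "throw out the best and worst, average the rest" step could be sketched like this. The module, the function name, and the five run times are invented for illustration; they are not measurements from this thread.

```fortran
module trimmed_timing
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
contains
  ! Average of the samples after discarding one minimum and one maximum.
  ! With fewer than three samples there is nothing to trim, so the plain
  ! mean is returned instead.
  function trimmed_mean(samples) result(avg)
    real(real64), intent(in) :: samples(:)
    real(real64) :: avg
    integer :: n

    n = size(samples)
    if (n < 3) then
       avg = sum(samples) / real(max(n, 1), real64)
    else
       avg = (sum(samples) - minval(samples) - maxval(samples)) &
             / real(n - 2, real64)
    end if
  end function trimmed_mean
end module trimmed_timing

program trimmed_demo
  use trimmed_timing
  implicit none
  ! Hypothetical wall-clock times (seconds) from five runs.
  real(real64) :: run_times(5)

  run_times = [1.02_real64, 0.98_real64, 1.00_real64, 1.75_real64, 0.61_real64]
  print '(a, f8.4)', 'trimmed mean (s) = ', trimmed_mean(run_times)
end program trimmed_demo
```

Here the 0.61 and 1.75 outliers are dropped and the remaining three runs are averaged.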

Jim Dempsey

TimP · ‎02-25-2007

__rdtsc() may give better resolution on shorter time intervals, subject to the same precautions which Jim has enumerated.

Both __rdtsc() and QueryPerformanceCounter may fail when the rate of the underlying clock varies (e.g. for power saving). Current Intel platforms (beginning with Nocona) avoid this problem, since __rdtsc() is actually based on the front side bus clock, even though it appears to count CPU clock ticks. It should be possible to measure elapsed time intervals from 1e-7 seconds to several hours.

jimdempseyatthecove · ‎02-25-2007

Thanks for the additional input Tim.

The O/S has to enable __rdtsc() from Ring 3 in order to make it available to a user application without the requirement of causing a Trap to the O/S. I do not know if this is default behavior for each O/S on which your application runs.

Say Tim, would you know if anything attached to the FSB can request a longer clock cycle, e.g. if you have weird memory, an ECC error, or an FSB device with unusual timing requirements?

For multiple cores in a package it is expected that they will share the same FSB. And I think for now, Intel multi-socket SMP systems share one FSB. But this may not hold true for much longer. Once multiple FSBs are employed, you fall into the synchronization problem again. A motherboard could be designed to have a master clock for all FSBs, provided that nothing on the bus can stretch a clock cycle for one of the buses but not the other(s).

Jim Dempsey

Dishaw__Jim · ‎02-25-2007

I believe I tracked down the problem--I forgot that all integers in Fortran are signed and that the 32nd bit of the lower 32 bits is the sign bit (I was using the LARGE_INTEGER type defined in ifwinty because the kernel32 module uses that type). On a related topic, is ETIME() a reliable method for getting elapsed processor time on a multicore system?
By the way, below is the working version of the code that uses LARGE_INTEGER--it probably would be easier to skip kernel32 and pass an INTEGER(KIND=8) to QueryPerformanceCounter.

FUNCTION read_timer()
  USE kinds
  USE kernel32, ONLY: QueryPerformanceCounter,QueryPerformanceFrequency
  USE ifwinty

  INTEGER(i8b) :: read_timer
  TYPE(T_LARGE_INTEGER) :: freq, time_hack
  INTEGER(i8b) :: timer_freq
  INTEGER(BOOL) :: rc

  ! Always get the frequency because it can change
  rc = QueryPerformanceFrequency(freq)
  rc = QueryPerformanceCounter(time_hack)

  ! The LARGE_INTEGER type provides storage for a signed 64-bit integer and
  ! it is constructed using two 32 bit integers.  To convert the
  ! LARGE_INTEGER type into one 64 bit in a portable fashion
  ! we need to do the following:
  ! 1) Multiply HighPart of LARGE_INTEGER by 2 ^ 32, which  shifts it to the 
  !    left by 32 bits.
  read_timer = time_hack%HighPart * 4294967296_i8b

  ! 2) Add the lower 31 bits of LowPart to the sum by masking out the 
  !    sign bit (AND 0x7FFFFFFF).  we need to ignore bit 32 because 
  !    Fortran thinks it is the sign bit (all Integers are signed in Fortran).
  read_timer = read_timer + IAND(time_hack%LowPart,Z'7FFFFFFF')

  ! 3) Handle the sign bit of LowPart by checking to see if it is set.
  !    If it is, add 2^31 to the sum
  IF(BTEST(time_hack%LowPart,31)) &
       read_timer = read_timer + 2147483648_i8b

  timer_freq = freq%HighPart * 4294967296_i8b
  timer_freq = timer_freq + IAND(freq%LowPart,Z'7FFFFFFF')
  IF(BTEST(freq%LowPart,31)) &
       timer_freq = timer_freq + 2147483648_i8b

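  ! Convert the timer ticks into microseconds (hence the 1000000).
  ! Scale whole seconds and the remainder separately: dividing by
  ! (timer_freq / 1000000) truncates the divisor and is zero below 1 MHz.
  read_timer = (read_timer / timer_freq) * 1000000_i8b &
       + (MOD(read_timer, timer_freq) * 1000000_i8b) / timer_freq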
END FUNCTION read_timer
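The sign-bit issue can also be handled in one step by widening LowPart to 64 bits and masking off the sign extension. The sketch below is standalone and does not call the Windows API; the two 32-bit variables are stand-ins for the LowPart and HighPart fields, and the test values are chosen only to exercise the sign bit.

```fortran
program combine_parts
  use, intrinsic :: iso_fortran_env, only: int32, int64
  implicit none
  integer(int32) :: low_part, high_part
  integer(int64) :: combined
  ! Mask that keeps only the low 32 bits of a 64-bit value.
  integer(int64), parameter :: low_mask = int(Z'FFFFFFFF', int64)

  ! Bit pattern FFFFFFFF in the low half: Fortran reads it as -1,
  ! while the unsigned DWORD it represents is 4294967295.
  low_part  = -1_int32
  high_part = 1_int32

  ! Widen first (which sign-extends), then mask away the extension,
  ! and add the high half shifted left by 32 bits.
  combined = shiftl(int(high_part, int64), 32) &
           + iand(int(low_part, int64), low_mask)

  print '(i0)', combined   ! prints 8589934591
end program combine_parts
```

This gives the same result as the three-step approach in read_timer, since 1 * 2^32 + 4294967295 = 8589934591.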

TimP · ‎02-26-2007

ETIME, generally speaking, is made available only for legacy compatibility. On the most common compilers, over the last 10 years, it duplicates the functionality of CPU_TIME. So, it usually attempts to report CPU time, not elapsed time. The resolution, at best, would be the same as CPU_TIME. For my own use, I write a function based on __rdtsc() which has the same calling data types as CPU_TIME(), so it is easy to switch.
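The idea of a timer with the same calling data types as CPU_TIME() might be sketched as below. This is not Tim's __rdtsc()-based routine: it uses SYSTEM_CLOCK as a portable stand-in, the names are hypothetical, and the saved base count is not protected against concurrent first calls from several threads.

```fortran
module wall_clock_mod
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
  private
  public :: wall_time
  logical, save        :: have_base = .false.
  integer(int64), save :: base_ticks = 0_int64   ! count at the first call
contains
  ! Same argument type as CPU_TIME, but reports elapsed (wall-clock)
  ! seconds since the first call rather than CPU seconds.
  subroutine wall_time(time)
    real(real64), intent(out) :: time
    integer(int64) :: ticks, rate

    call system_clock(ticks, rate)
    if (.not. have_base) then
       base_ticks = ticks
       have_base  = .true.
    end if
    ! Difference taken in integers before converting to real.
    time = real(ticks - base_ticks, real64) / real(rate, real64)
  end subroutine wall_time
end module wall_clock_mod

program cpu_vs_wall
  use wall_clock_mod, only: wall_time
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  real(real64) :: c0, c1, w0, w1, total
  integer :: i

  call cpu_time(c0)
  call wall_time(w0)
  total = 0.0_real64
  do i = 1, 20000000
     total = total + sqrt(real(i, real64))
  end do
  call cpu_time(c1)
  call wall_time(w1)

  print '(a, es12.4)', 'CPU seconds     = ', c1 - c0
  print '(a, es12.4)', 'elapsed seconds = ', w1 - w0
  print '(a, es22.15)', 'checksum        = ', total
end program cpu_vs_wall
```

Because both routines take a single real argument, switching a timing call from one to the other is a one-word change.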

Jim's recommendation to treat the 64-bit integers as plain 8-byte integers avoids the complication of treating them as pairs of 32-bit integers. Why use a compiler, if you aren't willing to let it do the work? It would be a long time before you would have to worry about signed vs unsigned 64-bit integers, except that the generated code for signed integers is likely to be more efficient. As Jim suggested, taking the required differences of 64-bit integers, then using double precision code for further calculations, gives you reasonable efficiency.

jimdempseyatthecove · ‎02-26-2007

Examine this code:

! PerformanceCounter.f90

module PerformanceCounter
  use kernel32

  ! Performance counter information
  type T_LARGE_INTEGER_OVERLAY
    union
      map
        type(T_LARGE_INTEGER) :: li
      end map
      map
        integer(8) :: i8 = 0
      end map
    end union
  end type T_LARGE_INTEGER_OVERLAY

  type(T_LARGE_INTEGER_OVERLAY) :: PerformanceCounterFrequency_LARGE_INTEGER
  real(8) :: PerformanceCounterFrequency_real8

  type T_PERFORMANCECOUNTER
    type(T_LARGE_INTEGER_OVERLAY) :: CountStart
    type(T_LARGE_INTEGER_OVERLAY) :: CountEnd
    real(8) :: RunTimeInSeconds = 0.
  end type T_PERFORMANCECOUNTER

contains

  ! PerformanceCounterInit
  ! Call once at program initialization
  ! Determine the Performance Counter Frequency
  ! This assumes all processors use the same frequency
  subroutine PerformanceCounterInit
    integer(BOOL) :: bTrash
    ! Get tick frequency as T_LARGE_INTEGER
    bTrash = QueryPerformanceFrequency(PerformanceCounterFrequency_LARGE_INTEGER.li)
    ! Convert to real(8)
    PerformanceCounterFrequency_real8 = &
      dble(PerformanceCounterFrequency_LARGE_INTEGER.i8)
  end subroutine PerformanceCounterInit

  subroutine PerformanceCounterStart(PerformanceCounter)
    type(T_PERFORMANCECOUNTER) :: PerformanceCounter
    integer(BOOL) :: bTrash
    ! Reset RunTimeInSeconds to 0.
    PerformanceCounter.RunTimeInSeconds = 0.
    ! Read Performance Counter into PerformanceCountStart
    bTrash = QueryPerformanceCounter(PerformanceCounter.CountStart.li)
  end subroutine PerformanceCounterStart

  subroutine PerformanceCounterResume(PerformanceCounter)
    type(T_PERFORMANCECOUNTER) :: PerformanceCounter
    integer(BOOL) :: bTrash
    ! Read Performance Counter into PerformanceCountStart
    bTrash = QueryPerformanceCounter(PerformanceCounter.CountStart.li)
  end subroutine PerformanceCounterResume

  subroutine PerformanceCounterEnd(PerformanceCounter)
    type(T_PERFORMANCECOUNTER) :: PerformanceCounter
    integer(BOOL) :: bTrash
    bTrash = QueryPerformanceCounter(PerformanceCounter.CountEnd.li)
    ! compute and accumulate run time in seconds
    PerformanceCounter.RunTimeInSeconds = PerformanceCounter.RunTimeInSeconds &
      & + (dble(PerformanceCounter.CountEnd.i8 - PerformanceCounter.CountStart.i8) &
      & / PerformanceCounterFrequency_real8)
  end subroutine PerformanceCounterEnd

end module PerformanceCounter

---

You may notice that the PerformanceCounterStart subroutine zeros out what would ordinarily be the elapsed time. The purpose of doing it this way is to provide for PerformanceCounterResume.

Together, the routines let you pause timing through a section of code that you do not wish to include in the performance calculation: call PerformanceCounterEnd before the section and PerformanceCounterResume after it.

An example would be if you wanted to exclude the I/O time from the computational time.
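The start / end / resume pattern can be tried without the Windows API using a portable analogue. The sketch below is not the module above: it swaps in SYSTEM_CLOCK for QueryPerformanceCounter, and the type and routine names are invented for the example.

```fortran
module stopwatch_mod
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
  type :: stopwatch
     integer(int64) :: count_start = 0_int64
     real(real64)   :: run_time_in_seconds = 0.0_real64
  end type stopwatch
contains
  ! Zero the accumulated time and begin a new timed interval.
  subroutine stopwatch_start(sw)
    type(stopwatch), intent(inout) :: sw
    sw%run_time_in_seconds = 0.0_real64
    call system_clock(sw%count_start)
  end subroutine stopwatch_start

  ! Begin another timed interval, keeping the accumulated time.
  subroutine stopwatch_resume(sw)
    type(stopwatch), intent(inout) :: sw
    call system_clock(sw%count_start)
  end subroutine stopwatch_resume

  ! Close the current interval and add it to the accumulated time.
  subroutine stopwatch_end(sw)
    type(stopwatch), intent(inout) :: sw
    integer(int64) :: count_end, rate
    call system_clock(count_end, rate)
    sw%run_time_in_seconds = sw%run_time_in_seconds &
         + real(count_end - sw%count_start, real64) / real(rate, real64)
  end subroutine stopwatch_end
end module stopwatch_mod

program stopwatch_demo
  use stopwatch_mod
  implicit none
  type(stopwatch) :: sw
  real(real64) :: total
  integer :: i

  total = 0.0_real64
  call stopwatch_start(sw)
  do i = 1, 5000000
     total = total + sin(real(i, real64))
  end do
  call stopwatch_end(sw)

  ! I/O that should not count toward the compute time.
  print '(a, es22.15)', 'intermediate sum = ', total

  call stopwatch_resume(sw)
  do i = 1, 5000000
     total = total + cos(real(i, real64))
  end do
  call stopwatch_end(sw)

  print '(a, es12.4)', 'compute seconds (I/O excluded) = ', sw%run_time_in_seconds
end program stopwatch_demo
```

Only the two loops contribute to the reported time; the print between them falls outside both timed intervals.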

Jim Dempsey

Steven_L_Intel1 · ‎02-26-2007

Consider as an alternative using TRANSFER to "cast" the LARGE_INTEGER type to an INTEGER(8).
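A minimal sketch of the TRANSFER approach follows. The derived type here is a local stand-in with the same two 32-bit fields as T_LARGE_INTEGER, so the example runs without ifwinty, and the printed value assumes a little-endian machine such as x86.

```fortran
program transfer_demo
  use, intrinsic :: iso_fortran_env, only: int32, int64
  implicit none
  ! Local stand-in for ifwinty's T_LARGE_INTEGER (assumed layout:
  ! low 32 bits first, then high 32 bits).
  type :: large_integer_pair
     sequence
     integer(int32) :: LowPart
     integer(int32) :: HighPart
  end type large_integer_pair

  type(large_integer_pair) :: li
  integer(int64) :: ticks

  li%LowPart  = 1_int32
  li%HighPart = 2_int32

  ! Reinterpret the 8 bytes of the pair as one 64-bit integer.
  ! The mold (ticks) only supplies the result type and kind.
  ticks = transfer(li, ticks)

  print '(i0)', ticks   ! 8589934593 on a little-endian machine
end program transfer_demo
```

Using the destination variable itself as the mold keeps the kind of the result tied to the variable that receives it.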

Dishaw__Jim · ‎02-26-2007

I considered the UNION approach, however, I was not sure how portable it is between compilers (IIRC it is not in the language specification).

As for TRANSFER, I must admit I didn't realize that it even existed. The approach I ended up taking was defining an interface to QueryPerformanceCounter where an INTEGER(KIND=8) was passed (all the host platforms I am running on support KIND=8).

I'm not quite sure how I can coax __rdtsc to give me elapsed CPU time. From my understanding, the TSC returns wall clock time.

The reason for this whole endeavour is that my runtime (wall clock) is not scaling with the number of cores at the rate I would expect. My first cut at improving multiprocessor performance was to see what gains would be achieved through the Math Kernel Library (I have many BLAS calls and a linear system (a moderate case is an 8192x8192 sparse system) that is solved using the Direct Sparse Solver). As I changed OMP_NUM_THREADS, the wall clock time stayed constant even though I could see the work being distributed over the processors. I think what this is telling me is that the MKL calls constitute a small portion of the runtime.
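The interface actually used is not shown in the thread, so the following is only a guess at what such a declaration could look like. It assumes 64-bit Windows, where BIND(C) matches the API calling convention and kernel32 is linked by default; on 32-bit Windows the STDCALL convention would need a compiler-specific directive instead. The module and program names are hypothetical.

```fortran
module qpc_int64
  use, intrinsic :: iso_c_binding, only: c_int, c_int64_t
  implicit none
  interface
     ! BOOL QueryPerformanceCounter(LARGE_INTEGER *lpPerformanceCount)
     function QueryPerformanceCounter(count) &
          bind(C, name='QueryPerformanceCounter') result(ok)
       import :: c_int, c_int64_t
       integer(c_int64_t), intent(out) :: count
       integer(c_int) :: ok
     end function QueryPerformanceCounter

     ! BOOL QueryPerformanceFrequency(LARGE_INTEGER *lpFrequency)
     function QueryPerformanceFrequency(freq) &
          bind(C, name='QueryPerformanceFrequency') result(ok)
       import :: c_int, c_int64_t
       integer(c_int64_t), intent(out) :: freq
       integer(c_int) :: ok
     end function QueryPerformanceFrequency
  end interface
end module qpc_int64

program qpc_demo
  use qpc_int64
  use, intrinsic :: iso_c_binding, only: c_int, c_int64_t, c_double
  implicit none
  integer(c_int64_t) :: freq, t0, t1
  integer(c_int)     :: ok
  real(c_double)     :: total
  integer            :: i

  ok = QueryPerformanceFrequency(freq)
  ok = QueryPerformanceCounter(t0)
  total = 0.0_c_double
  do i = 1, 10000000
     total = total + sqrt(real(i, c_double))
  end do
  ok = QueryPerformanceCounter(t1)

  ! Integer difference first, then convert to seconds.
  print '(a, es12.4)', 'elapsed seconds = ', &
       real(t1 - t0, c_double) / real(freq, c_double)
  print '(a, es22.15)', 'checksum        = ', total
end program qpc_demo
```

Because LARGE_INTEGER is a union whose QuadPart member is a 64-bit integer, passing the address of a 64-bit integer is layout-compatible with what the API expects.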
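One way to check that reasoning is Amdahl's law, which the thread does not name but which models exactly this situation: if a fraction p of the run time is parallelized over n threads, the expected speedup is 1 / ((1 - p) + p / n). The sketch below tabulates it for a made-up fraction; the 10% figure is purely illustrative and is not a measurement of this application.

```fortran
program amdahl_demo
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  ! Hypothetical share of the run time spent in threaded library calls.
  real(real64), parameter :: parallel_fraction = 0.10_real64
  integer, parameter :: thread_counts(4) = [1, 2, 4, 8]
  real(real64) :: speedup
  integer :: k, n

  print '(a)', ' threads   speedup'
  do k = 1, size(thread_counts)
     n = thread_counts(k)
     ! Amdahl's law: the serial part (1 - p) is unaffected by n.
     speedup = 1.0_real64 / ((1.0_real64 - parallel_fraction) &
                             + parallel_fraction / real(n, real64))
     print '(i8, f10.4)', n, speedup
  end do
end program amdahl_demo
```

With a small parallel fraction the speedup barely moves as threads are added, which would look like a constant wall clock time in practice.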

Steven_L_Intel1 · ‎02-26-2007

You should check out Intel Thread Profiler. It is designed for just this sort of problem - to see what your threads are actually doing and where time is being wasted.

Dishaw__Jim · ‎02-26-2007

Just put in an order for it. Thanks for the tip

Steven_L_Intel1 · ‎02-26-2007

You're welcome. I had some training on Thread Profiler last year and I was impressed at the kind of information it could tease out of an application, including being able to take you directly to the source code of a call that was causing stalls. What often happens is that you may have multiple threads but a lot of the time is spent waiting for some event to occur, making the program effectively serial, or the other threads finish early, leaving the main thread to dominate the elapsed time.

jimdempseyatthecove · ‎02-26-2007

There are problems and benefits with each method of implementation.

TRANSFER is a specification of the language whereas UNION is an implementation feature. So TRANSFER will be better for portability issues.

The disadvantage of TRANSFER is you must be cognizant of the transformation everywhere you use the intrinsic function. For example, just what does the mold mean when you supply 0.0 or 1? Which size real is it? Which size integer is it?

UNION ties the transformation to the TYPE definition of the structure. Specify the type properly once, then everywhere the transformation (cast) is correct.
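The mold pitfall can be seen in a few lines. This sketch again uses a local stand-in for T_LARGE_INTEGER, and the results noted in the comments assume a little-endian machine and a compiler whose default INTEGER is 4 bytes.

```fortran
program mold_demo
  use, intrinsic :: iso_fortran_env, only: int32, int64
  implicit none
  ! Local stand-in for T_LARGE_INTEGER (assumed layout: low, then high).
  type :: large_integer_pair
     sequence
     integer(int32) :: LowPart
     integer(int32) :: HighPart
  end type large_integer_pair

  type(large_integer_pair) :: li

  li = large_integer_pair(1_int32, 2_int32)

  ! Mold 0 is a default integer. If that is 4 bytes, only the leading
  ! 4 bytes of li (LowPart) are kept and the high half is silently lost.
  print '(a, i0)', 'mold 0       : ', transfer(li, 0)

  ! Mold with an explicit 64-bit kind picks up all 8 bytes.
  print '(a, i0)', 'mold 0_int64 : ', transfer(li, 0_int64)
end program mold_demo
```

Under those assumptions the first line shows 1 and the second shows 8589934593, so the kind suffix on the mold decides whether the high half survives.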

Additionally, by definition in the Microsoft Platform SDK you know:

typedef union _LARGE_INTEGER {
  struct {
    DWORD LowPart;
    LONG HighPart;
  };
  struct {
    DWORD LowPart;
    LONG HighPart;
  } u;
  LONGLONG QuadPart;
} LARGE_INTEGER, *PLARGE_INTEGER;

And the underlying problem is that the interface to the Win32 QueryPerformanceCounter uses T_LARGE_INTEGER (without the UNION), whereas in this case it would be more suitable to use T_LONGLONG.

Using T_LARGE_INTEGER (without the UNION) is technically invalid. DWORD is not representable in FORTRAN, as FORTRAN does not comprehend the concept of unsigned integers. Therefore, requiring the use of TRANSFER also requires the use of an unsupported data type (DWORD).

The use of TRANSFER(unknown, known) is no different than an obfuscated CAST.

In the case of QueryPerformanceCounter the interface should be declared to match what it does (takes the address of an INTEGER(8)), as opposed to taking a pointer to a type that is unsuitable for use. Or alternatively, declare T_LARGE_INTEGER as INTEGER(8).

---- (enough of my brow beating) ----

Steve, is there anything planned by the standards committee to address issues such as unsigned integers and bit fields? It sure would be nice to bring Fortran up to the 1960's.

Jim Dempsey

Steven_L_Intel1 · ‎02-26-2007

In the particular case here, you know that the source is an 8-byte record that is in fact an integer(8). Given that the use of a Windows API limits portability somewhat, I see no better choice than TRANSFER. It has the advantage of being obvious what is happening at the point of use, whereas a non-standard UNION does not.

The standards committee is working on a "bits" feature for F2008. I am not familiar with the details - there is some discussion lately in comp.lang.fortran where some observe that it isn't really all that useful. The committee continues to decline to add unsigned types to the language.

jimdempseyatthecove · ‎02-26-2007

>>The standards committee is working on a "bits" feature for F2008. I am not familiar with the details - there is some discussion lately in comp.lang.fortran where some observe that it isn't really all that useful. The committee continues to decline to add unsigned types to the language.

And they must be experiencing a bad case of angst over a one-bit field, which (as signed) would have 0 and -1 as the only permitted values. Consider:

IF(aBit .eq. 0) aBit = 1

Would set -1 into aBit.

I expect bit fields to be deferred another 25 years.

Jim Dempsey