Has anyone used QueryPerformanceCounter with OpenMP enabled? Any suggestions?
James D,
I have used QueryPerformanceCounter with no problems.
QueryPerformanceCounter returns INTEGER(8) values. If the function that computes elapsed time from a snapshot of the start time and a snapshot of the end time uses less precision, then you may have a problem.
Bad coding would convert the counts to reals first and then compute the elapsed time. Good coding would compute the delta from the difference of the INTEGER(8) counts, then convert that difference (the elapsed ticks) to REAL(8), convert the ticks per second to REAL(8), and produce the run time in REAL(8) as elapseTicks / TicksPerSecond.
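The difference between the two orderings can be sketched in standard Fortran without calling the Windows API at all. The tick values and the 10 MHz frequency below are made-up numbers chosen for illustration, and `elapsed_seconds` is a hypothetical helper name, not something from kernel32. The "bad" path deliberately uses single precision, where two large counts round to values that are thousands of ticks apart.

```fortran
module elapsed_time_sketch
    use, intrinsic :: iso_fortran_env, only: int64, real64
    implicit none
    private
    public :: elapsed_seconds
contains
    ! Subtract the 64-bit counts first, then convert the (small) difference
    ! and the frequency to real(8) and divide.
    pure function elapsed_seconds(start_ticks, end_ticks, ticks_per_second) &
            result(seconds)
        integer(int64), intent(in) :: start_ticks, end_ticks, ticks_per_second
        real(real64) :: seconds
        integer(int64) :: elapsed_ticks
        elapsed_ticks = end_ticks - start_ticks
        seconds = real(elapsed_ticks, real64) / real(ticks_per_second, real64)
    end function elapsed_seconds
end module elapsed_time_sketch

program demo_elapsed
    use, intrinsic :: iso_fortran_env, only: int64, real32
    use elapsed_time_sketch, only: elapsed_seconds
    implicit none
    ! Hypothetical values: a 10 MHz counter and a 1000-tick interval.
    integer(int64), parameter :: ticks_per_second = 10000000_int64
    integer(int64), parameter :: start_ticks = 123456789012_int64
    integer(int64), parameter :: end_ticks = start_ticks + 1000_int64
    real(real32) :: lossy_seconds

    ! Bad: each count is rounded to single precision before subtracting,
    ! so the 1000-tick interval is smaller than the rounding step.
    lossy_seconds = (real(end_ticks, real32) - real(start_ticks, real32)) &
                    / real(ticks_per_second, real32)

    print *, 'integer difference first:', &
             elapsed_seconds(start_ticks, end_ticks, ticks_per_second)
    print *, 'converted to real first: ', lossy_seconds
end program demo_elapsed
```

The first line printed should be very close to 1.0E-4 seconds; the second depends on how the two counts happen to round and will not match it.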
If you still have problems then check the intermediary values using the debugger.
Also, consider using the OMP_GET_WTIME() function, which is platform independent and essentially does what you want. On Windows it calls QueryPerformanceCounter at some point. If you are timing very short intervals then consider using QueryPerformanceCounter; otherwise use the OpenMP library function.
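As a rough sketch of the OMP_GET_WTIME() route, the program below times one parallel loop from the master thread. The loop body and trip count are placeholders made up for the example; only `omp_get_wtime` and `omp_get_wtick` come from the OpenMP runtime library. It has to be built with OpenMP enabled (for example `/Qopenmp` with Intel Fortran on Windows or `-fopenmp` with gfortran).

```fortran
program wtime_sketch
    use omp_lib, only: omp_get_wtime, omp_get_wtick
    use, intrinsic :: iso_fortran_env, only: real64
    implicit none
    real(real64) :: t_start, t_end, total
    integer :: i

    ! Take both snapshots outside the parallel region, on one thread,
    ! so a single timer is observed.
    t_start = omp_get_wtime()

    total = 0.0_real64
    !$omp parallel do reduction(+:total)
    do i = 1, 10000000          ! placeholder workload
        total = total + sqrt(real(i, real64))
    end do
    !$omp end parallel do

    t_end = omp_get_wtime()

    print *, 'sum            =', total
    print *, 'elapsed (s)    =', t_end - t_start
    print *, 'timer tick (s) =', omp_get_wtick()
end program wtime_sketch
```

Comparing the elapsed time with the reported tick gives a feel for whether the interval is long enough for this timer or whether a finer counter is warranted.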
Jim Dempsey
(from one James D to another)
I forgot to mention: on multi-processor platforms, attempts are made to keep the performance counters synchronized among the processors. The synchronization can drift depending on configuration issues, an example of which might be the processor clock speed being altered for thermal considerations. Consider using code to assign the threads to run on specific processors. This is called setting processor affinity. If the threads do not move around, then the synchronization of the performance counters is not an issue. Note that if you are timing all threads together instead of each thread, then only the timing thread need be locked so that it observes one performance counter. Also note that if a synchronization occurs and the timing processor is affected, then the timing data is less accurate. Therefore make several runs, throw out the best and worst times, and average the rest.
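The "throw out the best and worst, average the rest" step can be sketched as a small helper. The name `trimmed_mean` and the sample run times are invented for illustration; the fallback for fewer than three runs (a plain average) is an assumption of this sketch, not something stated above.

```fortran
module trimmed_timing_sketch
    use, intrinsic :: iso_fortran_env, only: real64
    implicit none
    private
    public :: trimmed_mean
contains
    ! Average the run times after discarding one minimum and one maximum.
    ! With fewer than three runs there is nothing sensible to discard,
    ! so this sketch falls back to the plain average (0 for an empty array).
    pure function trimmed_mean(times) result(mean_time)
        real(real64), intent(in) :: times(:)
        real(real64) :: mean_time
        if (size(times) < 3) then
            mean_time = sum(times) / real(max(size(times), 1), real64)
        else
            mean_time = (sum(times) - minval(times) - maxval(times)) &
                        / real(size(times) - 2, real64)
        end if
    end function trimmed_mean
end module trimmed_timing_sketch

program demo_trimmed
    use, intrinsic :: iso_fortran_env, only: real64
    use trimmed_timing_sketch, only: trimmed_mean
    implicit none
    ! Made-up run times in seconds; 0.61 and 1.75 are the outliers.
    real(real64), parameter :: runs(5) = &
        [1.02_real64, 0.98_real64, 1.00_real64, 1.75_real64, 0.61_real64]
    print *, 'trimmed mean (s) =', trimmed_mean(runs)
end program demo_trimmed
```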
Jim Dempsey
__rdtsc() may give better resolution on shorter time intervals, subject to the same precautions which Jim has enumerated.
Both __rdtsc() and QueryPerformanceCounter may fail when the rate of the underlying clock varies (e.g. for power saving). Current Intel platforms (beginning with Nocona) avoid this problem, since __rdtsc() is actually based on the front side bus clock, even though it appears to count CPU clock ticks. It should be possible to measure elapsed time intervals from 1e-7 seconds to several hours.
Thanks for the additional input Tim.
The O/S has to enable __rdtsc() from Ring 3 in order to make it available to a user application without the requirement of causing a Trap to the O/S. I do not know if this is default behavior for each O/S on which your application runs.
Say Tim, would you know if anything attached to the FSB can request a longer clock cycle, e.g. if you have weird memory, an ECC error, or an FSB device with unusual timing requirements?
For multiple cores in a package it is expected that they will share the same FSB. And I think for now, Intel multi-socket SMP systems share one FSB. But this may not hold true for much longer. Once multiple FSBs are employed, you fall into the synchronization problem again. A motherboard could be designed to have a master clock for all FSBs, provided that nothing on the bus can stretch a clock cycle for one of the busses but not the other(s).
Jim Dempsey
By the way, below is the working version of the code that uses LARGE_INTEGER. It probably would be easier to skip kernel32 and pass an INTEGER(KIND=8) to QueryPerformanceCounter.
FUNCTION read_timer()
  USE kinds
  USE kernel32, ONLY: QueryPerformanceCounter, QueryPerformanceFrequency
  USE ifwinty
  INTEGER(i8b) :: read_timer
  TYPE(T_LARGE_INTEGER) :: freq, time_hack
  INTEGER(i8b) :: timer_freq
  INTEGER(BOOL) :: rc

  ! Always get the frequency because it can change
  rc = QueryPerformanceFrequency(freq)
  rc = QueryPerformanceCounter(time_hack)

  ! The LARGE_INTEGER type provides storage for a signed 64-bit integer and
  ! it is constructed using two 32-bit integers. To convert the
  ! LARGE_INTEGER type into one 64-bit integer in a portable fashion
  ! we need to do the following:
  ! 1) Multiply HighPart of LARGE_INTEGER by 2**32, which shifts it to the
  !    left by 32 bits.
  read_timer = time_hack%HighPart * 4294967296_i8b

  ! 2) Add the lower 31 bits of LowPart to the sum by masking out the
  !    sign bit (AND 0x7FFFFFFF). We need to ignore bit 32 because
  !    Fortran thinks it is the sign bit (all integers are signed in Fortran).
  read_timer = read_timer + IAND(time_hack%LowPart, Z'7FFFFFFF')

  ! 3) Handle the sign bit of LowPart by checking to see if it is set.
  !    If it is, add 2**31 to the sum.
  IF (BTEST(time_hack%LowPart, 31)) &
    read_timer = read_timer + 2147483648_i8b

  ! Same three steps for the frequency
  timer_freq = freq%HighPart * 4294967296_i8b
  timer_freq = timer_freq + IAND(freq%LowPart, Z'7FFFFFFF')
  IF (BTEST(freq%LowPart, 31)) &
    timer_freq = timer_freq + 2147483648_i8b

  ! Convert the timer ticks into microseconds (hence the 1000000).
  ! Whole seconds and the remainder are scaled separately: dividing by
  ! (timer_freq / 1000000) truncates the frequency to whole MHz and
  ! divides by zero when timer_freq is below 1 MHz.
  read_timer = (read_timer / timer_freq) * 1000000_i8b &
             + (MOD(read_timer, timer_freq) * 1000000_i8b) / timer_freq
END FUNCTION read_timer
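For anyone who wants to check the high/low arithmetic above without a Windows machine, here is a minimal standalone sketch in standard Fortran. `combine_parts` and `ticks_to_microseconds` are hypothetical helper names, and the values in the demo are made up; the masking is done after widening to 64 bits, which folds steps 2 and 3 into one operation.

```fortran
module large_integer_sketch
    use, intrinsic :: iso_fortran_env, only: int32, int64
    implicit none
    private
    public :: combine_parts, ticks_to_microseconds
contains
    ! Build a signed 64-bit value from a signed HighPart and a LowPart
    ! whose 32 bits are to be read as unsigned.
    pure function combine_parts(high_part, low_part) result(quad)
        integer(int32), intent(in) :: high_part, low_part
        integer(int64) :: quad
        integer(int64) :: low_unsigned
        ! Widen, then keep only the low 32 bits: maps -1 to 4294967295.
        low_unsigned = iand(int(low_part, int64), int(z'FFFFFFFF', int64))
        quad = int(high_part, int64) * 4294967296_int64 + low_unsigned
    end function combine_parts

    ! Integer ticks -> microseconds, scaling whole seconds and the
    ! remainder separately to avoid truncating the frequency.
    pure function ticks_to_microseconds(ticks, ticks_per_second) result(us)
        integer(int64), intent(in) :: ticks, ticks_per_second
        integer(int64) :: us
        us = (ticks / ticks_per_second) * 1000000_int64 &
           + (mod(ticks, ticks_per_second) * 1000000_int64) / ticks_per_second
    end function ticks_to_microseconds
end module large_integer_sketch

program demo_large_integer
    use, intrinsic :: iso_fortran_env, only: int32, int64
    use large_integer_sketch, only: combine_parts, ticks_to_microseconds
    implicit none
    ! LowPart with all bits set and HighPart zero is 2**32 - 1.
    print *, combine_parts(0_int32, -1_int32)
    ! 250000 ticks of a hypothetical 500 kHz counter is half a second.
    print *, ticks_to_microseconds(250000_int64, 500000_int64)
end program demo_large_integer
```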
ETIME, generally speaking, is made available only for legacy compatibility. On the most common compilers, over the last 10 years, it duplicates the functionality of CPU_TIME. So, it usually attempts to report CPU time, not elapsed time. The resolution, at best, would be the same as CPU_TIME. For my own use, I write a function based on __rdtsc() which has the same calling data types as CPU_TIME(), so it is easy to switch.
Jim's recommendation to treat the 64-bit integers as plain 8-byte integers avoids the complication of treating them as pairs of 32-bit integers. Why use a compiler, if you aren't willing to let it do the work? It would be a long time before you would have to worry about signed vs unsigned 64-bit integers, except that the generated code for signed integers is likely to be more efficient. As Jim suggested, taking the required differences of 64-bit integers, then using double precision code for further calculations, gives you reasonable efficiency.
Examine this code:
! PerformanceCounter.f90
module PerformanceCounter
    use kernel32

    ! Performance counter information
    type T_LARGE_INTEGER_OVERLAY
        union
            map
                type(T_LARGE_INTEGER) :: li
            end map
            map
                integer(8) :: i8 = 0
            end map
        end union
    end type T_LARGE_INTEGER_OVERLAY

    type(T_LARGE_INTEGER_OVERLAY) :: PerformanceCounterFrequency_LARGE_INTEGER
    real(8) :: PerformanceCounterFrequency_real8

    type T_PERFORMANCECOUNTER
        type(T_LARGE_INTEGER_OVERLAY) :: CountStart
        type(T_LARGE_INTEGER_OVERLAY) :: CountEnd
        real(8) :: RunTimeInSeconds = 0.
    end type T_PERFORMANCECOUNTER

contains

    ! PerformanceCounterInit
    ! Call once at program initialization
    ! Determine the Performance Counter Frequency
    ! This assumes all processors use the same frequency
    subroutine PerformanceCounterInit
        integer(BOOL) :: bTrash
        ! Get tick frequency as T_LARGE_INTEGER
        bTrash = QueryPerformanceFrequency(PerformanceCounterFrequency_LARGE_INTEGER.li)
        ! Convert to real(8)
        PerformanceCounterFrequency_real8 = &
            dble(PerformanceCounterFrequency_LARGE_INTEGER.i8)
    end subroutine PerformanceCounterInit

    subroutine PerformanceCounterStart(PerformanceCounter)
        type(T_PERFORMANCECOUNTER) :: PerformanceCounter
        integer(BOOL) :: bTrash
        ! Reset RunTimeInSeconds to 0.
        PerformanceCounter.RunTimeInSeconds = 0.
        ! Read Performance Counter into PerformanceCountStart
        bTrash = QueryPerformanceCounter(PerformanceCounter.CountStart.li)
    end subroutine PerformanceCounterStart

    subroutine PerformanceCounterResume(PerformanceCounter)
        type(T_PERFORMANCECOUNTER) :: PerformanceCounter
        integer(BOOL) :: bTrash
        ! Read Performance Counter into PerformanceCountStart
        bTrash = QueryPerformanceCounter(PerformanceCounter.CountStart.li)
    end subroutine PerformanceCounterResume

    subroutine PerformanceCounterEnd(PerformanceCounter)
        type(T_PERFORMANCECOUNTER) :: PerformanceCounter
        integer(BOOL) :: bTrash
        bTrash = QueryPerformanceCounter(PerformanceCounter.CountEnd.li)
        ! compute and accumulate run time in seconds
        PerformanceCounter.RunTimeInSeconds = PerformanceCounter.RunTimeInSeconds &
            & + (dble(PerformanceCounter.CountEnd.i8 - PerformanceCounter.CountStart.i8) &
            & / PerformanceCounterFrequency_real8)
    end subroutine PerformanceCounterEnd

end module PerformanceCounter
---
You may notice that the PerformanceCounterStart function zeros out what would ordinarily be the elapsed time. The purpose of doing it this way is to provide for PerformanceCounterResume.
The functions let you pause counting time through a section of code that you do not wish to be included in the performance calculation. An example would be if you wanted to exclude the I/O time from the computational time.
Jim Dempsey
I considered the UNION approach, however, I was not sure how portable it is between compilers (IIRC it is not in the language specification).
As for TRANSFER, I must admit I didn't realize that it even existed. The approach I ended up taking was defining an interface to QueryPerformanceCounter where an INTEGER(KIND=8) is passed (all the host platforms I am running on support KIND=8).
I'm not quite sure how I can coax __rdtsc to give me elapsed CPU time. From my understanding, the TSC returns wall clock time.
The reason for this whole endeavour is that my runtime (wall clock) is not scaling with the number of cores at the rate I would expect. My first cut at improving multiprocessor performance was to see what gains could be achieved through the Math Kernel Library (I have many BLAS calls and a linear system, a moderate case being an 8192x8192 sparse system, that is solved using the Direct Sparse Solver). As I changed OMP_NUM_THREADS, the wall clock time stayed constant even though I could see the work being distributed over the processors. I think what this is telling me is that the MKL calls constitute a small portion of the runtime.
There are problems and benefits with each method of implementation.
TRANSFER is part of the language specification whereas UNION is an implementation feature. So TRANSFER will be better for portability.
The disadvantage of TRANSFER is that you must be cognizant of the transformation everywhere you use the intrinsic function. For example, just what does the mold mean when you supply 0.0 or 1? Which size real is it? Which size integer is it?
UNION ties the transformation to the TYPE definition of the structure. Specify the type properly once, then everywhere the transformation (cast) is correct.
Additionally, by definition in the Microsoft Platform SDK you know
typedef union _LARGE_INTEGER {
    struct {
        DWORD LowPart;
        LONG HighPart;
    };
    struct {
        DWORD LowPart;
        LONG HighPart;
    } u;
    LONGLONG QuadPart;
} LARGE_INTEGER, *PLARGE_INTEGER;

And the underlying problem is that the interface to the Win32 QueryPerformanceCounter is using T_LARGE_INTEGER (without the UNION), whereas in this case it would be more suitable to use T_LONGLONG.
Using T_LARGE_INTEGER (without the UNION) is technically invalid. DWORD is not representable in FORTRAN, as FORTRAN does not comprehend the concept of unsigned integers. Therefore, requiring the use of TRANSFER also requires the use of an unsupported data type (DWORD).
The use of TRANSFER(unknown, known) is no different than an obfuscated CAST.
In the case of QueryPerformanceCounter the interface should be declared to match what it does (takes the address of an INTEGER(8)) as opposed to taking a pointer to a type that is unsuitable for use. Or alternatively declare T_LARGE_INTEGER as INTEGER(8).
---- (enough of my brow beating) ----
Steve, is there anything planned by the standards committee to address issues such as unsigned integers and bit fields? It sure would be nice to bring Fortran up to the 1960's.
Jim Dempsey
In the particular case here, you know that the source is an 8-byte record that is in fact an integer(8). Given that the use of a Windows API limits portability somewhat, I see no better choice than TRANSFER. It has the advantage of being obvious what is happening at the point of use, whereas a non-standard UNION does not.
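A minimal sketch of the TRANSFER approach, written so it runs without kernel32: `large_integer_parts` is a hypothetical stand-in for T_LARGE_INTEGER (two 32-bit components, low part first), and the result assumes a little-endian machine such as x86. Giving the mold an explicit kind (`0_int64`) settles the "which size integer is it" question at the point of use.

```fortran
module transfer_sketch
    use, intrinsic :: iso_fortran_env, only: int32, int64
    implicit none
    private
    public :: large_integer_parts, to_int64

    ! Hypothetical stand-in for T_LARGE_INTEGER; SEQUENCE keeps the two
    ! components in declaration order.
    type :: large_integer_parts
        sequence
        integer(int32) :: LowPart
        integer(int32) :: HighPart
    end type large_integer_parts
contains
    ! Reinterpret the 8 bytes of the record as one 64-bit integer.
    ! The mold 0_int64 fixes the result kind explicitly.
    pure function to_int64(li) result(quad)
        type(large_integer_parts), intent(in) :: li
        integer(int64) :: quad
        quad = transfer(li, 0_int64)
    end function to_int64
end module transfer_sketch

program demo_transfer
    use, intrinsic :: iso_fortran_env, only: int32
    use transfer_sketch, only: large_integer_parts, to_int64
    implicit none
    ! On a little-endian machine: LowPart with all bits set, HighPart zero.
    print *, to_int64(large_integer_parts(-1_int32, 0_int32))
end program demo_transfer
```

Unlike the multiply-and-mask version, no sign handling is needed here because the bits are copied unchanged.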
The standards committee is working on a "bits" feature for F2008. I am not familiar with the details; there has been some discussion lately in comp.lang.fortran where some observe that it isn't really all that useful. The committee continues to decline to add unsigned types to the language.
>>The standards committee is working on a "bits" feature for F2008. I am not familiar with the details; there has been some discussion lately in comp.lang.fortran where some observe that it isn't really all that useful. The committee continues to decline to add unsigned types to the language.
And they must be experiencing a bad case of angst over a one-bit field, which (as signed) would have 0 and -1 as its only permitted values. Consider
IF(aBit .eq. 0) aBit = 1
This would set -1 into aBit.
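Since Fortran has no bit fields, the effect can only be imitated. The sketch below reads the low `width` bits of an ordinary integer as a two's-complement field using the Fortran 2008 SHIFTL and SHIFTA intrinsics; `signed_field` is a hypothetical helper invented for this illustration, not a proposed language feature.

```fortran
module bit_field_sketch
    use, intrinsic :: iso_fortran_env, only: int32
    implicit none
    private
    public :: signed_field
contains
    ! Interpret the low `width` bits (1..32) of `raw` as a signed field:
    ! shift the field's top bit up to bit 31, then shift back arithmetically
    ! so that bit is copied into all higher positions.
    pure function signed_field(raw, width) result(field_value)
        integer(int32), intent(in) :: raw
        integer, intent(in) :: width
        integer(int32) :: field_value
        field_value = shifta(shiftl(raw, 32 - width), 32 - width)
    end function signed_field
end module bit_field_sketch

program demo_bit_field
    use, intrinsic :: iso_fortran_env, only: int32
    use bit_field_sketch, only: signed_field
    implicit none
    ! Storing 1 in a one-bit signed field and reading it back.
    print *, signed_field(1_int32, 1)
    ! The same stored value in a three-bit field keeps its sign.
    print *, signed_field(1_int32, 3)
end program demo_bit_field
```

With a width of 1 the only bit is the sign bit, so a stored 1 reads back as -1, which is the case described above.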
I expect bit fields to be deferred another 25 years.
Jim Dempsey
