Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

precision of CPU_Time and System_Clock

John_Campbell
New Contributor II
There have been a number of comments about the precision of the standard timing routines available in ifort.
I have written a simple example which differentiates between the apparent precision of SYSTEM_CLOCK, as reported by Count_Rate, and the actual precision evident in the sequence of values returned.
I have run this example on ifort ver 11.1, the version I have installed.
It shows that both CPU_TIME and SYSTEM_CLOCK advance only 64 times per second, which is very poor precision for the standard Fortran intrinsic timing routines.
Better precision is available (see QueryPerformanceCounter) and should be provided by these intrinsics.
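For reference, here is a minimal sketch of the kind of probe the attached example performs (illustrative only, not the attached code itself): it counts how many distinct SYSTEM_CLOCK values appear during one second of busy waiting, which exposes the real tick rate regardless of what Count_Rate claims.

program tick_probe
  implicit none
  integer(8) :: first, last, now, rate
  integer    :: ticks
  ! What the intrinsic claims its resolution is.
  call system_clock(first, rate)
  last  = first
  ticks = 0
  ! Busy-wait for one second, counting the distinct values actually seen.
  do
    call system_clock(now)
    if (now /= last) then
      ticks = ticks + 1
      last  = now
    end if
    if (now - first >= rate) exit
  end do
  print *, 'count_rate reported     :', rate
  print *, 'actual ticks per second :', ticks
end program tick_probe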

John
87 Replies
Bernard
Valued Contributor I

>>>Your elapsed time approach has the issue that on a desktop system it will be influenced by things such as the user moving the mouse (etc.) or other background operating system activities.>>>

The problem lies in the unpredictable and pseudorandom (from the programmer's point of view) behaviour of the scheduler.

John_Campbell
New Contributor II

Thanks for your feedback, especially Repeat Offender. I finally took your example code from last July and adapted my performance time testing to ifort (see attached).
I found the call info I needed in Kernel32.f90 and was able to generate a test routine for elapsed time and CPU time info.

Only QueryPerformanceCounter or RDTSC provide timing information finer than 64 ticks per second; all the Fortran intrinsics are very poor.
SYSTEM_CLOCK (an elapsed time counter) should be changed to use either of these routines. Although it reports a clock rate of 1,000,000, the reality is it should report 64!! (It should be prosecuted for misrepresentation.)

QueryPerformanceCounter and QueryPerformanceFrequency (2.6 million ticks per second) work well.
RDTSC has a tick rate equal to the processor clock rate (2.67 billion cycles per second) and works very well, but might have some problems.
These are both elapsed time counters.

I know of no CPU time accumulator that is updated more than 64 times per second.
I was prompted to do this review by the poor performance of the intrinsic routines offered by ifort. Where possible they should be improved.

I have not tested these routines for parallel operation.

John

Bernard
Valued Contributor I

>>>QueryPerformanceCounter and QueryPerformanceFrequency (2.6 million ticks per second) work well. RDTSC has a tick rate equal to the processor clock rate (2.67 billion cycles per second) and works very well, but might have some problems. These are both elapsed time counters>>>

You can use the RDTSC instruction or the QueryPerformanceCounter/QueryPerformanceFrequency timing functions; they are very accurate. The problem with RDTSC is its high latency, and the CPUID serialization can add even more latency, so RDTSC is not recommended for very short loops. Afaik QueryPerformanceCounter uses the HPET timer; if you are interested, here is an article about the HPET drawbacks: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=54ff7e595d763d894104d421b103a89f7becf47c

John_Campbell
New Contributor II

iliyapolak,

Thanks for your comment. What I have been trying to highlight is the poor performance of the Fortran intrinsic timers, both elapsed and CPU. Those provided have a resolution of 1/64th of a second. This is a very long time interval for a processor that is typically cycling in excess of 2 GHz. I have identified alternatives for elapsed time but not for CPU time.

I had trouble understanding the problems you have identified with RDTSC or QueryPerformanceCounter. I'm not sure how significant they are in comparison to the problem of using a timer accurate to 0.015 seconds. Certainly a timer that says the elapsed time for a program example is zero is not very helpful.
I'm not sure at what sampling frequency the problems you refer to become an issue. I would expect that being able to sample at, say, 10,000 times per second (which is once every 200,000 processor cycles on a 2 GHz machine) is not a high frequency, given what a DO loop can do in 200,000 cycles. The point I am trying to make is that you need to be able to identify activity at better than once every 40 million processor cycles, which is what 1/64th-second resolution provides.

Anyway, the code I have attached demonstrates how to access these more accurate timers. I would recommend that SYSTEM_CLOCK use RDTSC or QueryPerformanceCounter, although I have never found a call to get the clock rate of RDTSC, which would avoid the calibration approach. I actually keep a file c:\processor_mhz.txt which stores the value, to avoid the calibration loop.
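For anyone who does not want to dig through the attachment, a minimal sketch of calling QueryPerformanceCounter directly is below. It assumes a 64-bit Windows target, where BIND(C) matches the Win32 calling convention (on 32-bit Windows the STDCALL attribute and decorated names would be needed); the Kernel32.f90 module mentioned above supplies ready-made interfaces instead.

program qpc_demo
  use, intrinsic :: iso_c_binding, only: c_int, c_int64_t
  implicit none
  interface
    function QueryPerformanceCounter(count) bind(C, name='QueryPerformanceCounter')
      import :: c_int, c_int64_t
      integer(c_int) :: QueryPerformanceCounter
      integer(c_int64_t), intent(out) :: count   ! LARGE_INTEGER*
    end function QueryPerformanceCounter
    function QueryPerformanceFrequency(freq) bind(C, name='QueryPerformanceFrequency')
      import :: c_int, c_int64_t
      integer(c_int) :: QueryPerformanceFrequency
      integer(c_int64_t), intent(out) :: freq    ! ticks per second
    end function QueryPerformanceFrequency
  end interface
  integer(c_int64_t) :: t0, t1, freq
  integer(c_int) :: ok
  integer :: i
  real(8) :: s
  ok = QueryPerformanceFrequency(freq)
  ok = QueryPerformanceCounter(t0)
  s = 0d0
  do i = 1, 1000000            ! the work being timed
    s = s + sqrt(real(i,8))
  end do
  ok = QueryPerformanceCounter(t1)
  print *, 'elapsed seconds:', real(t1 - t0, 8) / real(freq, 8), s
end program qpc_demo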

John

Bernard
Valued Contributor I

Hi John

Thanks for your answer. Unfortunately I do not know Fortran, so I won't be able to offer a helping hand with anything related to programming in Fortran, but I will try to help with everything related to time measurement on the Windows platform.

 

TimP
Honored Contributor III

We've pointed out several times that the OpenMP (omp_get_wtime) and MPI timers are much better than the Windows Fortran intrinsics, while system_clock (with integer(8) arguments) is satisfactory in the Linux implementations of many compilers, including ifort.
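For anyone following along, the kind dependence is visible with a short probe; the values in the comments are what ifort on Windows has been reported to return, so treat them as illustrative rather than guaranteed.

program sc_kinds
  implicit none
  integer(4) :: rate4
  integer(8) :: rate8
  ! The reported resolution depends on the integer kind passed in.
  call system_clock(count_rate=rate4)   ! e.g. 10000 with ifort on Windows
  call system_clock(count_rate=rate8)   ! e.g. 1000000 with ifort on Windows
  print *, 'count_rate with integer(4):', rate4
  print *, 'count_rate with integer(8):', rate8
end program sc_kinds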

This situation presents a dilemma, with the implication that the timer intrinsics (other than possibly the OpenMP timers) can't be made portable between Windows and other operating systems.

On occasion, the Intel compiler team has met such challenges, as when the Microsoft __rdtsc intrinsic was added to Intel C and C++ for Linux as well as Windows.

Bernard
Valued Contributor I

>>>Thanks for your comment. What I have been trying to highlight is the poor performance of the Fortran intrinsic timers, both elapsed and CPU>>>

QueryPerformanceCounter, which probably acts as a user-mode wrapper over the HPET timer, should be able to achieve 3 ms time precision on the Windows 7 platform. The HPET timer supports periodic and aperiodic time measurement, which is signalled by the interrupt firing.

>>>I had trouble understanding the problems you have identified with RDTSC>>>

Sorry for not explaining it more clearly. RDTSC itself is no longer recommended on multithreaded and frequency-throttled CPUs. The main reasons are that your code can be scheduled to run on a different CPU, and that the operating system (ACPI) can lower the frequency of the CPU when the load is not significant, so only where the TSC is derived from the QPI clock is RDTSC accurate enough for time measurement. Microsoft recommends the HPET timer, which does not depend on the CPU. Moreover, there is also an issue when measuring very short code blocks, for example a few assembly instructions: in such a situation the measured code is shadowed by the longer latency of RDTSC and of the CPUID used for serialization, so you need to run your code hundreds or thousands of times in order to amortize the RDTSC and CPUID latency.

Bernard
Valued Contributor I

Here is a good article about the drawbacks of RDTSC and the usage of this instruction to measure the performance of a few instructions.

http://software.intel.com/en-us/forums/topic/306222

John_Campbell
New Contributor II

Tim,

I don't agree that the Windows implementation of SYSTEM_CLOCK must stay as it is simply because improving it would make it behave differently from Linux. There are many differences between the two operating systems already.

The Fortran standard defined SYSTEM_CLOCK and CPU_TIME to give a standard way of accessing these measures of performance. If OpenMP has identified better ways of providing this information, then the Intel implementation of the Fortran intrinsics should be improved to match. The point of these routines was to allow more standard and convenient coding, while you are suggesting we go back to a non-standard approach. This all takes effort from ifort users, effort that would be better spent on improved intrinsic routines.

I am not aware of the standard addressing multi-threading issues for these intrinsics, but the testing I have described has not approached that problem either.

My reason for investigating all this is that the results from SYSTEM_CLOCK showed that the elapsed time for the routine I tested was zero, which is not a very helpful result.

I hope you might reconsider so that all other Fortran users of ifort do not have to go to the effort I have.

John

Bernard
Valued Contributor I

>>>I don't agree that the Windows implementation of SYSTEM_CLOCK must stay as it is simply because improving it would make it behave differently from Linux. There are many differences between the two operating systems already>>>

I suppose that the hardware timers are the same under both OSs, and that the timers' registers as seen by both OSs are also the same.

TimP
Honored Contributor III

"results from SYSTEM_CLOCK showed that the elapsed time for the routine I tested was zero"

As others pointed out, the actual tick interval (not the count_rate) for SYSTEM_CLOCK on Windows is in the range 1/64 s to 0.01 s, so you may measure zero time for smaller intervals.  On Linux, SYSTEM_CLOCK will resolve intervals as small as microseconds when used correctly.  It's non-portable only in this sense of poor resolution on Windows.  I don't make the rules, and I agree entirely about the relative inconvenience of timing on Windows.

There are long-established benchmarks such as the Livermore Fortran Kernels which perform an analysis to find out how many repetitions are needed to get satisfactory timing accuracy.  That benchmark may take half an hour to run with a timer of 1/64 s resolution, or as little as 3 seconds using rdtsc.  rdtsc of course exhibits degraded synchronization among CPUs, in spite of the best efforts of the OS.
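A sketch of that repetition analysis, in the spirit of (not copied from) the Livermore kernels; the kernel body and the 0.25 s threshold here are placeholders:

program calibrate
  implicit none
  integer(8) :: t0, t1, rate, reps, i
  real(8)    :: elapsed, s
  real(8), parameter :: min_interval = 0.25d0   ! well above a 1/64 s tick
  call system_clock(count_rate=rate)
  reps = 1
  do
    s = 0d0
    call system_clock(t0)
    do i = 1, reps
      s = s + sin(real(i,8))      ! stand-in for the kernel under test
    end do
    call system_clock(t1)
    elapsed = real(t1 - t0, 8) / real(rate, 8)
    if (elapsed >= min_interval) exit
    reps = reps * 2               ! double until the tick error is negligible
  end do
  ! Printing s keeps the compiler from deleting the loop.
  print *, 'reps:', reps, ' time/iteration (s):', elapsed / real(reps, 8), s
end program calibrate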

Benchmarks like lmbench, which want to measure cache timings or other events that may take only a few CPU clock cycles, must use more inconvenient techniques, with a total lack of synchronization between CPUs.

Bernard
Valued Contributor I

Because your routine ran too fast to be precisely measured by CPU_TIME. Even a raw Windows thread can run for a shorter period than the quantum, which is based on the clock interrupt.

John_Campbell
New Contributor II

Tim,

Your repeated justification of SYSTEM_CLOCK's unsuitably poor resolution does not convince me, and probably convinces few others.
It is ridiculous that this situation should persist.

SYSTEM_CLOCK should be changed to use either QueryPerformanceCounter or RDTSC, both of which are much more suitable than the existing implementation, which is probably based on GetTickCount.

John

Bernard
Valued Contributor I

I completely agree with you. By the way, if SYSTEM_CLOCK is really based on GetTickCount, then the time measurement can be biased by sleep and hibernation states.

Bernard
Valued Contributor I

There are two timing providers on the Windows platform: one is the so-called interrupt time and the other is the system time. It is not clear which one the Fortran timing routines use. For short performance measurements of code execution, the more accurate system time should be used. System time functionality is represented (accessed) by the user-mode QueryPerformanceCounter/QueryPerformanceFrequency pair of functions.

SergeyKostrov
Valued Contributor II
>>>...Even Win raw thread can run for shorter period than quantum which is based on clock interrupt...>>>

A test case with C/C++ sources, please! I really would like to see your test case that proves it.
Bernard
Valued Contributor I

And how can I predict that my thread will run for less than the quantum period when the thread executes in an unpredictable environment? If I set the lowest priority, how can I know and ensure that another, more privileged thread will be scheduled to run before the quantum tick expires? There is the option of creating another, more privileged thread, but can anyone be sure that this thread will preempt the first thread before the quantum expires? How can you be sure that a higher-priority interrupt will not fire and preempt all those threads? Maybe you can shed some light on it.

Calling Sleep(0) on the currently executing thread will stop the thread's execution before its quantum expires.

In pseudocode

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* GetCurrentThread returns a pseudo handle for the calling thread. */
    HANDLE currThHndl = GetCurrentThread();

    if (currThHndl == NULL) {
        printf("Error obtaining current thread handle 0x%lx\n", GetLastError());
        ExitProcess(0);
    }
    else
        printf("GetCurrentThread successfully called, current thread pseudo handle is %p\n", currThHndl);

    /* Calling Sleep with a zero argument: if successful, the thread
       relinquishes the remainder of its quantum and the highest-priority
       ready thread will run. */
    Sleep(0);

    return 0;
}

Btw, the sentence you quoted is taken from the Windows Internals book, 6th edition. And you can agree that this book was written by real experts on the Windows kernel.

Bernard
Valued Contributor I

@Sergey

What makes you think that some internal (kernel mode) OS mechanism (behaviour) can be exactly measured or estimated by user-mode client code?

SergeyKostrov
Valued Contributor II
>>>...the sentence quoted by you is taken from Windows Internals book in its 6 edition...>>>

Hold on, please. You're always taking something from books without providing C/C++ sources, implemented by you, that prove or disprove what you've said. Sorry, but I don't see any evidence that you're doing serious programming, and that is why you're quoting someone else's statements. Once again, my question is: could you prove it?

>>>...raw thread can run for shorter period than quantum which is based on clock interrupt...>>>

And what is a raw thread? Or what is not a raw thread? Could you explain it yourself? I'd like to see two C/C++ examples: a raw thread and a not-raw thread. Of course, many IDZ users (including me) also quote MSDN, articles and other docs. However, we do practical things, and we cannot be too theoretical all the time, because many IDZ users have practical issues or problems and need practical solutions.
SergeyKostrov
Valued Contributor II
>>>What makes you think that some internal (kernel mode) OS mechanism (behaviour) can be exactly measured or estimated by user-mode client code?>>>

I have not started that discussion, and please don't answer my question with another question until you've answered my initial one. There has to be a dialog when technical issues are discussed. You forget that in another thread I provided a complete test case to measure differences in the values returned by the RDTSC instruction executed from several threads, with accuracy of several nanoseconds (of course this is not absolutely accurate, but it will satisfy many performance measurement requirements), in a non-deterministic, non-realtime environment like Windows XP or Windows 7. Actually, there are already two threads related to that subject:

Forum topic: Synchronizing Time Stamp Counter
Web link: software.intel.com/en-us/forums/topic/332570
Note: A test case is attached to my post dated Tue, 11/06/2012 - 06:49

Forum topic: TSC Synchronization Across Cores
Web link: software.intel.com/en-us/forums/topic/388964
Bernard
Valued Contributor I

>>>Once again, my question is: could you prove it?>>>

Is that book not enough for you? Do I need to prove the implementation of the kernel scheduler? Maybe you should ask the authors of that book to prove the thread quantum question. So, according to your logic, I would need to prove every technical sentence in order to satisfy you.

A raw thread is a Windows thread; a non-raw thread could be, for example, a Java thread running on the Windows platform.

My code examples are not related to this discussion.

