Intel® Fortran Compiler

precision of CPU_Time and System_Clock

John_Campbell
New Contributor II
There have been a number of comments about the precision of the standard timing routines available in ifort.
I have written a simple example which differentiates between the apparent precision of system_clock, given by Count_Rate, and the actual precision available from the different values returned.
I have run this example on ifort Ver 11.1, which I have installed.
It shows that both CPU_TIME and SYSTEM_CLOCK update only 64 times per second, which is very poor precision for the Fortran standard intrinsic routines.
Better precision is available ( see QueryPerformanceCounter ) and should be provided in these intrinsic routines.
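
A minimal sketch of this kind of test, using only standard Fortran, is given here for illustration. It is not the attached program; the program and variable names are hypothetical. It polls SYSTEM_CLOCK for about one second and counts how many distinct values come back, which can then be compared with Count_Rate. It assumes the count does not wrap around during the run.

```fortran
program clock_precision
  ! Sketch: compare the apparent precision of SYSTEM_CLOCK (count_rate)
  ! with the number of distinct count values actually returned per second.
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
  integer(int64) :: rate, start, now, last, distinct

  call system_clock(count_rate=rate)
  call system_clock(count=start)
  last = start
  distinct = 0

  ! Poll for about one second (assumes no wrap-around of the count).
  do
     call system_clock(count=now)
     if (now /= last) then
        distinct = distinct + 1      ! the returned value changed
        last = now
     end if
     if (now - start >= rate) exit
  end do

  print '(a,i0)', 'Count_Rate (apparent ticks per second): ', rate
  print '(a,i0)', 'Distinct values returned in one second: ', distinct
end program clock_precision
```

On some compilers the kind of the integer arguments changes the reported Count_Rate, so it may be worth repeating the test with default integers as well.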

John

Bernard
Valued Contributor I

These are examples of threads that are not raw threads; in this case, Java threads.

SergeyKostrov
Valued Contributor II
>>Do I need to prove implementation of kernel scheduler.

No.

>>May you ask the authors of that book to prove thread quantum question.

I simply want to understand how it could be possible, if it is possible at all, regarding your statement about the execution of a Win32 thread. I'd like to bring clarity to your statement and nothing else (!).

Bernard
Valued Contributor I

Sorry for creating confusion.

I think that the sleep() function called with an argument of 0 can simulate such behaviour, where a thread is stopped before its quantum expires. One interesting question arises: at what exact moment within the quantum interval is the execution postponed, and how can that be controlled programmatically? If a new thread is created from within the main thread, has its priority raised to high, and is scheduled to run immediately after creation, then how (inside the thread's function), or rather when, should the sleep(0) call be executed in order, for example, to stop the execution after half of the quantum has expired?

John_Campbell
New Contributor II

I have again reviewed the information I have available on the accuracy of different timing routines for CPU or elapsed time. I have attached an updated set of Fortran calls for the 6 timing routines I have identified. I would recommend these as my best use of the identified API routines. Any recommendations for improvement would be appreciated.
The timing test program has been improved to test each routine for about 5 seconds.
RDTSC requires an initialising routine to estimate the returned tick frequency, which is the processor rate for my test machines.

I have identified 2 that are good for elapsed time: RDTSC and QueryPerformanceCounter.
All other routines update their time value only 64 times per second.
It would be good if there was a more accurate CPU time routine, but I have not found one. I should see what OpenMP uses!
Again, I would recommend that SYSTEM_CLOCK should be fixed in ifort so that we can reliably use the Fortran intrinsic routine.

The following table summarises the performance of the 6 routines I have identified.
Routine                 Ticks per  CPU cycles  Notes
                           second    per call
RDTSC                    88514093          30  ticks at processor rate, accuracy limited by call rate
QueryPerformanceCounter   2594669          47  possibly more robust than RDTSC
GetTickCount                   64          14  fast, but poor precision
system_clock                   64         325
GetProcessTimes                64         386  poor precision but best identified for CPU
CPU_Time                       64         387
Ticks per second : is the number of unique time values returned in a second ( best accuracy that can be achieved )
CPU cycles per call : is the number of processor cycles per routine call ( call overhead )
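
The call overhead column can be estimated along the following lines. This is a hedged sketch in standard Fortran rather than the attached test program, and the names are hypothetical. It times a large number of CPU_TIME calls with SYSTEM_CLOCK and reports seconds per call; multiplying by the processor clock rate in Hz (which has to be supplied separately) gives cycles per call. Where SYSTEM_CLOCK itself only updates 64 times per second, the loop count needs to be large enough that the run lasts many ticks.

```fortran
program call_overhead
  ! Sketch: estimate the cost of one CPU_TIME call by timing many calls.
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
  integer, parameter :: n = 1000000      ! number of calls to time
  integer(int64) :: c0, c1, rate
  real(real64)   :: t, per_call
  integer        :: i

  t = 0.0_real64
  call system_clock(c0, rate)
  do i = 1, n
     call cpu_time(t)                    ! routine under test
  end do
  call system_clock(c1)

  if (rate > 0) then
     per_call = real(c1 - c0, real64) / real(rate, real64) / real(n, real64)
     print '(a,es10.3,a)', 'cpu_time: about ', per_call, ' seconds per call'
  else
     print '(a)', 'No system clock available'
  end if
  print '(a,f0.3)', 'Last CPU_TIME value: ', t   ! keeps the calls from being optimised away
end program call_overhead
```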

John

( I am hoping the plain text preserves the courier font layout of the table)

SergeyKostrov
Valued Contributor II
>>Routine                 Ticks per  CPU cycles  Notes
>>                           second    per call
>>RDTSC                    88514093          30  ticks at processor rate, accuracy limited by call rate
>>QueryPerformanceCounter   2594669          47  possibly more robust than RDTSC
>>GetTickCount                   64          14  fast, but poor precision
>>system_clock                   64         325
>>GetProcessTimes                64         386  poor precision but best identified for CPU
>>CPU_Time                       64         387

Thanks, John! These numbers are really interesting. In 99% of cases GetTickCount satisfies the requirements I have.

Did you take into account that your test was executed in the non-deterministic environment of a Windows operating system? I simply wanted to say that, in order to make these measurements as accurate as possible, you need to boost the priority of your process to High or Realtime. In that case your process will preempt lower-priority threads currently executing on the system, and they won't affect the accuracy of the measurements. Also, Patrick Fay ( Intel ) recommends running such tests on a CPU other than the first one ( it is named CPU 0 in the Task Manager ).

Bernard
Valued Contributor I

Thank you John, great job.

@Sergey, returning to your question: I have a simple multithreaded Win32 program which uses the Sleep() function to stop its currently running thread, so such an action can simulate what I wrote in one of my previous posts. So far I have been unable to relinquish the CPU time at a chosen point during the quantum interval, and I do not know if it is possible.

John_Campbell
New Contributor II

Sergey,

For elapsed time, RDTSC is the best for me, as it takes 30 processor cycles and gives high precision ( 88 million ticks per second, which is the call rate ). While GetTickCount is faster to run ( only 14 processor cycles ), it has very poor precision ( 64 ticks per second ), so it is not useful for reporting short elapsed time tests.
I have not tested the accuracy of these timers, over a short or long duration. For the types of testing I do, this is not as significant, as there are many external distractions to the meaning of run times, such as interruptions from other processes. My aim has been to get an indication of relative elapsed times for different programming approaches.

That's elapsed time; however, when it comes to CPU time, the best has a precision of only 1/64 second. I cannot find anything with better precision.
When it comes to timing processes, and OpenMP coding, the elapsed time is what matters, while the CPU time to elapsed time ratio gives an indication of how many threads are effectively running simultaneously.
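
As an illustration of that ratio, here is a small sketch in standard Fortran (the program name, loop and workload are hypothetical). It measures CPU time with CPU_TIME and elapsed time with SYSTEM_CLOCK around a loop and prints CPU/elapsed. It assumes CPU_TIME reports process time summed over all threads; the `!$omp` lines are treated as plain comments when OpenMP is not enabled, in which case the ratio should be close to 1.

```fortran
program cpu_elapsed_ratio
  ! Sketch: CPU time / elapsed time as a rough count of busy threads.
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
  integer(int64) :: c0, c1, rate
  real(real64)   :: t0, t1, cpu, elapsed, s
  integer        :: i

  call system_clock(c0, rate)
  call cpu_time(t0)

  s = 0.0_real64
  !$omp parallel do reduction(+:s)
  do i = 1, 100000000
     s = s + sqrt(real(i, real64))       ! arbitrary work to be timed
  end do
  !$omp end parallel do

  call cpu_time(t1)
  call system_clock(c1)

  cpu = t1 - t0
  print '(a,es12.5)', 'Checksum     : ', s
  print '(a,f10.4)',  'CPU time     : ', cpu
  if (rate > 0 .and. c1 > c0) then
     elapsed = real(c1 - c0, real64) / real(rate, real64)
     print '(a,f10.4)', 'Elapsed time : ', elapsed
     print '(a,f10.4)', 'CPU/elapsed  : ', cpu / elapsed
  else
     print '(a)', 'Elapsed time too short for the clock precision'
  end if
end program cpu_elapsed_ratio
```

With a clock that ticks only 64 times per second, both times are quantised to about 0.016 s, so the ratio is only meaningful for runs that last much longer than that.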

Unfortunately I have not achieved very good ratios for the OpenMP programs I have been developing. While I can get multiple threads to run, I am getting clashes in other areas. I'm being told cache clashes are my latest problem, so an effective OpenMP solution using ifort Ver 2011 is a way off.

John

Bernard
Valued Contributor I

>>>While GetTickCount is faster to run ( only 14 processor cycles)>>>

Do you mean the total time needed to execute this function, from the user-mode stub through the switch to kernel mode?

SergeyKostrov
Valued Contributor II
[ Iliya wrote ]
>>...So far I was unable and I do not know if it is possible to relinquish the cpu time at some point during the quantum interval...

That's not a problem: a negative result is also a result, because it proves or disproves something. Thanks for the update.

SergeyKostrov
Valued Contributor II
Iliya,

I just looked at my sources and I found the following comment:
...
// Overhead of Sleep( 0 ): Debug~=1562 clocks / Release~=1525 clocks
...
So, it is clear that the CPU will do something during that period of time. Wouldn't it be better to discuss all that C/C++ stuff in another thread in a different forum?

John_Campbell
New Contributor II

iliyapolak,

There are a number of attributes of the timing routines I have investigated, including:
- How fast it runs: The number of processor cycles a call to the timing routine takes.
- How precise it is: How frequently the returned time measure is updated. This indicates how useful this timing can be for short duration events.
- How accurate it is: The accuracy of the reported time over a longer period. I have not concentrated on this aspect of performance.

My interest in how many processor cycles the call takes has not been concerned with what happens in the timing routine when it is called. Your discussion with Sergey about Kernel scheduler etc, which I understand is what is taking place in the timer routine, does not have a significant effect on the way I use these routines.

Over the last 20 years, processor rates have improved by over 1,000 times, from 1 MHz to 3 GHz. Unfortunately the precision of some timers has not matched this improvement, to the extent that they now give poor performance for what program developers require of them.

The purpose of my post has been to:
- Highlight the poor performance of the standard Fortran intrinsics available in ifort,
- Identify there are better alternatives for SYSTEM_CLOCK, which I hope could be adopted into ifort, and
- Point out that I have not been able to locate a better routine for CPU_TIME.

I was hoping that someone in this Forum might know a suitable routine and be able to provide a simple Fortran code example for ifort on how to use it. I remain hopeful someone might be able to help.

John

Bernard
Valued Contributor I

Hi John,

I am not questioning your findings; I only asked as a matter of interest.

Yes, I agree with you that a Fortran developer should not be concerned with the internal implementation of a timing routine. It is not their task. As for the precision of system timers, I think that the low precision could (in some cases) be directly related to the multimedia requirements of the modern OS and to system management (thread scheduling).

Bernard
Valued Contributor I

Sergey Kostrov wrote:

Iliya,

I just looked at my sources and I found the following comment:
...
// Overhead of Sleep( 0 ): Debug~=1562 clocks / Release~=1525 clocks
...
So, it is clear that the CPU will do something during that period of time. Wouldn't it be better to discuss all that C/C++ stuff in another thread in a different forum?

Yes, that is true. I think that at the time of the call to the sleep function, the calling thread could be put immediately into the standby state, or it could run for some minuscule time period until a scheduling decision is made. What I have been able to understand is that on a multiprocessor system the scheduler database is locked while the next runnable thread is being found. So during the long processing time of sleep() the database is locked and no other CPU can make a scheduling decision.

If you are interested I can create a new thread for this discussion, but which IDZ forum should I choose for it?

IanH
Honored Contributor III

I might be covering old ground here - but you mention your use of OMP.  On ifort the implementation of OMP_GET_WTIME uses QueryPerformanceCounter.
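
For reference, a minimal sketch of timing with the OpenMP routines from Fortran is shown below (the program name and workload are hypothetical). It uses only `omp_get_wtime` and `omp_get_wtick` from the standard `omp_lib` module and needs to be compiled with OpenMP enabled. The value returned by `omp_get_wtick` is the timer resolution the runtime reports, which is not necessarily the interval at which the returned time actually changes.

```fortran
program wtime_demo
  ! Sketch: elapsed-time measurement with the OpenMP wall-clock timer.
  use omp_lib, only: omp_get_wtime, omp_get_wtick
  implicit none
  double precision :: t0, t1, s
  integer :: i

  print '(a,es10.3,a)', 'Reported timer resolution: ', omp_get_wtick(), ' s'

  t0 = omp_get_wtime()
  s = 0.0d0
  do i = 1, 10000000
     s = s + sqrt(dble(i))               ! arbitrary work to be timed
  end do
  t1 = omp_get_wtime()

  print '(a,es12.5)', 'Checksum    : ', s
  print '(a,es10.3,a)', 'Elapsed time: ', t1 - t0, ' s'
end program wtime_demo
```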

There are differences in the requirements between SYSTEM_CLOCK and OMP_GET_WTIME in terms of their standard definitions. OMP_GET_WTIME is more relaxed in some ways (it is a thread-specific wall time), so that might be part of the reason for the different implementation. (I see mention of system bugs on the QueryPerformanceCounter MSDN page that would be problematic for SYSTEM_CLOCK.)

Further, Intel's docs ascribe a particular meaning to the zero SYSTEM_CLOCK time.  I suspect if they were to change their implementation from using GetLocalTime to QueryPerformanceCounter they might have to lose that meaning.  Not sure.  If that was the case, that could annoy some users relying on the previously documented behaviour.

Again, this might have already been covered (or be obvious from your table) but CPU_TIME is implemented by calling GetProcessTimes and summing the user and kernel time.  Given its definition I don't see how CPU_TIME could be implemented differently; then given the way the Windows scheduler works and the possibility for the program to have multiple threads on multiple processors, I think it is unrealistic to expect GetProcessTimes to have better precision than it does.

(The reason that GetTickCount is pretty snappy cycle wise is that the tick count is available in user space - no kernel mode transition there.)

Bernard
Valued Contributor I

>>>(The reason that GetTickCount is pretty snappy cycle wise is that the tick count is available in user space - no kernel mode transition there.)>>>

Yes, that's true. I have found a possible implementation of GetTickCount, and this function accesses the SharedUserData structure in its caller's process address space, hence the very fast execution time. I was simply confused by the existence of KeGetTickCount, which is used by drivers.

Thanks for the valuable information.

SergeyKostrov
Valued Contributor II
>>...I can create new thread for this discussion,but which IDZ forum to choose for it?..

Since this is not related to Intel software, it would be nice to create it in:

Watercooler Catchall
software.intel.com/en-us/forums/watercooler-catchall

TimP
Honored Contributor III

Sergey Kostrov wrote:

>>...I can create new thread for this discussion,but which IDZ forum to choose for it?..

Since this is Not related to Intel software it would be nice to create in:

Watercooler Catchall
software.intel.com/en-us/forums/watercooler-catchall

The threading forum is not necessarily restricted to Intel software if it concerns Intel platforms.

SergeyKostrov
Valued Contributor II
>>...Over the last 20 years, processor rates have improved by over 1,000 times from 1 mhz to 3 ghz. Unfortunately the precision
>>of some timers has not matched this improvement, to the extent that they now give poor performance for what program
>>developers require of them...

Here are results of three tests ( implemented in C with inline assembler ) on different CPUs:

Intel(R) Core i7-3840QM 2.80GHz ( 4 cores / Ivy Bridge )
...
Test-Case 1 - Overhead of RDTSC instruction
...
RDTSC Overhead Value: 24.000 clock cycles
...

Intel(R) Atom(TM) CPU N270 1.60GHz ( 2 cores / Atom )
...
Test-Case 1 - Overhead of RDTSC instruction
...
RDTSC Overhead Value: 24.000 clock cycles
...

Intel(R) Pentium(R) 4 CPU 1.60GHz ( 1 core / Pentium )
...
Test-Case 1 - Overhead of RDTSC instruction
...
RDTSC Overhead Value: 84.000 clock cycles
...

SergeyKostrov
Valued Contributor II
More detailed with a screenshot...

...
Test-Case 1 - Overhead of RDTSC instruction
REAL TIME TIME CRITICAL
RDTSC Overhead Value: 24.000 cycles

Test-Case 2 - Switching CPUs at runtime
Switched to CPU1 - Previous Thread AM: 255 - Error Code: 0
Switched to CPU1 - Previous Thread AM: 16 - Error Code: 0 - Thread Affinity: 1
Switched to CPU2 - Previous Thread AM: 1 - Error Code: 0 - Thread Affinity: 2
Switched to CPU3 - Previous Thread AM: 2 - Error Code: 0 - Thread Affinity: 4
Switched to CPU4 - Previous Thread AM: 4 - Error Code: 0 - Thread Affinity: 8
Switched to CPU5 - Previous Thread AM: 8 - Error Code: 0 - Thread Affinity: 16
Switched to CPU6 - Previous Thread AM: 16 - Error Code: 0 - Thread Affinity: 32
Switched to CPU7 - Previous Thread AM: 32 - Error Code: 0 - Thread Affinity: 64
Switched to CPU8 - Previous Thread AM: 64 - Error Code: 0 - Thread Affinity: 128

Test-Case 3 - Retrieving RDTSC values for CPUs - 1
RDTSC for CPU1 : 40122001028576
RDTSC for CPU2 : 40122001036608
RDTSC Difference: 8032 ( RDTSC2 - RDTSC1 )
dwThreadAMPrev1 : 128 ( Processing Error if 0 )
dwThreadAMPrev2 : 1 ( Processing Error if 0 )

Test-Case 4 - Retrieving RDTSC values for CPUs - 2
Threads 1 and 2 created
RDTSC values ( in CPU clocks ):
Iteration  Thread 1        Thread 2        Difference
00         40135961623344  40135961623763  -419
01         40135961623372  40135961623815  -443
02         40135961623400  40135961623851  -451
03         40135961623440  40135961623907  -467
04         40135961623468  40135961623935  -467
05         40135961623496  40135961623963  -467
06         40135961623544  40135961624003  -459
07         40135961623568  40135961624031  -463
08         40135961623596  40135961624055  -459
09         40135961623624  40135961624083  -459
10         40135961623664  40135961624123  -459
11         40135961623688  40135961624151  -463
12         40135961623716  40135961624183  -467
13         40135961623764  40135961624235  -471
14         40135961623800  40135961624263  -463
15         40135961623836  40135961624291  -455
Statistics:
Thread 1 started at 40135961623316
Thread 2 started at 40135961623511
Difference -195
Thread 1 completed at 40135961623896
Thread 2 completed at 40135961624335
Difference -439
dwThreadAMPrev[0]: 255 ( Processing Error if 0 )
dwThreadAMPrev[1]: 255 ( Processing Error if 0 )
...
rdtscoverhead.jpg

Bernard
Valued Contributor I

>>>Threading forum is not necessarily restricted to Intel software if it concerns Intel platforms>>>

Tim, do you mean the Threading Building Blocks forum?

SergeyKostrov
Valued Contributor II
By the way, did you read that article?

Nanosecond-precision Test
Web-link: zeromq.org/results:more-precise-0mq-tests