I have written a simple example which differentiates between the apparent precision of SYSTEM_CLOCK, as given by COUNT_RATE, and the actual precision available from the distinct values it returns.
I ran this example on ifort 11.1, which I have installed.
It shows that both CPU_TIME and SYSTEM_CLOCK advance only 64 times per second, which is very poor precision from the Fortran standard intrinsic routines.
Better precision is available (see QueryPerformanceCounter) and should be provided in these intrinsics.
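For reference, a minimal sketch of the approach (not my attached program) is shown below: it polls SYSTEM_CLOCK, counts how often the returned value actually changes over about one second, and compares that with the rate promised by COUNT_RATE.
[fortran]
! Minimal sketch: contrast the apparent precision reported by
! COUNT_RATE with the actual tick rate observed by polling.
program tick_rate
   implicit none
   integer :: count, rate, last, ticks, start
   real    :: seconds
   call system_clock (count, rate)
   print *, 'Apparent precision (COUNT_RATE):', rate, 'ticks/sec'
   call system_clock (start)
   last  = start
   ticks = 0
   do                                    ! poll for about one second
      call system_clock (count)
      if (count /= last) then            ! the returned value changed
         ticks = ticks + 1
         last  = count
      end if
      if (count - start >= rate) exit    ! one second has elapsed
   end do
   seconds = real (count - start) / real (rate)
   print *, 'Actual precision:', nint (ticks/seconds), 'ticks/sec'
end program tick_rate
[/fortran]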
John
>>...Over the last 20 years, processor rates have improved by over 1,000 times, from 1 MHz to 3 GHz. Unfortunately, the precision
>>of some timers has not matched this improvement, to the extent that they now give poor performance for what program
>>developers require of them.
I think that John is talking mainly about the system timers like RTC.
iliyapolak,
Of the 6 timers I tested, 4 are updated 64 times per second, which includes both Fortran intrinsics. That is once every 42 million processor cycles, which contrasts dramatically with the CPU clock rates reported in Sergey's recent posts. The following table summarises the performance I obtained from the program I attached recently, where:
Routine : the name of the timing routine being tested (Fortran intrinsic or API routine)
Elapse : the duration of the test
Ticks : the number of changes of time reported during the test
Calls : the number of times the routine was called during the test
Time1 : the time between tick cycles, timed with QueryPerformanceCounter
Time2 : the time between tick cycles, timed with the test routine
inc : the average increment in the timing value reported by the test routine
ticks/sec : the average ticks per second (time changes per second)
cpu : the estimated number of processor cycles the routine takes to return the time value
Note: my processor runs at 2.67 GHz, so the slowest (CPU_Time) returns its estimate in 0.15 microseconds, or 6.8 million calls per second. This is not a significant overhead for polling the time at a frequency of more than 64 cycles per second.
RDTSC looks like a much better timer for elapsed time estimates.
[plain]
Test Routine               Elapse      Ticks      Calls      time1      time2     inc  ticks/sec  cpu
   1 system_clock           5.070        324   40845007  1.565E-02  1.565E-02   15647         64  331
   2 CPU_Time               5.101        326   34510251  1.565E-02  1.565E-02   15648         64  394
   3 GetTickCount           6.287        402  887790725  1.564E-02  1.564E-02      15         64   18
   4 QueryPerformanceCoun   5.632   14625772  283723396  3.851E-07  3.853E-07       1    2595206   52
   5 GetProcessTimes        5.038        322   34090182  1.565E-02  1.565E-02  156486         64  394
   6 RDTSC API              5.289  459827584  459827584  1.150E-08  1.149E-08      30   87008169   30
[/plain]
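For anyone wanting to reproduce the Time1 column, here is a hedged sketch of binding QueryPerformanceCounter and QueryPerformanceFrequency from Fortran via ISO_C_BINDING. It assumes 64-bit Windows (where the stdcall/cdecl distinction disappears), and the wrapper qp_seconds is my own helper, not part of any API.
[fortran]
! Hedged sketch: bind the Windows high-resolution counter from Fortran.
! Assumes 64-bit Windows; kernel32 is linked by default with ifort.
module qpc_timer
   use iso_c_binding, only: c_int, c_int64_t
   implicit none
   interface
      function QueryPerformanceCounter (count) bind(C, name='QueryPerformanceCounter')
         import :: c_int, c_int64_t
         integer(c_int64_t), intent(out) :: count
         integer(c_int) :: QueryPerformanceCounter
      end function
      function QueryPerformanceFrequency (freq) bind(C, name='QueryPerformanceFrequency')
         import :: c_int, c_int64_t
         integer(c_int64_t), intent(out) :: freq
         integer(c_int) :: QueryPerformanceFrequency
      end function
   end interface
contains
   real(8) function qp_seconds ()        ! current counter value, in seconds
      integer(c_int64_t) :: count, freq
      integer(c_int) :: ok
      ok = QueryPerformanceFrequency (freq)
      ok = QueryPerformanceCounter (count)
      qp_seconds = real (count, 8) / real (freq, 8)
   end function qp_seconds
end module qpc_timer
[/fortran]
Timing the gap between two successive changes of a test routine's value with qp_seconds gives figures comparable to the Time1 column above.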
iliyapolak wrote:
>>>Threading forum is not necessarily restricted to Intel software if it concerns Intel platforms>>>
Tim, do you mean the Threading Building Blocks forum?
http://software.intel.com/en-us/forums/threading-on-intel-parallel-architectures
TimP (Intel) wrote:
Quote:
iliyapolak wrote: >>>Threading forum is not necessarily restricted to Intel software if it concerns Intel platforms>>> Tim, do you mean the Threading Building Blocks forum?
http://software.intel.com/en-us/forums/threading-on-intel-parallel-archi...
Thanks Tim.
This forum seems a good place for my thread. :)
>>>Nanosecond-precision Test
Web-link: zeromq.org/results:more-precise-0mq-tests>>>
Great article. Thanks for the link.
Hi John,
I think the 64 Hz rate of those four timers is directly related to real-time multimedia needs and to OS thread scheduling. For example, the thread quantum on client versions of Windows runs for two timer intervals, ~20 ms (it can be measured with the Clockres tool), so at that granularity the resolution of RDTSC is not needed. I think RDTSC is the better option when small portions of code, like loops, are measured.
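As an aside, and assuming I read the documentation correctly, the multimedia API timeBeginPeriod (from winmm) is the usual way an application raises that 64 Hz rate. A minimal interop sketch, assuming 64-bit Windows:
[fortran]
! Hedged sketch: request ~1 ms timer resolution via the winmm
! multimedia API. Link against winmm.lib; every timeBeginPeriod
! call must be paired with a matching timeEndPeriod.
program raise_timer_rate
   use iso_c_binding, only: c_int
   implicit none
   interface
      function timeBeginPeriod (uPeriod) bind(C, name='timeBeginPeriod')
         import :: c_int
         integer(c_int), value :: uPeriod
         integer(c_int) :: timeBeginPeriod
      end function
      function timeEndPeriod (uPeriod) bind(C, name='timeEndPeriod')
         import :: c_int
         integer(c_int), value :: uPeriod
         integer(c_int) :: timeEndPeriod
      end function
   end interface
   integer(c_int) :: rc
   rc = timeBeginPeriod (1_c_int)   ! ask for ~1000 ticks/sec
   ! ... timing-sensitive work here ...
   rc = timeEndPeriod (1_c_int)     ! restore the default rate
end program raise_timer_rate
[/fortran]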
Sergey Kostrov wrote:
In the article ( link is posted above ) there is a number for RDTSC overhead on an Intel Pentium 4 CPU:
...
Intel(R) Pentium(R) 4 3 GHz ... 33 ns
and if I normalize my result:
Intel(R) Pentium(R) 4 1.60GHz ... 84 clock cycles
to theirs ( with conversion from clock cycles to ns and GHz normalization ) then this is what I get:
Normalized: ( 1 / 1600000000 * 84 ) / ( 3.0 / 1.6 ) = ( 52.5 ns ) / ( 1.875 ) = 28 ns
There is ~15% difference in accuracy of measurements. Unfortunately, it is not clear how exactly they measured these numbers ( I could post my codes and they are very simple ).
RDTSC is so different on CPUs of recent years from what it was on the P4 that I don't see much relevance in comparisons with such old data. Evidently, on recent CPUs, RDTSC measures baseline clock ticks and multiplies them by the nominal multiplier, so it can be expected not to increment during a baseline clock interval. RDTSC may access a local core clock so as not to have an access latency dependent on which core accesses it, but of course if you take care to return to the same core for consistency you incur greater overhead. The latter might account for the behavior of Windows QueryPerformanceCounter.
Intel CPUs which have an ETC counter shared by all cores show a latency many times that of RDTSC.
Sergey,
If you are asking about the observed variability of:
- the call duration, or
- the increment in the time value between successive changes,
then yes, I have noted significant variability in my measurements of both. For example, in the following code, modified from my earlier attachment, the mean and standard deviation of the change in NUM can be calculated, and I have reported the cases where the d_num value deviates significantly from the average. These instances occur with all the timing routines, and some of the deviations are many standard deviations from the average.
I think there are multiple reasons for this, related to the system running many processes.
[fortran] do n = 1, n_lim
    call use_system_clock (time, num)   ! wrapper returning time (sec) and tick count
    if (num == last_num) cycle          ! wait until the tick value changes
    last_num = num                      ! last tick
    i = i + 1
    list(i) = num                       ! record the tick count at each change
    sec(i)  = time                      ! and the corresponding time
    if (i == size(list)) exit           ! stop when the buffer is full
 end do [/fortran]
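The follow-up analysis I describe is straightforward. A sketch of the mean/standard-deviation test, assuming list(1:i) has been filled as above; the 3-sigma cutoff is illustrative only:
[fortran]
! Sketch of the outlier test described above. Assumes list(1:i)
! holds the recorded tick counts.
real, allocatable :: d(:)
real    :: mean, sd
integer :: k

d    = real (list(2:i) - list(1:i-1))   ! increments between ticks (d_num)
mean = sum (d) / size (d)
sd   = sqrt (sum ((d - mean)**2) / size (d))
do k = 1, size (d)
   if (abs (d(k) - mean) > 3.0*sd) then
      write (*,*) 'tick', k, ': d_num =', d(k), ' mean =', mean, ' sd =', sd
   end if
end do
[/fortran]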
What I do not know is whether these variations affect the long-term accuracy of the time measure. In the case of elapsed time, that would depend on how the timer is calibrated and what the ultimate reference clock is. RDTSC could be a reference clock, but there might be others. Any attempt to identify better accuracy is probably not justified, as the elapsed-time performance of a process is probably more affected by other background processes, so it is all part of the noise.
In the case of CPU time, the value is an accumulation of estimates of when the process is running. This has been most relevant when trying to identify lost time, such as I/O interrupt delays, when I have tried to better manage buffered I/O performance. It is becoming less significant than elapsed time.
The optimisation problems I am attempting to solve are not at the level of a few processor cycles, but they are certainly shorter than 42 million processor cycles, which is the precision you can get from the Fortran intrinsics.
The short answer is: I have seen variability of performance in a number of areas, and I think most of it is due to shared processes. I am not convinced this variability is a significant problem.
John
PS: I hope this post gets delivered soon; my previous post was held for review, which made it less relevant to the later posts by the time it was finally released. Not sure what made it qualify for being stopped?
Also, why can't this Windows forum accept Windows cut and paste?
>>>I think there are multiple reasons for this, related to the system running many processes.>>>
I think such behaviour can be related to thread context switching, which may explain the variation in your measurements. Try running your thread at a higher priority level to reduce the frequency of context switches, and observe how the variation changes; a hedged sketch follows.
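A minimal interop sketch of raising the priority, assuming 64-bit Windows (the constant THREAD_PRIORITY_HIGHEST = 2 comes from winbase.h):
[fortran]
! Hedged sketch: raise the current thread's priority via the Win32 API.
! GetCurrentThread returns a pseudo-handle, so no CloseHandle is needed.
program raise_priority
   use iso_c_binding, only: c_int, c_intptr_t
   implicit none
   interface
      function GetCurrentThread () bind(C, name='GetCurrentThread')
         import :: c_intptr_t
         integer(c_intptr_t) :: GetCurrentThread
      end function
      function SetThreadPriority (hThread, nPriority) bind(C, name='SetThreadPriority')
         import :: c_int, c_intptr_t
         integer(c_intptr_t), value :: hThread
         integer(c_int), value :: nPriority
         integer(c_int) :: SetThreadPriority
      end function
   end interface
   integer(c_int), parameter :: THREAD_PRIORITY_HIGHEST = 2_c_int
   integer(c_int) :: ok
   ok = SetThreadPriority (GetCurrentThread (), THREAD_PRIORITY_HIGHEST)
   ! ... run the timing loop here, then repeat the variance analysis ...
end program raise_priority
[/fortran]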
>>>Also, why can't this windows forum accept windows cut and paste ?>>>
I have never experienced such an issue in my browser. By the way, I am using Firefox.
>>>I think this is expected because Windows interrupts my processing to do something else.>>>
I think this is related to servicing interrupts and to the scheduler code itself running.