Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

counting core cycle

karimfath
Beginner
2,332 Views

hello

Please, I want to calculate how many cycles the core spends executing a program (C + OpenMP).

thanks

0 Kudos
7 Replies
fb251
Beginner
2,332 Views

There's a specific instruction to read the internal counter:

rdtsc

The counter is returned in edx:eax. In 64-bit mode the same two 32-bit halves land in the low halves of rdx and rax, with the upper 32 bits of each register cleared.

I did some searching and found this example.

Best regards.

0 Kudos
TimP
Honored Contributor III
2,332 Views

The rdtsc instruction, on recent Intel CPUs, actually counts front side bus ticks multiplied by the clock multiplier. Otherwise, the count would break during CPU power management transitions. On my Core 2 Duo laptop, it appears to adjust appropriately so that it does report CPU cycles, regardless of whether it is running at full speed on AC power or at 40% speed on battery, but this is not true of all models.

On my Core 2 Duo desktop, it is necessary to disable EIST power management in order for the speed multiplier used by rdtsc to match the actual speed multiplier of the CPU. I don't observe the same problem on the Core 2 Duo laptop, nor on the i7 desktop.

On CPUs with turbo mode, it would be necessary to disable that mode, as the rdtsc counter would not speed up when the CPU does.

Only on Penryn and i7 (and AMD64 CPU) models does the rdtsc give results accurate to within a very few baseline cycles, which translates to less than 50 CPU clock cycles. The inaccuracy was well over 100 cycles on early P4, where it counted CPU cycles directly, and it was worse on early Athlon-32.

The __rdtsc() "intrinsic" is built in to Microsoft and Intel compilers for Xeon family CPUs. If you use the rdtsc instruction in gcc, in-line asm is required, and it differs between 32- and 64-bit mode.
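As a minimal sketch of the point above, the wrapper below reads the counter through the `__rdtsc()` intrinsic instead of hand-written assembly. It assumes an x86 target. `<intrin.h>` is the Microsoft header; newer GCC and Clang releases also expose `__rdtsc()` through `<x86intrin.h>`, so the in-line asm route is only needed with older gcc versions. The names `read_tsc` and `ticks_for` are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

#if defined(_MSC_VER)
#include <intrin.h>      // __rdtsc() on Microsoft compilers
#else
#include <x86intrin.h>   // __rdtsc() on newer GCC / Clang (x86 targets only)
#endif

// Hypothetical helper: return the 64-bit time stamp counter.
inline std::uint64_t read_tsc()
{
    return __rdtsc();
}

// Hypothetical helper: count the TSC ticks spent inside a callable.
// Note that these are TSC ticks, which need not equal core clock cycles
// when power management or turbo mode changes the core frequency.
template <typename Work>
std::uint64_t ticks_for(Work work)
{
    const std::uint64_t start = read_tsc();
    work();
    const std::uint64_t stop = read_tsc();
    // Assumption: the thread stayed on cores whose counters are synchronized.
    assert(stop >= start);
    return stop - start;
}
```

A call such as `ticks_for([] { my_kernel(); })`, with `my_kernel` standing in for the code under test, returns the tick count for that region. Very short regions are dominated by the overhead of the two reads.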

I have not encountered any difficulties in comparison of rdtsc counts among various cores on a single motherboard. On current CPUs, all cores in one socket share a common time base, and dual CPUs have their counters synchronized within a few hundred ticks at boot time. In principle, on i7 NUMA platforms, the power management states could vary among the sockets as well as between the core and uncore, with the latter presumably controlling rdtsc. So, it might be necessary to set affinity when comparing rdtsc counts, but I haven't run into that in practice. If that problem did arise, it might be preferable to consider HPET or the SSE4.2 version of rdtsc.

The portable way to measure performance in OpenMP is with omp_get_wtime(), which usually is based on an OS function such as gettimeofday(). When I have tried this, it gives times accurate to well within 1 millisecond, which doesn't match the ability of rdtsc to give accuracy within 50 to 150 clock ticks (several hundred if comparing multiple sockets).
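A short sketch of the omp_get_wtime() approach follows. The workload (`timed_sum_of_squares`) is a made-up example. The `#else` branch is only a stand-in so the file still compiles when OpenMP is not enabled; it is not part of OpenMP.

```cpp
#include <cassert>
#include <cstddef>

#ifdef _OPENMP
#include <omp.h>         // real omp_get_wtime() when built with OpenMP enabled
#else
#include <chrono>
// Stand-in used only when the file is built without OpenMP support.
static double omp_get_wtime()
{
    using clock = std::chrono::steady_clock;
    return std::chrono::duration<double>(clock::now().time_since_epoch()).count();
}
#endif

// Hypothetical workload: sum the squares of 0..n-1 with a parallel reduction
// and report the elapsed wall-clock seconds through elapsed_seconds.
double timed_sum_of_squares(std::size_t n, double& elapsed_seconds)
{
    const double start = omp_get_wtime();

    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (long long i = 0; i < static_cast<long long>(n); ++i) {
        sum += static_cast<double>(i) * static_cast<double>(i);
    }

    elapsed_seconds = omp_get_wtime() - start;
    assert(elapsed_seconds >= 0.0);
    return sum;
}
```

The result is wall-clock seconds, not cycles. omp_get_wtick() reports the timer resolution if you need to know how fine-grained the measurement can be.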

0 Kudos
karimfath
Beginner
2,332 Views

Thanks for your help, tim18.

If I understand correctly, there is no difference between the core times in a multicore processor because rdtsc uses the front side bus tick.

Is that true?

Thanks.

0 Kudos
TimP
Honored Contributor III
2,332 Views
Quoting - karimfath

I discussed the consistency of rdtsc among cores above, possibly at too much length. The cores on a single socket share the rdtsc generator. For multiple sockets, there is no guarantee for the future, but all current dual socket machines have a BIOS implementation which synchronizes rdtsc between sockets, within a few hundred reported cycles. As you suggest, since they share the FSB time base, they remain synchronized.

0 Kudos
Dmitry_Vyukov
Valued Contributor I
2,332 Views
Quoting - tim18

I discussed the consistency of rdtsc among cores above, possibly at too much length.

Too much length is not the problem... at least on a technical forum. Not enough length is usually the problem ;)

0 Kudos
jamiecook
Beginner
2,332 Views

I am using RDTSC to time parts of my code base and have run into the following scenario on a Core i7.

In order to convert the number of ticks into wall time, you need to know the CPU speed:

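[cpp]// Estimate the TSC rate in ticks per second by counting ticks across a timed sleep.
unsigned __int64 VlcTimer::calculateCpuSpeed(float sleepPeriodInSeconds /* = 1 */)
{
    unsigned __int64 t1 = RDTSC();
    Sleep(static_cast<DWORD>(1000 * sleepPeriodInSeconds)); // Sleep() takes milliseconds
    unsigned __int64 elapsedTicks = RDTSC() - t1;
    // Divide in double precision: float has only 24 mantissa bits, too few for ~3e9 ticks.
    return static_cast<unsigned __int64>(elapsedTicks / static_cast<double>(sleepPeriodInSeconds)); // in Hz
}

// Convert the accumulated tick count to milliseconds (truncated to a whole number).
unsigned __int64 VlcTimer::getAccumulatedTimeInMilliSec()
{
    return static_cast<unsigned __int64>(1000 * ((double) m_accumulator / (double) m_cpuSpeed));
}[/cpp]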
This generally works pretty well for determining how long something takes to execute. However, the test for this functionality looks as follows:

[cpp]BOOST_AUTO_TEST_CASE_OFF(TimerClockCycles)
{
	VlcTimer vlcTimer;
	vlcTimer.start(); 
	Sleep(100);
	vlcTimer.stop();
	BOOST_CHECK_CLOSE((float) vlcTimer.getAccumulatedTimeInMilliSec(), 100.0f, 0.01);  
}[/cpp]

This doesn't work so well. The problem is that the act of running the test ramps up the clock speed, so there are more ticks in a given 100 ms timeframe and the measurement consistently comes out slightly high.

I'm wondering if someone here has any suggestions for how to convert back to wall time when CPU frequency scaling is in effect?

0 Kudos
TimP
Honored Contributor III
2,332 Views

On Core i7, rdtsc uses a fixed multiplier on uncore ticks. In all tests I've been able to perform, it proceeds at a constant rate (3.192e9/sec) regardless of Turbo mode and the like.

A few years ago, it was common to check calibration of rdtsc by running a tight loop and comparing with gettimeofday() or the like.
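The calibration check described above might be sketched as follows. This is an illustration under assumptions, not a reference implementation: it targets x86, uses std::chrono::steady_clock as a portable stand-in for gettimeofday(), and the names `estimate_tsc_hz` and `ticks_to_ms` are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

#if defined(_MSC_VER)
#include <intrin.h>      // __rdtsc() on Microsoft compilers
#else
#include <x86intrin.h>   // __rdtsc() on newer GCC / Clang (x86 targets only)
#endif

// Hypothetical helper: estimate TSC ticks per second by spinning in a tight
// loop for window_ms milliseconds and comparing against a wall clock.
double estimate_tsc_hz(int window_ms)
{
    assert(window_ms > 0);
    using clock = std::chrono::steady_clock;
    const auto window = std::chrono::milliseconds(window_ms);

    const clock::time_point wall_start = clock::now();
    const std::uint64_t tsc_start = __rdtsc();

    // Busy-wait instead of sleeping so the core stays active for the whole window.
    clock::time_point wall_now = clock::now();
    while (wall_now - wall_start < window) {
        wall_now = clock::now();
    }
    const std::uint64_t tsc_stop = __rdtsc();

    const double seconds = std::chrono::duration<double>(wall_now - wall_start).count();
    return static_cast<double>(tsc_stop - tsc_start) / seconds;
}

// Convert a tick count to milliseconds using a previously estimated rate.
double ticks_to_ms(std::uint64_t ticks, double tsc_hz)
{
    return 1000.0 * static_cast<double>(ticks) / tsc_hz;
}
```

If the counter really advances at a constant rate, repeated calls to `estimate_tsc_hz` should agree closely whether or not the CPU has ramped up. Large differences between calls would suggest the counter follows the core frequency on that machine.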

0 Kudos