CPI, CPu, Hyperthreading (Core i7)

lkleen · ‎12-18-2008

I'm currently examining the impact of hyperthreading on different code snippets. Most tests produce the values I expected, but one test produces a confusing CPI.

[cpp]for ( int i = 0; i< npoints; i++ )
{
	double x, y;
		
	x = guess ();
	y = guess ();

	if ( sqrt (x*x + y*y) <= 1 )
	{
		inner++;
	}
}[/cpp]

This loop runs 134217728 times, distributed across 8 or 4 threads depending on the number of logical cores. With hyperthreading enabled I get a speed-up of 45%, with a CPu of 0.60 (8 logical cores) and 0.81 (4 logical cores). For CPI I measure 1.54 (8 logical cores) and 11.81 (4 logical cores).
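For anyone who wants to reproduce the setup, here is a minimal sketch of how the loop might be split across threads. The post does not show `guess()` or the threading code, so everything around the loop body is an assumption: `guess()` is taken to return a uniform random double in [0, 1), `std::thread` stands in for whatever threading API was actually used, and `count_inner`, `run_benchmark` and the seed values are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <random>
#include <thread>
#include <vector>

// Hypothetical stand-in for guess() from the post: a uniform random
// double in [0, 1), drawn from a generator owned by the calling thread.
double guess(std::mt19937_64& rng) {
    return std::uniform_real_distribution<double>(0.0, 1.0)(rng);
}

// Same loop body as in the post: count points inside the unit quarter circle.
std::uint64_t count_inner(std::uint64_t npoints, std::uint64_t seed) {
    std::mt19937_64 rng(seed);
    std::uint64_t inner = 0;
    for (std::uint64_t i = 0; i < npoints; ++i) {
        double x = guess(rng);
        double y = guess(rng);
        if (std::sqrt(x * x + y * y) <= 1.0) {
            ++inner;
        }
    }
    return inner;
}

// Split total_points across nthreads workers and sum the per-thread hits.
std::uint64_t run_benchmark(std::uint64_t total_points, unsigned nthreads) {
    assert(nthreads > 0);
    std::vector<std::uint64_t> hits(nthreads, 0);
    std::vector<std::thread> workers;
    const std::uint64_t chunk = total_points / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        // The last worker also takes the remainder of the division.
        const std::uint64_t n =
            (t + 1 == nthreads) ? total_points - chunk * t : chunk;
        // Each worker writes its own slot once, after its loop has finished.
        workers.emplace_back(
            [&hits, t, n] { hits[t] = count_inner(n, 1000 + t); });
    }
    for (auto& w : workers) {
        w.join();
    }
    std::uint64_t inner = 0;
    for (std::uint64_t h : hits) {
        inner += h;
    }
    return inner;
}
```

Calling `run_benchmark(134217728, 8)` and `run_benchmark(134217728, 4)` would correspond to the two configurations described above, and `4.0 * inner / total_points` gives the pi approximation as a sanity check. The real `guess()` may well have a different instruction mix, which would change the measured CPI.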

I suspected the high value for the 4-core test could be caused by loop stream detection (LSD), so I also examined the LSD activity. However, I don't measure any LSD.ACTIVE events when running the test, so I really don't know how to explain this high CPI value. What could cause it?

Thanks in advance.

Lars Kleen

TimP · ‎12-18-2008

Since non-vectorized sqrt() could be a big bottleneck on current Core i7 CPUs (slower than on Penryn), any useful speed-up from HT would depend on other functional units being able to get more work done. Not that I can see how this snippet could represent a useful scenario.

The CPI with HT enabled could be low because spin-wait instructions are being executed, or some similar issue. If that's your goal, maybe you have achieved it. Such instructions ought to show up in VTune, but probably should not be associated with the module you are profiling.

As HT would not increase the rate of execution of sqrt instructions, one would think any reduction in CPI would be due to execution of more instructions by other units.

I'm not a fan of comparisons based on CPI. For one thing, compilers which attempt to optimize for HT increase the number of "overhead" instructions executed by a far greater degree than they increase the rate of useful instructions, when comparing with a compiler not designed for HT. This is true even within the thread associated with the module of interest.

lkleen · ‎12-19-2008

The code isn't meant to do anything useful. It could be used to calculate an approximation of pi, but that's not what I'm after. I just want to examine the effect of HT on an algorithm where different functional units (floating point and integer) can be used simultaneously. As expected, it achieves a good speed-up of 45% in execution time, but the CPI value doesn't fit this speed-up.

I've used the same environment for different benchmarks where the ratio of micro-ops retired to instructions retired doesn't vary in this manner. I will run the same benchmark on another machine without loop stream detection and post the results later.

lkleen · ‎01-07-2009

Running the same benchmark on a Core 2 Quad processor resulted in a CPI value of 1.28 and a CPu value of 0.64.
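The micro-ops-per-instruction ratio mentioned earlier in the thread can be recovered from the two reported numbers. This is a small sketch, assuming CPI means cycles per retired instruction and CPu means cycles per retired micro-op (the thread never defines CPu explicitly); `uops_per_instruction` is a hypothetical helper name. Under that assumption the cycle counts cancel, leaving micro-ops per instruction.

```cpp
#include <cassert>

// Micro-ops retired per instruction retired, derived from the two ratios.
// Assumption: CPI = cycles / instructions and CPu = cycles / micro-ops,
// so CPI / CPu = micro-ops / instructions.
double uops_per_instruction(double cpi, double cpu) {
    assert(cpi > 0.0 && cpu > 0.0);
    return cpi / cpu;
}
```

With the figures posted in this thread, that works out to roughly 2.57 for the 8-thread Core i7 run (1.54 / 0.60), roughly 14.58 for the 4-thread Core i7 run (11.81 / 0.81), and 2.0 for the Core 2 Quad run (1.28 / 0.64). If the assumption about CPu holds, the 4-thread run stands out because of its retired instruction count relative to micro-ops, not because of its cycles per micro-op.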