more cycles per micro-instruction than cycles per instruction

lkleen · 06-03-2009

I'm currently profiling the behavior of a Core i7 CPU with some benchmarks taken from the technology journal article 'Hyper-Threading Technology: Impact on Compute-Intensive Workloads'. One of the benchmarks produces a result which I find hard to explain.

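[cpp]	virtual void process ()
	{
		// int32 is presumably a project typedef for a 32-bit signed integer.
		int32 inner = 0;	// points that fall inside the unit circle
		int32 result = 0;

		srand ( time ( NULL ) );

		for ( int32 i = 0; i < npoints; i++ )
		{
			// Random point in the square [-1, 1] x [-1, 1].
			double x = (((float) rand()) / RAND_MAX * 2) - 1;
			double y = (((float) rand()) / RAND_MAX * 2) - 1;

			// Count the point if it lies within distance 1 of the origin.
			if ( sqrt (x*x + y*y) <= 1 )
			{
				inner++;
			}
		}

		result += inner;
	}[/cpp]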

This snippet runs simultaneously on 4 cores with hyper-threading disabled. When profiling with VTune, I measure a CPI of 0.76 and a cycles-per-uop (CPuop) value of 0.80. The CPI comes from the built-in ratio; the CPuop value comes from a self-defined ratio ([pmn:CPU_CLK_UNHALTED.THREAD]/[pmn:UOPS_RETIRED.ANY]). The ratios for the other benchmarks come out as expected, so I don't think there is a misconfiguration. But since the CPU decodes each instruction into at least one micro-operation, there should be at least as many uops as instructions, and cycles per uop should never exceed CPI. This result should therefore be impossible. Any ideas?
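To make the apparent contradiction concrete, here is a small sketch of what the two ratios imply. Since CPI is cycles per instruction and the self-defined ratio is cycles per uop, dividing one by the other gives uops retired per instruction retired. The function name `uops_per_instruction` is made up for this illustration and is not part of VTune.

```cpp
#include <cassert>

// Hypothetical helper, not a VTune API.
// CPI   = cycles / instructions
// CPuop = cycles / uops
// =>  uops / instructions = CPI / CPuop
double uops_per_instruction(double cpi, double cycles_per_uop) {
    assert(cycles_per_uop > 0.0);  // guard against division by zero
    return cpi / cycles_per_uop;
}
```

With the measured values, `uops_per_instruction(0.76, 0.80)` comes out at roughly 0.95, i.e. fewer uops than instructions retired, which is exactly what looks impossible if every instruction yields at least one uop.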

thanks in advance,
Lars

Thomas_W_Intel · ‎06-06-2009

Quoting - lkleen

since the cpu decodes an instruction to at least 1 micro-operation this result should be impossible. Any ideas?

Lars,

There are some cases where two instructions are translated into only one micro-op, for example several combinations of a test or compare instruction followed by a conditional jump. This feature is called "macro-fusion" and could explain your observation.
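As a rough illustration of the counting effect, the sketch below uses a deliberately simplified toy model: every instruction counts as one uop, except that a compare or test immediately followed by a conditional jump counts as a single uop for the pair. The `Kind` enum and `count_uops_toy` are hypothetical names for this example; the real fusion rules are more restrictive, and real instructions may decode into several uops.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, simplified instruction classes (illustration only).
enum class Kind { Other, CmpOrTest, CondJump };

// Toy model: one uop per instruction, except that a compare/test directly
// followed by a conditional jump is counted as one fused uop for both.
std::size_t count_uops_toy(const std::vector<Kind>& insts) {
    std::size_t uops = 0;
    for (std::size_t i = 0; i < insts.size(); ++i) {
        ++uops;
        const bool fuses = insts[i] == Kind::CmpOrTest &&
                           i + 1 < insts.size() &&
                           insts[i + 1] == Kind::CondJump;
        if (fuses) {
            ++i;  // the conditional jump is absorbed into the same uop
        }
    }
    return uops;
}
```

Under this model a sequence such as `{Other, CmpOrTest, CondJump, Other}` has 4 instructions but only 3 uops, so the uop count can drop below the instruction count. That is enough to push cycles per uop above CPI, provided fused pairs outnumber the extra uops from instructions that decode into more than one.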

Kind regards

Thomas