Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

What causes the retired instructions to increase?

Beginner
275 Views

This is a re-post from stack-overflow: http://stackoverflow.com/questions/23020148/what-causes-the-retired-instructions-to-increase

I have a 496*O(N^3) loop. I am performing a blocking optimization technique where I'm operating 2 images at a time instead of 1. In raw terms, I am unrolling the outer loop. (The non-unrolled version of the code is as shown below: ) b.t.w I'm using Intel Xeon X5365 machine that has 8 cores and it has 3GHz clock, 1333MHz bus frequency, Shared 8MB L2( 4 MB shared between every 2 core), L1-I 32KB,L1-D 32KB .

```for(imageNo =0; imageNo<496;imageNo++){
for (unsigned int k=0; k<256; k++)
{
double z = O_L + (double)k * R_L;
for (unsigned int j=0; j<256; j++)
{
double y = O_L + (double)j * R_L;

for (unsigned int i=0; i<256; i++)
{
double x[1] = {O_L + (double)i * R_L} ;
double w_n =  (A_n[2] * x[0] + A_n[5] * y + A_n[8] * z + A_n[11])  ;
double u_n =  ((A_n[0] * x[0] + A_n[3] * y + A_n[6] * z + A_n[9] ) / w_n);
double v_n =  ((A_n[1] * x[0] + A_n[4] * y + A_n[7] * z + A_n[10]) / w_n);

for(int loop=0; loop<1;loop++)
{
px_x[loop] = (int) floor(u_n);
px_y[loop] = (int) floor(v_n);
alpha[loop] = u_n - px_x[loop] ;
beta[loop]  = v_n - px_y[loop] ;
}
///////////////////(i,j) pixels ///////////////////////////////

else
pixel_1[0] = 0.0;

else
pixel_1[2] = 0.0;

/////////////////// (i+1, j) pixels/////////////////////////

else
pixel_1[1] = 0.0;

else

pixel_1[3] = 0.0;

pix_1 = (1.0 - alpha[0]) * (1.0 - beta[0]) * pixel_1[0] + (1.0 - alpha[0]) * beta[0]  * pixel_1[1]

+  alpha[0]  * (1.0 - beta[0]) * pixel_1[2]   +  alpha[0]  *  beta[0]  * pixel_1[3];

f_L[k * L * L + j * L + i] += (float)(1.0 / (w_n * w_n) * pix_1);
}
}

}
}```

I profiled the results using Intel Vtune-2013 (Using binary created from gcc-4.1) and I can see that there is 40% reduction in memory bandwidth usage which was expected because 2 images are being processed for every iteration.(f_L store operation causes 8 bytes of traffic for every voxel). This accounts to 11.7% reduction in bus cycles! Also, since the block size is increased in the inner loop, the resource stalls decrease by 25.5%. These 2 accounts for 18% reduction in response time. The mystery question is, why are instruction retired increased by 7.9%? (Which accounts for increase in response time by 6.51%) - Possible reason I could this of is: 1. Since the number of branch instructions increase inside the block (and core architecture has 8 bit global history) retired branch instruction increased by 2.5%( Although, mis-prediction remained the same! I know, smells fishy right?!!). But I am still missing answer for the rest 5.4%! Could anyone please shed me light in any direction? I'm completely out of options and No way to think. Thanks a lot!!