Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

What causes the retired instructions to increase?

Shiv_Inside
Beginner
812 Views

This is a re-post from stack-overflow: http://stackoverflow.com/questions/23020148/what-causes-the-retired-instructions-to-increase

I have a 496*O(N^3) loop. I am performing a blocking optimization technique where I'm operating 2 images at a time instead of 1. In raw terms, I am unrolling the outer loop. (The non-unrolled version of the code is as shown below: ) b.t.w I'm using Intel Xeon X5365 machine that has 8 cores and it has 3GHz clock, 1333MHz bus frequency, Shared 8MB L2( 4 MB shared between every 2 core), L1-I 32KB,L1-D 32KB .

for(imageNo =0; imageNo<496;imageNo++){
for (unsigned int k=0; k<256; k++)
{
    double z = O_L + (double)k * R_L;
    for (unsigned int j=0; j<256; j++)
    {
        double y = O_L + (double)j * R_L;

        for (unsigned int i=0; i<256; i++)
        {
            double x[1] = {O_L + (double)i * R_L} ;             
            double w_n =  (A_n[2] * x[0] + A_n[5] * y + A_n[8] * z + A_n[11])  ;
            double u_n =  ((A_n[0] * x[0] + A_n[3] * y + A_n[6] * z + A_n[9] ) / w_n);
            double v_n =  ((A_n[1] * x[0] + A_n[4] * y + A_n[7] * z + A_n[10]) / w_n);                      

            for(int loop=0; loop<1;loop++)
            {
                px_x[loop] = (int) floor(u_n);
                px_y[loop] = (int) floor(v_n);
                alpha[loop] = u_n - px_x[loop] ;
                beta[loop]  = v_n - px_y[loop] ;
            }
///////////////////(i,j) pixels ///////////////////////////////
                if (px_x[0]>=0 && px_x[0]<(int)threadCopy[0].S_x && px_y[0]>=0 && px_y[0]<(int)threadCopy[0].S_y)                   

                    pixel_1[0] = threadCopy[0].I_n[px_y[0] * threadCopy[0].S_x + px_x[0]];
                else
                    pixel_1[0] = 0.0;               

                if (px_x[0]+1>=0 && px_x[0]+1<(int)threadCopy[0].S_x && px_y[0]>=0 && px_y[0]<(int)threadCopy[0].S_y)                   

                    pixel_1[2] = threadCopy[0].I_n[px_y[0] * threadCopy[0].S_x + (px_x[0]+1)];
                else
                    pixel_1[2] = 0.0;                   

/////////////////// (i+1, j) pixels/////////////////////////    

                if (px_x[0]>=0 && px_x[0]<(int)threadCopy[0].S_x && px_y[0]+1>=0 && px_y[0]+1<(int)threadCopy[0].S_y)
                    pixel_1[1] = threadCopy[0].I_n[(px_y[0]+1) * threadCopy[0].S_x + px_x[0]];
                else
                    pixel_1[1] = 0.0;                   

                if (px_x[0]+1>=0 && px_x[0]+1<(int)threadCopy[0].S_x && px_y[0]+1>=0 && px_y[0]+1<(int)threadCopy[0].S_y)

                    pixel_1[3] = threadCopy[0].I_n[(px_y[0]+1) * threadCopy[0].S_x + (px_x[0]+1)];

                else 

                    pixel_1[3] = 0.0;

                pix_1 = (1.0 - alpha[0]) * (1.0 - beta[0]) * pixel_1[0] + (1.0 - alpha[0]) * beta[0]  * pixel_1[1]

                +  alpha[0]  * (1.0 - beta[0]) * pixel_1[2]   +  alpha[0]  *  beta[0]  * pixel_1[3];                                

            f_L[k * L * L + j * L + i] += (float)(1.0 / (w_n * w_n) * pix_1);
                        }
    }

}
   }

I profiled the results using Intel Vtune-2013 (Using binary created from gcc-4.1) and I can see that there is 40% reduction in memory bandwidth usage which was expected because 2 images are being processed for every iteration.(f_L store operation causes 8 bytes of traffic for every voxel). This accounts to 11.7% reduction in bus cycles! Also, since the block size is increased in the inner loop, the resource stalls decrease by 25.5%. These 2 accounts for 18% reduction in response time. The mystery question is, why are instruction retired increased by 7.9%? (Which accounts for increase in response time by 6.51%) - Possible reason I could this of is: 1. Since the number of branch instructions increase inside the block (and core architecture has 8 bit global history) retired branch instruction increased by 2.5%( Although, mis-prediction remained the same! I know, smells fishy right?!!). But I am still missing answer for the rest 5.4%! Could anyone please shed me light in any direction? I'm completely out of options and No way to think. Thanks a lot!!

0 Kudos
1 Reply
Shiv_Inside
Beginner
812 Views

Should I re-post this question in a different Intel forum or this is the right place ?

Thanks.

0 Kudos
Reply