Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.
Announcements
FPGA community forums and blogs on community.intel.com are migrating to the new Altera Community and are read-only. For urgent support needs during this transition, please visit the FPGA Design Resources page or contact an Altera Authorized Distributor.

What causes the retired instructions to increase?

Shiv_Inside
Beginner
1,085 Views

This is a re-post from stack-overflow: http://stackoverflow.com/questions/23020148/what-causes-the-retired-instructions-to-increase

I have a 496*O(N^3) loop. I am performing a blocking optimization technique where I'm operating 2 images at a time instead of 1. In raw terms, I am unrolling the outer loop. (The non-unrolled version of the code is as shown below: ) b.t.w I'm using Intel Xeon X5365 machine that has 8 cores and it has 3GHz clock, 1333MHz bus frequency, Shared 8MB L2( 4 MB shared between every 2 core), L1-I 32KB,L1-D 32KB .

for(imageNo =0; imageNo<496;imageNo++){
for (unsigned int k=0; k<256; k++)
{
    double z = O_L + (double)k * R_L;
    for (unsigned int j=0; j<256; j++)
    {
        double y = O_L + (double)j * R_L;

        for (unsigned int i=0; i<256; i++)
        {
            double x[1] = {O_L + (double)i * R_L} ;             
            double w_n =  (A_n[2] * x[0] + A_n[5] * y + A_n[8] * z + A_n[11])  ;
            double u_n =  ((A_n[0] * x[0] + A_n[3] * y + A_n[6] * z + A_n[9] ) / w_n);
            double v_n =  ((A_n[1] * x[0] + A_n[4] * y + A_n[7] * z + A_n[10]) / w_n);                      

            for(int loop=0; loop<1;loop++)
            {
                px_x[loop] = (int) floor(u_n);
                px_y[loop] = (int) floor(v_n);
                alpha[loop] = u_n - px_x[loop] ;
                beta[loop]  = v_n - px_y[loop] ;
            }
///////////////////(i,j) pixels ///////////////////////////////
                if (px_x[0]>=0 && px_x[0]<(int)threadCopy[0].S_x && px_y[0]>=0 && px_y[0]<(int)threadCopy[0].S_y)                   

                    pixel_1[0] = threadCopy[0].I_n[px_y[0] * threadCopy[0].S_x + px_x[0]];
                else
                    pixel_1[0] = 0.0;               

                if (px_x[0]+1>=0 && px_x[0]+1<(int)threadCopy[0].S_x && px_y[0]>=0 && px_y[0]<(int)threadCopy[0].S_y)                   

                    pixel_1[2] = threadCopy[0].I_n[px_y[0] * threadCopy[0].S_x + (px_x[0]+1)];
                else
                    pixel_1[2] = 0.0;                   

/////////////////// (i+1, j) pixels/////////////////////////    

                if (px_x[0]>=0 && px_x[0]<(int)threadCopy[0].S_x && px_y[0]+1>=0 && px_y[0]+1<(int)threadCopy[0].S_y)
                    pixel_1[1] = threadCopy[0].I_n[(px_y[0]+1) * threadCopy[0].S_x + px_x[0]];
                else
                    pixel_1[1] = 0.0;                   

                if (px_x[0]+1>=0 && px_x[0]+1<(int)threadCopy[0].S_x && px_y[0]+1>=0 && px_y[0]+1<(int)threadCopy[0].S_y)

                    pixel_1[3] = threadCopy[0].I_n[(px_y[0]+1) * threadCopy[0].S_x + (px_x[0]+1)];

                else 

                    pixel_1[3] = 0.0;

                pix_1 = (1.0 - alpha[0]) * (1.0 - beta[0]) * pixel_1[0] + (1.0 - alpha[0]) * beta[0]  * pixel_1[1]

                +  alpha[0]  * (1.0 - beta[0]) * pixel_1[2]   +  alpha[0]  *  beta[0]  * pixel_1[3];                                

            f_L[k * L * L + j * L + i] += (float)(1.0 / (w_n * w_n) * pix_1);
                        }
    }

}
   }

I profiled the results using Intel Vtune-2013 (Using binary created from gcc-4.1) and I can see that there is 40% reduction in memory bandwidth usage which was expected because 2 images are being processed for every iteration.(f_L store operation causes 8 bytes of traffic for every voxel). This accounts to 11.7% reduction in bus cycles! Also, since the block size is increased in the inner loop, the resource stalls decrease by 25.5%. These 2 accounts for 18% reduction in response time. The mystery question is, why are instruction retired increased by 7.9%? (Which accounts for increase in response time by 6.51%) - Possible reason I could this of is: 1. Since the number of branch instructions increase inside the block (and core architecture has 8 bit global history) retired branch instruction increased by 2.5%( Although, mis-prediction remained the same! I know, smells fishy right?!!). But I am still missing answer for the rest 5.4%! Could anyone please shed me light in any direction? I'm completely out of options and No way to think. Thanks a lot!!

0 Kudos
1 Reply
Shiv_Inside
Beginner
1,085 Views

Should I re-post this question in a different Intel forum or this is the right place ?

Thanks.

0 Kudos
Reply