A newbie question about where the cache miss could end up

Jan_Hardenbergh · ‎04-24-2012

I'm teachable and willing to do my homework, but it has been a while since I looked closely at CPU performance.

I'm looking at a little bit of rendering code that steps down a scanline, checks whether it is done with a pixel, and then gets the next texture value (dq).
After the texture fetch it looks up the color in a 4096-entry RGBA LUT and tests whether the color is zero (the test of fzsl). This test takes 28% of the total time.
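
The lookup-and-test step described above can be sketched in isolation as follows. This is a minimal sketch, not the actual renderer: `lut` and `entry_below` are hypothetical names standing in for h_lut_buf and the fzsl test, and masking the index to 12 bits is an assumption made only to stay inside a 4096-entry table.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: _mm_load_ps, _mm_set_ss, _mm_ucomilt_ss

// Hypothetical stand-in for h_lut_buf: 4096 RGBA entries, 4 floats each.
// 16-byte alignment is required by _mm_load_ps.
alignas(16) float lut[4096 * 4];

// Returns 1 when the first float of entry `idx` is below `threshold`, else 0.
// _mm_ucomilt_ss compares only the lowest lane of its two operands.
int entry_below(unsigned idx, float threshold) {
    // Masking to 12 bits is an assumption that keeps the index inside the table.
    __m128 color = _mm_load_ps(&lut[(idx & 0xfffu) << 2]);
    __m128 limit = _mm_set_ss(threshold);
    return _mm_ucomilt_ss(color, limit);
}
```

For example, with the first float of entry 5 set to 0.75 and entry 7 left at zero, `entry_below(5, 0.5f)` gives 0 and `entry_below(7, 0.5f)` gives 1. The compare cannot complete until the load has delivered its data, which is why the two statements are worth looking at together in a profile.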

I'm sure there are many ways this could be improved, but, my task now is just to understand it.
I've done several runs and the numbers at each statement are pretty stable.
So, can the test of fzsl be where the cache misses catch up? That is, 28% of the total time is spent there, but only 6% where the fetch is initiated and another 8% where the result is used to look up the LUT.
Or are these numbers really more of a statistical-neighborhood, Heisenberg-type measurement and not specifically about the individual statements?

for ( int x=s_x; x!=e_x; x+=n_x )                        1.52946     15%
{
    p += dx;                                             0.64849      7%
    int ss = sr[ x ];                                    0.0199
    if ( ss==-1 ) continue;                              0.18914
    short sz = (ss>> 0) & 0xffff;                        0.11956
    short ez = (ss>>16) & 0xffff;                        0.00992
    if ( (sz>z) || (ez ...
    int dat = (dq[ p ]) & 0xffff;                        0.585804     6%   <<<< texture fetch
    int msk_val = dat & 0x0000f000;                      0.00993
    __m128 m_lut = _mm_load_ps( &h_lut_buf[ dat<<2 ] );  0.75983988   8%   <<<< LUT lookup
    int fzsl = _mm_ucomilt_ss( m_lut, m_zsl );
    if ( fzsl )                                          2.36337947  28%   <<<< test fzsl
        continue;
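
One way to test whether the time charged to a compare is really load latency is to run the same load-then-compare pattern over identical entries in two different orders and profile each run. This is a minimal sketch under stated assumptions, not the original renderer: `count_below`, `sequential_order`, and `scattered_order` are hypothetical names, and the table is assumed to be 16-byte aligned with four floats per entry.

```cpp
#include <xmmintrin.h>  // _mm_load_ps, _mm_set_ss, _mm_ucomilt_ss
#include <cstddef>
#include <cstdint>
#include <vector>

// Indices 0..n-1 in increasing order (cache-friendly).
std::vector<std::uint32_t> sequential_order(std::uint32_t n) {
    std::vector<std::uint32_t> idx(n);
    for (std::uint32_t k = 0; k < n; ++k) idx[k] = k;
    return idx;
}

// The same indices, each exactly once, in a scattered order.
// n must be a power of two; an odd stride then yields a permutation.
std::vector<std::uint32_t> scattered_order(std::uint32_t n, std::uint32_t odd_stride) {
    std::vector<std::uint32_t> idx(n);
    for (std::uint32_t k = 0; k < n; ++k)
        idx[k] = (k * odd_stride) & (n - 1);
    return idx;
}

// Counts the indices whose table entry has a first float below `threshold`.
// `table` must be 16-byte aligned, 4 floats per entry, every index in range.
std::size_t count_below(const float* table,
                        const std::vector<std::uint32_t>& idx,
                        float threshold) {
    const __m128 limit = _mm_set_ss(threshold);
    std::size_t hits = 0;
    for (std::uint32_t i : idx) {
        __m128 color = _mm_load_ps(&table[std::size_t(i) << 2]);  // load that may miss
        if (_mm_ucomilt_ss(color, limit))                         // first use of the loaded value
            ++hits;
    }
    return hits;
}
```

Both orders touch the same entries, so the returned count is identical and only the memory access pattern differs. If the scattered run, on a table much larger than the cache, shifts time onto the `if` line rather than the load line, that would suggest the samples land on the first statement that needs the loaded value. With a table that fits in cache the two profiles should look much alike.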

Thank you for any breadcrumbs!
YON

Peter_W_Intel · ‎04-25-2012

Here are my observations after reading your code and the performance data.

It doesn't make sense for the fzsl test statement to take 28% of the total time. My guess is that there were many ICache misses. Possible reasons:

1. A big loop body is not good programming style; it can cause ICache misses.

2. Check whether _mm_load_ps() and _mm_ucomilt_ss() are inlined. If they are, the loop body gets larger.

You may do:

1. The compiler's "O2" option usually enables inlining. You may disable inlining and rerun VTune.

2. Change the algorithm to reduce the loop size.

Regards, Peter