A newbie question about where the cache miss could end up

Jan_Hardenbergh
Beginner
I'm teachable and willing to do my homework, but it has been a while since I looked closely at CPU performance.

I'm looking at a little bit of rendering code that steps down a scanline, checks whether it is done with a pixel, and then gets the next texture value from dq.
After the texture fetch it looks up the color in a 4096-entry RGBA LUT and tests whether the color is zero (the test of fzsl). This test takes 28% of the total time.

I'm sure there are many ways this could be improved, but, my task now is just to understand it.
I've done several runs and the numbers at each statement are pretty stable.
So, can the test of fzsl be where the cache misses catch up? 28% of the total time is spent there, but only 6% where the fetch is initiated and another 8% where the result is used to look up the LUT.
Or are these numbers really more of a statistical-neighborhood, Heisenberg-type measurement, and not specifically about the individual statements?
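One way to check whether the time charged to a test is really load latency is a small experiment outside the renderer. The sketch below is not the rendering code: `count_nonzero`, `compare_access_orders`, and the buffer contents are made up for illustration. It runs the same fetch-then-test loop twice over one buffer, once in sequential order and once in shuffled order. The instructions executed are the same in both runs, so a large timing difference would point at memory access rather than at the compare itself.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical stand-in for the scanline loop: fetch a value, then test it.
// `order` decides which element of `tex` is read on each iteration.
static std::size_t count_nonzero(const std::vector<std::uint16_t>& tex,
                                 const std::vector<std::uint32_t>& order) {
    std::size_t hits = 0;
    for (std::uint32_t idx : order) {
        int dat = tex[idx];   // the load is issued here ...
        if (dat == 0)         // ... but its result is first needed here
            continue;
        ++hits;
    }
    return hits;
}

struct Timing {
    std::size_t hits;   // number of nonzero values seen
    double seconds;     // wall-clock time for the loop
};

static Timing timed_count(const std::vector<std::uint16_t>& tex,
                          const std::vector<std::uint32_t>& order) {
    auto t0 = std::chrono::steady_clock::now();
    std::size_t hits = count_nonzero(tex, order);
    auto t1 = std::chrono::steady_clock::now();
    return {hits, std::chrono::duration<double>(t1 - t0).count()};
}

struct Result {
    Timing sequential;
    Timing shuffled;
};

// Runs the same loop over the same data in two visiting orders.
static Result compare_access_orders(std::size_t n, unsigned seed) {
    std::mt19937 rng(seed);
    std::vector<std::uint16_t> tex(n);
    for (auto& v : tex)
        v = static_cast<std::uint16_t>(rng() & 0xfff);

    std::vector<std::uint32_t> order(n);
    std::iota(order.begin(), order.end(), 0u);
    Timing seq = timed_count(tex, order);

    std::shuffle(order.begin(), order.end(), rng);
    Timing rnd = timed_count(tex, order);

    // Same elements visited, so the counts must agree; only timing may differ.
    assert(seq.hits == rnd.hits);
    return {seq, rnd};
}
```

Calling `compare_access_orders` with a buffer well beyond the last-level cache size (for example `1u << 24` elements) and comparing the two `seconds` fields gives a rough idea of how much of a fetch-then-test loop is memory latency. Profiling that run would also show which statement the sampling tool charges the extra time to.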

for ( int x=s_x; x!=e_x; x+=n_x )                          1.52946     15%
{
    p += dx;                                               0.64849      7%
    int ss = sr[ x ];                                      0.0199
    if ( ss==-1 ) continue;                                0.18914
    short sz = (ss>> 0) & 0xffff;                          0.11956
    short ez = (ss>>16) & 0xffff;                          0.00992
    if ( (sz>z) || (ez ...
    int dat = (dq[ p ]) & 0xffff;                          0.585804     6%   <<< texture fetch
    int msk_val = dat & 0x0000f000;                        0.00993
    __m128 m_lut = _mm_load_ps( &h_lut_buf[ dat<<2 ] );    0.75983988   8%   <<< LUT lookup
    int fzsl = _mm_ucomilt_ss( m_lut, m_zsl );
    if ( fzsl )                                            2.36337947  28%   <<< test fzsl
        continue;

Thank you for any breadcrumbs!
YON
Peter_W_Intel
Employee
Here are my observations after reading your code with the performance data.
It doesn't make sense for the statement that tests fzsl to take 28% of the total time by itself. My guess is that there were many ICache misses. Possible reasons:
1. A big loop body is not good programming style; it can cause ICache misses.
2. Check whether _mm_load_ps() and _mm_ucomilt_ss() are inlined. If they are, that increases the loop body size.
You may try the following:
1. The compiler's "O2" option usually enables inlining; you may disable it and rerun VTune.
2. Change the algorithm and reduce the loop size.
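As a rough illustration of reducing the loop size, the loop could be split into two smaller passes: one that only fetches texture values into a scratch buffer, and one that only does the LUT lookup and test. This is a sketch under assumptions, not the original renderer: `gather_texels` and `count_visible` are hypothetical helpers, the `0x0fff` mask assumes a 4096-entry LUT index, and a scalar `<` stands in for `_mm_ucomilt_ss` on the first LUT component. Whether it is faster is untested; the main benefit is that fetch cost and test cost land in different loops in the profile.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Pass 1: only the texture fetch. `dq`, `p`, and `dx` follow the names in the
// post; the 0x0fff mask (a 4096-entry LUT index) is an assumption.
static void gather_texels(const std::vector<std::uint16_t>& dq,
                          std::size_t p, std::size_t dx, std::size_t count,
                          std::vector<int>& dat_out) {
    // The last index read is p + dx * count, so it must be in range.
    assert(p + dx * count < dq.size());
    dat_out.resize(count);
    for (std::size_t i = 0; i < count; ++i) {
        p += dx;
        dat_out[i] = dq[p] & 0x0fff;
    }
}

// Pass 2: only the LUT lookup and the test. Each LUT entry is 4 floats
// (RGBA); the comparison uses the first float, as _mm_ucomilt_ss would.
static std::size_t count_visible(const std::vector<float>& h_lut_buf,
                                 const std::vector<int>& dat, float zsl) {
    std::size_t visible = 0;
    for (int d : dat) {
        float first = h_lut_buf[static_cast<std::size_t>(d) << 2];
        if (first < zsl)   // scalar stand-in for: if ( fzsl ) continue;
            continue;
        ++visible;
    }
    return visible;
}
```

The scratch buffer adds memory traffic of its own, so this split is better treated as a measurement aid first and an optimization second.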
Regards, Peter