I ran into a strange situation while optimizing some MIC code.
The new, optimized code runs faster when measured with __rdtsc(), but the overall run time is actually slower than with the old code! The code, by the way, is not in a loop, and my co-worker found that sometimes a loop runs faster than the unrolled version. This led me to speculate that the instruction cache may be starved by too many vector operations, so I added _mm_delay_32(n) to let it recover.
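For context, the setup looks roughly like the sketch below. This is not the actual code -- the kernel, the sizes, and where the delay goes are placeholders -- and it assumes the Intel compiler, where __rdtsc() comes from immintrin.h and _mm_delay_32() is available when building for MIC.

```c
#include <stdio.h>
#include <immintrin.h>   /* __rdtsc(); _mm_delay_32() when building for MIC */

/* Placeholder for the vector-heavy routine being optimized (not the real code). */
static void kernel(float *dst, const float *src, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * 2.0f + 1.0f;
}

int main(void)
{
    enum { N = 1024, REPS = 100000 };
    static float in[N], out[N];

    unsigned long long start = __rdtsc();
    for (int r = 0; r < REPS; r++) {   /* repetition loop is only for measurement */
        kernel(out, in, N);
#ifdef __MIC__
        _mm_delay_32(8);               /* pause this hardware thread for ~8 clocks */
#endif
    }
    unsigned long long cycles = __rdtsc() - start;

    /* print a result so the compiler cannot discard the kernel calls */
    printf("out[0] = %f, avg cycles per call = %.1f\n",
           out[0], (double)cycles / REPS);
    return 0;
}
```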
These are the results I got:
no delay added -- run time 6.65
delay(4) -- run time 6.63
delay(8) -- run time 6.60
delay(12) -- run time 6.57
So can someone verify whether my speculation has any basis in fact?
Someone else may know more about what is going on and step in with the answer, but my suggestion would be to try Intel(R) VTune(TM) Amplifier, if you have it. That should be able to give you some idea of where the time is going and whether you have cache issues.
The delay issued by one thread within a core permits a different thread within the same core to take the delaying thread's time slice of the core.
This is different from what you experience on Intel64 or IA-32.
RE: VTune and this issue:
That would depend on whether VTune includes or excludes the delay ticks from the instruction cycle counts.
Jim Dempsey
The problem seems to be with the C switch statement. When I recoded it without the switch, the problem went away. I guess the instruction/branch prefetch just doesn't like the jump table.
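In case it helps anyone hitting the same thing, the change was of this general shape (illustrative only -- the cases and factors below are made up, not my code). A dense switch typically compiles to an indirect jump through a jump table, while the recoded version leaves only a small lookup and one predictable compare for the front end to deal with.

```c
/* Original shape: a dense switch usually becomes an indirect jump
 * through a compiler-generated jump table. */
static float scale_switch(int mode, float x)
{
    switch (mode) {
    case 0: return x;
    case 1: return x * 2.0f;
    case 2: return x * 4.0f;
    case 3: return x * 8.0f;
    default: return 0.0f;
    }
}

/* Recoded shape: a table lookup plus one compare,
 * with no indirect jump to mispredict. */
static float scale_table(int mode, float x)
{
    static const float factor[4] = { 1.0f, 2.0f, 4.0f, 8.0f };
    return (mode >= 0 && mode < 4) ? x * factor[mode] : 0.0f;
}
```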
