I ran into a strange situation while optimizing some MIC code.
The new, optimized code runs faster when measured with __rdtsc(), but the overall run time is actually slower than with the old code! The code, by the way, is not in a loop, and my co-worker found that sometimes a loop runs faster than the unrolled version. This led me to speculate that the instruction cache may be starved by too many vector operations, so I added _mm_delay_32(n) to let it recover.
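For context, the setup looks roughly like the sketch below. This is not the actual code -- the kernel, the sizes, and where the delay goes are placeholders -- and it assumes the Intel compiler, where __rdtsc() comes from immintrin.h and _mm_delay_32() is available when building for MIC.

```c
#include <stdio.h>
#include <immintrin.h>   /* __rdtsc(); _mm_delay_32() when building for MIC */

/* Placeholder for the vector-heavy routine being optimized (not the real code). */
static void kernel(float *dst, const float *src, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * 2.0f + 1.0f;
}

int main(void)
{
    enum { N = 1024, REPS = 100000 };
    static float in[N], out[N];

    unsigned long long start = __rdtsc();
    for (int r = 0; r < REPS; r++) {   /* repetition loop is only for measurement */
        kernel(out, in, N);
#ifdef __MIC__
        _mm_delay_32(8);               /* pause this hardware thread for ~8 clocks */
#endif
    }
    unsigned long long cycles = __rdtsc() - start;

    /* print a result so the compiler cannot discard the kernel calls */
    printf("out[0] = %f, avg cycles per call = %.1f\n",
           out[0], (double)cycles / REPS);
    return 0;
}
```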
These are the results I got:
no delay added -- run time 6.65
delay(4) -- run time 6.63
delay(8) -- run time 6.60
delay(12) -- run time 6.57
So can someone verify whether my speculation has any basis in fact?
Someone else may know more about what is going on and step in with the answer, but my suggestion would be to try Intel(R) VTune(TM) Amplifier, if you have it. That should be able to give you some idea of where the time is going and whether you have cache issues.
The delay issued by one thread within a core permits a different thread within the same core to take the delaying thread's time slice of the core.
This is different from what you experience on Intel64 or IA-32.
RE: VTune and this issue:
That would depend on whether VTune includes or excludes the delay ticks from the instruction cycle counts.
Jim Dempsey
The problem seems to be with the C switch statement. When I recoded it without the switch, the problem went away. I guess the instruction/branch prefetch just doesn't like the jump table.
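In case it helps anyone hitting the same thing, the change was of this general shape (illustrative only -- the cases and factors below are made up, not my code). A dense switch typically compiles to an indirect jump through a jump table, while the recoded version leaves only a small lookup and one predictable compare for the front end to deal with.

```c
/* Original shape: a dense switch usually becomes an indirect jump
 * through a compiler-generated jump table. */
static float scale_switch(int mode, float x)
{
    switch (mode) {
    case 0: return x;
    case 1: return x * 2.0f;
    case 2: return x * 4.0f;
    case 3: return x * 8.0f;
    default: return 0.0f;
    }
}

/* Recoded shape: a table lookup plus one compare,
 * with no indirect jump to mispredict. */
static float scale_table(int mode, float x)
{
    static const float factor[4] = { 1.0f, 2.0f, 4.0f, 8.0f };
    return (mode >= 0 && mode < 4) ? x * factor[mode] : 0.0f;
}
```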
