I am trying to optimize my code. I ran a memory access analysis and picked the function at the top of the list when sorted by CPU time.
I opened the function in the source and assembly views. Looking at the per-line results, I found something strange.
The line that consumes the most time is a simple punpcklbw instruction, and it is also reported with a large number of loads/stores. I think that is impossible, because the instruction only operates on an xmm register. I have uploaded the code and the analysis result as an image.
The assembly in block 70 corresponds to the "if (left)" branch in the C code.
The instruction "punpcklbw xmm2, xmm2" does not access memory, so why does this line show so many loads/stores?
Can anyone help me interpret this analysis result? And how can I reduce the time spent in this code block?
This block is the biggest time consumer in the function that consumes the most time.
You should take event skid into account: a sample can be attributed to an instruction that comes after the one that actually caused the event.
For the clockticks event, which is used for the CPU Time metric, the skid is usually one instruction (but it can be more in some cases). So the most time-consuming instruction is most likely the one just before "punpcklbw xmm2, xmm2".
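To make the effect of skid concrete, here is a minimal C sketch of a toy model. It is not how VTune works internally. It assumes a straight-line run of instructions with no branches and a fixed skid, and the names shift_samples_back and hottest_index are hypothetical helpers made up for this illustration. It moves each per-instruction sample count back by the skid amount, so you can see which instruction would be the hottest if every sample had landed one instruction late.

```c
#include <stddef.h>

/*
 * Toy model (assumption): every sample landed `skid` instructions after
 * the instruction that really caused it, and the instructions form one
 * straight-line block. Real skid varies and can cross branches.
 *
 * observed[i] is the sample count reported for instruction i.
 * adjusted[i] receives the count re-attributed to instruction i.
 */
void shift_samples_back(const unsigned long *observed,
                        unsigned long *adjusted,
                        size_t n, size_t skid)
{
    for (size_t i = 0; i < n; i++)
        adjusted[i] = 0;

    for (size_t i = 0; i < n; i++) {
        /* Samples near the start cannot move before instruction 0,
         * so they are clamped to the first instruction. */
        size_t target = (i >= skid) ? i - skid : 0;
        adjusted[target] += observed[i];
    }
}

/* Return the index of the instruction with the most samples
 * (the first one on a tie). Requires n >= 1. */
size_t hottest_index(const unsigned long *samples, size_t n)
{
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (samples[i] > samples[best])
            best = i;
    }
    return best;
}
```

For example, with made-up counts {10, 20, 900, 30} and a skid of 1, the hottest instruction moves from index 2 to index 1. That matches the reasoning above: the instruction right before the reported hotspot is the more likely culprit.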
Also, the processor has a limited set of 'precise' events that do not suffer from the skid problem: https://software.intel.com/en-us/vtune-amplifier-help-precise-events
Memory access analysis contains several metrics, such as 'Loads', 'Stores', and 'LLC Cache Misses', that are built on precise events and should therefore point exactly to the instruction that caused them.