How to optimize the following hotspot in the code reported by Vtune Amplifier?

Feroz_A_ · ‎01-28-2017

The following code snippet is reported as the top hotspot :

i = some positive int
j = some positive int 
L = some positive int

//j < i
x = some positive int
p = some prime number ( int)

while (i < L)
{
	arr = arr + x;      //this line is reported to be the hotspot
	arr = arr + x;
	i = i + p;
	j = j + p;
}

McCalpinJohn · ‎01-30-2017

Understanding what is happening here depends on the type of the "arr" variable, the difference between the value of "L" and the initial value of "i", and the difference between the initial values of "i" and "j".

If "L" is enough larger than the initial value of "i" that the accessed portion of the "arr" array will not will not fit in the L1 data cache, then the reported "hotspot" will correspond to the loads that miss in the L1 data cache. If "i-j" is small enough that arr..arr will fit in the L1 data cache, then the second load statement should hit in the L1 data cache, so it should be inexpensive.

The first thing to look at is the compiler optimization reports for the loop starting at line 09. For large values of "L" (relative to the initial value of "i"), the loop should be vectorizable. In this case the hardware prefetchers should work well and no further optimization is usually needed.

The compiler may or may not choose to build special case code to handle various relative sizes of "i" and "L", so it is possible that the processor is executing scalar code here. Scalar code is not always a performance problem -- it depends on the processor model and the size of the data range being accessed.

It is typically valuable to try to map out where the data is coming from and estimate how many cycles should be required to move the data through the memory hierarchy -- then to compare the measured execution time with the estimate. This requires seeing the rest of the code (to know when the "arr" data was last accessed, and how much data has been loaded between that point and this point) as well as knowing which processor generation and model is being used.

TimP · ‎01-30-2017

In order to vectorize, compilers are likely to prefer an equivalent for() with both arrays using the same index with offset (or simd linear and safelen clauses) as well as __restrict definition (if not asserting simd with a compiler supporting it).

In case the compiler default assumptions about loop count aren't appropriate, intel compiler loop count pragma or profiler feedback may help.

Value of p being unknown at compile time may prevent simd optimization or recognition of efforts to assert i>j