The following code snippet is reported as the top hotspot:

```
i = some positive int
j = some positive int   // j < i
L = some positive int
x = some positive int
p = some prime number (int)

while (i < L) {
    arr[i] = arr[i] + x;   // this line is reported to be the hotspot
    arr[j] = arr[j] + x;
    i = i + p;
    j = j + p;
}
```
Understanding what is happening here depends on the type of the "arr" variable, the difference between the value of "L" and the initial value of "i", and the difference between the initial values of "i" and "j".
If "L" is enough larger than the initial value of "i" that the accessed portion of the "arr" array will not fit in the L1 data cache, then the reported "hotspot" will correspond to the loads that miss in the L1 data cache. If "i-j" is small enough, the accesses through the second index will re-use cache lines recently brought in by the accesses through the first, so the second statement will mostly hit in the cache.
The first thing to look at is the compiler optimization reports for the loop starting at line 09. For large values of "L" (relative to the initial value of "i"), the loop should be vectorizable. In this case the hardware prefetchers should work well and no further optimization is usually needed.
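For example, with the classic Intel C/C++ compiler a vectorization report can be requested like this (the source file name is a placeholder):

```shell
# icc: write a detailed optimization report for the vectorizer phase
icc -O2 -qopt-report=5 -qopt-report-phase=vec -c hotspot.c
```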
The compiler may or may not choose to build special case code to handle various relative sizes of "i" and "L", so it is possible that the processor is executing scalar code here. Scalar code is not always a performance problem -- it depends on the processor model and the size of the data range being accessed.
It is typically valuable to try to map out where the data is coming from and estimate how many cycles should be required to move the data through the memory hierarchy -- then to compare the measured execution time with the estimate. This requires seeing the rest of the code (to know when the "arr" data was last accessed, and how much data has been loaded between that point and this point) as well as knowing which processor generation and model is being used.
In order to vectorize, compilers are likely to prefer an equivalent for() loop in which both array references use the same index plus a constant offset (or simd linear and safelen clauses), together with a __restrict qualifier (if not asserting simd with a compiler that supports it).
If the compiler's default assumptions about the loop trip count are not appropriate, the Intel compiler's loop_count pragma or profile-guided feedback may help.
The value of p being unknown at compile time may prevent simd optimization, or may keep the compiler from recognizing efforts to assert that i > j.