I've put loop count information before some loops using:
the optimisation report says:
Loop with pragma of trip count = 1000 ignored for large value
I've trawled through the Intel documentation, but in light of this report message I'm still uncertain whether I'm using this pragma correctly.
The more usual usage, to suggest optimizing for about that loop count, would be #pragma loop_count avg(1000). Unless you want to optimize for an exact loop count (or a list of such counts), the avg, min, and max clauses may be important. I don't think the wording of the documentation was well thought out.
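As a rough sketch of the avg/min/max form mentioned above (not taken from the original code): the function name scale and the values 256, 1000 and 4096 are made up for illustration, and the pragma is guarded so that compilers other than Intel's simply skip it.

```c
#include <assert.h>
#include <stddef.h>

/* Scale n doubles in place (hypothetical example function).
   The loop_count pragma is only a hint about the expected trip count;
   it does not change what the loop computes. */
void scale(double *a, size_t n, double s)
{
    assert(a != NULL || n == 0);
#if defined(__INTEL_COMPILER) || defined(__INTEL_LLVM_COMPILER)
    /* Assumed values: typical count about 1000, never below 256 or above 4096. */
#pragma loop_count min(256), avg(1000), max(4096)
#endif
    for (size_t i = 0; i < n; ++i)
        a[i] *= s;
}
```

Whether the hint changes the generated code is something the optimisation report would have to confirm for the loop in question.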
We're attempting to ensure that we get the best results from the compiler when it generates AVX instructions. So we thought that if we constrained the array member count to be a multiple of 4, this would help keep the sub-arrays within multi-dimensional arrays aligned on 32-byte boundaries for doubles. Hence the use of a large count value that is 0 modulo 4.
I find the documentation confusing: from memory, some examples refer to trip counts of 1000, while in other places it says that the trip count may be used as the loop unrolling value. That would be fine for small arrays, but we're dealing with larger indices.
I was hoping that if we constrained the arrays in this manner we would remove the need for remainder loops that would in turn reduce the code size allowing better use of the instruction cache.
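For what it's worth, the padding idea described above can be sketched roughly as follows. This is not our actual code: the helper names padded_cols, alloc_matrix and rows_aligned are invented for the example, and it assumes a C11 library that provides aligned_alloc.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* 32-byte AVX alignment holds 4 doubles. */
enum { AVX_ALIGN = 32, DOUBLES_PER_VEC = AVX_ALIGN / sizeof(double) };

/* Round a row length up to the next multiple of 4 doubles. */
size_t padded_cols(size_t cols)
{
    return (cols + DOUBLES_PER_VEC - 1) / DOUBLES_PER_VEC * DOUBLES_PER_VEC;
}

/* Allocate a rows x cols_padded matrix whose base is 32-byte aligned.
   Because each row is a multiple of 32 bytes long, every row start
   is then also 32-byte aligned. Release with free(). */
double *alloc_matrix(size_t rows, size_t cols_padded)
{
    assert(cols_padded % DOUBLES_PER_VEC == 0);
    return aligned_alloc(AVX_ALIGN, rows * cols_padded * sizeof(double));
}

/* Return 1 if every row start is 32-byte aligned, 0 otherwise. */
int rows_aligned(const double *m, size_t rows, size_t cols_padded)
{
    for (size_t r = 0; r < rows; ++r)
        if ((uintptr_t)(m + r * cols_padded) % AVX_ALIGN != 0)
            return 0;
    return 1;
}
```

Note that the compiler still has to be told about the alignment (or be able to prove it) before it can drop peel loops; the padding on its own only makes that possible.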
When you set a large loop count, it can't eliminate the loop by unrolling. That may be the reason you get the message that it's ignored.
I've found that AVX frequently needs unroll(4) for best performance. In general, that increases the importance of vectorized remainder loops. Assuming 64-bit data, the total trip count would need to be a multiple of 16 (4 doubles per 256-bit vector, times 4x unrolling) to avoid executing remainder loops. You may be able to turn off vectorization of remainder loops (for example, unroll(0) would do it), which would reduce code size, but i-cache misses shouldn't be a significant factor when running loops with a large count. I don't think the compiler will allow for removing remainder loops entirely, except in the case of a fixed loop count with known aligned data. I think you would need to collect events, e.g. with VTune, to see whether you spend measurable time on i-cache misses and whether you are able to improve on them.
I see that the spell corrector likes neither the American nor the British spelling of vectorization.
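To make the multiple-of-16 arithmetic concrete, here is a small sketch. The function remainder_iters is a hypothetical helper, not anything the compiler provides; it just computes how many iterations would be left over for a remainder loop given the vector width and unroll factor.

```c
#include <assert.h>
#include <stddef.h>

/* Iterations left for the remainder loop when the main loop body
   handles lanes * unroll elements per pass.
   For AVX with doubles: lanes = 4 (256 bits / 64 bits). */
size_t remainder_iters(size_t n, size_t lanes, size_t unroll)
{
    assert(lanes > 0 && unroll > 0);
    return n % (lanes * unroll);
}
```

With lanes = 4 and unroll = 4, a trip count of 1000 leaves 8 iterations for the remainder loop even though 1000 is a multiple of 4, while 1008 leaves none.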
Thanks again, Tim. Good points re modulo 16 for avoiding remainder loops with code unrolled 4x and checking to see if i-cache misses are an issue with this code. Always a good idea in software performance work to get operational data as against surmise!
And I should have said that the optimisation report and direct inspection of the assembler in Vectorisation Adviser both indicate that the code, compiled with the AVX switch, has been unrolled 4x by the compiler without needing a pragma unroll.