Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Unexpected degradation in performance when loop counter is increased

biplabraut
Beginner
846 Views
Dear all,
I have been facing this problem for the last two days. There are two for loops (an inner and an outer nested loop) in my code whose performance I measure using QueryPerformanceCounter. The moment I increase the counter of the inner or outer loop beyond certain values, there is a sudden jump (of several hundred) in the result obtained from QueryPerformanceCounter, so the performance of this piece of code is degraded from the expected value. These loops contain only simple statements (no conditionals). Another, similar nested loop that does contain conditional statements does not behave like this performance-wise.
Any clues in this regard will be really helpful and appreciated.

With regards,
S. Biplab Raut
12 Replies
TimP
Honored Contributor III
Have you compared your results with typical tests for effective cache sizes, looked at memory usage, or investigated the event counters?
jimdempseyatthecove
Honored Contributor III

Can you post the loops?

Two probable causes

1) you are unrolling the inner loop to the point where you exceed the size of the instruction cache.
2) The size of the data being manipulated exceeds a cache size (bumping into next cache level or RAM)

Jim Dempsey
srimks
New Contributor II
Quoting - biplabraut
Dear all,
I have been facing this problem for the last two days. There are two for loops (an inner and an outer nested loop) in my code whose performance I measure using QueryPerformanceCounter. The moment I increase the counter of the inner or outer loop beyond certain values, there is a sudden jump (of several hundred) in the result obtained from QueryPerformanceCounter, so the performance of this piece of code is degraded from the expected value. These loops contain only simple statements (no conditionals). Another, similar nested loop that does contain conditional statements does not behave like this performance-wise.
Any clues in this regard will be really helpful and appreciated.

With regards,
S. Biplab Raut
Outer-loop vectorization has traditionally been performed by interchanging the outer loop with the innermost loop and then vectorizing it at the innermost position. A more direct unroll-and-jam approach can vectorize an outer loop without loop interchange, which is especially suitable for short-SIMD architectures.

Current optimizing compilers do not generally apply outer-loop vectorization. Why don't you focus only on the inner loop and check its benefits?

Could you share the code, or perhaps a sample of it?

~BR


biplabraut
Beginner
Quoting - tim18
Have you compared your results with typical tests for effective cache sizes, looked at memory usage, or investigated the event counters?
Hi, thank you for your reply. I have checked the cache sizes on my system: L1 is 2x32 KB (separate code and data caches) and L2 is 6 MB. My function is in plain C, not SIMD. There are two for loops, like for(){ for(){ } }, in the code. When the respective loop counters run up to different values in different test runs, like the outer one to 1200 and the inner one to 300, the performance degradation is many-fold.
There is no problem in the SIMD equivalent.
Can you suggest something on this?
biplabraut
Beginner

Can you post the loops?

Two probable causes

1) you are unrolling the inner loop to the point where you exceed the size of the instruction cache.
2) The size of the data being manipulated exceeds a cache size (bumping into next cache level or RAM)

Jim Dempsey
Thank you for your suggestions.
But the function is in plain C, and I am not unrolling it as in SIMD.
The cache sizes on my system are L1 2x32 KB (code and data) and L2 6 MB. The two for loops are like -> for(){ for(){ } } in the code. When the respective loop counters run up to different values in different test runs, like the outer one to 1200 and the inner one to 300, the performance degradation is many-fold.
There is no problem in the SIMD equivalent.
The C function's performance degrades drastically after the counters (x, y) of the two loops (outer and inner) pass certain values. E.g., when (x, y) are (1200, 300), the time measured by QPC is 570 us, but when (x, y) become (1200, 301), it degrades to 1890 us.

Awaiting your reply.
srimks
New Contributor II
Quoting - biplabraut

Can you post the loops?

Two probable causes

1) you are unrolling the inner loop to the point where you exceed the size of the instruction cache.
2) The size of the data being manipulated exceeds a cache size (bumping into next cache level or RAM)

Jim Dempsey
Thank you for your suggestions.
But the function is in plain C, and I am not unrolling it as in SIMD.
The cache sizes on my system are L1 2x32 KB (code and data) and L2 6 MB. The two for loops are like -> for(){ for(){ } } in the code. When the respective loop counters run up to different values in different test runs, like the outer one to 1200 and the inner one to 300, the performance degradation is many-fold.
There is no problem in the SIMD equivalent.

Awaiting your reply.
Can you simply apply "#pragma unroll(4)" at the beginning of the OUTER loop?

If the code is C, the current compiler will perform vectorization at some default SSE level. ICC v11.0 defaults to SSE2, so it will vectorize your application to at least the SSE2 level without SSE2 being mentioned on the command line.

~BR
biplabraut
Beginner
Quoting - srimks
Can you simply apply "#pragma unroll(4)" at the beginning of the OUTER loop?

If the code is C, the current compiler will perform vectorization at some default SSE level. ICC v11.0 defaults to SSE2, so it will vectorize your application to at least the SSE2 level without SSE2 being mentioned on the command line.

~BR
Hi,
I tried #pragma unroll(4) and also #pragma unroll(1) before the outer FOR loop, but it didn't improve. The function's performance still degrades drastically after the counters (x, y) of the two loops (outer and inner) pass certain values. E.g., when (x, y) are (1200, 300), the time measured by QPC is 570 us, but when (x, y) become (1200, 301), it degrades to 1890 us.

With Regards...
TimP
Honored Contributor III
I can't see whether the likely suggestion of a data cache capacity issue was ever confirmed. If so, unroll-and-jam techniques are off the mark, until a cache blocking scheme is in use. It would take only about 2 sentences to tell us the size in bytes of your data array and of your caches, which we tried to persuade you to consider days ago.
jimdempseyatthecove
Honored Contributor III

Do the changes in x and y (loop iteration) affect indexing of data (i.e. using a larger data set)?

Can you copy and paste the code?

Jim

srimks
New Contributor II
Quoting - biplabraut
Hi,
I tried #pragma unroll(4) and also #pragma unroll(1) before the outer FOR loop, but it didn't improve. The function's performance still degrades drastically after the counters (x, y) of the two loops (outer and inner) pass certain values. E.g., when (x, y) are (1200, 300), the time measured by QPC is 570 us, but when (x, y) become (1200, 301), it degrades to 1890 us.

With Regards...

Could you tell me a few things:

(a) What data types are used in the code, and is it a multi-file C package or a single file?
(b) What does your code do, basically? Can you describe it briefly?
(c) Since it is in C, it probably uses struct; could you check whether SoA (Structure of Arrays) is needed?
(d) Could you try performing loop blocking on both the OUTER and INNER loops together?
(e) Could you tell me the options used to build this file, i.e. the CFLAGS, CPPFLAGS, LDFLAGS, etc.?
(f) Could you try performing loop splitting?
(g) Could you share the statements inside the INNER loop?

~BR
biplabraut
Beginner
Quoting - tim18
I can't see whether the likely suggestion of a data cache capacity issue was ever confirmed. If so, unroll-and-jam techniques are off the mark, until a cache blocking scheme is in use. It would take only about 2 sentences to tell us the size in bytes of your data array and of your caches, which we tried to persuade you to consider days ago.
It is an issue of data cache / L2 cache capacity, although I could not find a clear figure for cache misses for this function using VTune. The data blocks are large, more than 400-500 KB. I applied loop blocking to the inner and outer loops, but the performance improvement is not much. I now process blocks of 64x64 elements of 4-byte data in the loops using loop blocking.
Can you explain a bit more what you meant in your comment?
levicki
Valued Contributor I
Without telling us the exact data size processed per INNER loop iteration, there is no point in discussing this further, especially if you cannot create and post a reproducible test case. There are many variables that may affect performance, and without knowing them every attempt at helping you is just a stab in the dark.

