Re: Unexpected degradation in performance when loop counter is increased

biplabraut · 03-03-2009

Dear all,
I have been facing this problem for two days. My code has two nested for loops (outer and inner) whose performance I measure using QueryPerformanceCounter. The moment I increase the iteration count of the inner or outer loop beyond a certain value, there is a sudden jump (of several hundred) in the result obtained from QueryPerformanceCounter, so the code performs much worse than expected. These loops contain only simple statements (no conditionals). Another, similar nested loop that does contain conditional statements does not show this behavior.
Any clues in this regard will be really helpful and appreciated.

With regards,
S. Biplab Raut
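
A minimal sketch of the kind of measurement described above, for readers who want to reproduce it. The kernel `nested_sum` is a hypothetical stand-in (the actual loops were never posted), `best_time_us` is an illustrative helper name, and the non-Windows branch using `clock_gettime` is an assumption added only so the sketch builds elsewhere. Taking the fastest of several runs is one common way to reduce noise from interrupts and cold caches.

```c
#if !defined(_WIN32) && !defined(_POSIX_C_SOURCE)
#define _POSIX_C_SOURCE 199309L   /* exposes clock_gettime under strict -std modes */
#endif

#include <assert.h>
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>
/* Current time in microseconds from the Windows high-resolution counter. */
static double now_us(void)
{
    LARGE_INTEGER freq, count;
    QueryPerformanceFrequency(&freq);   /* ticks per second */
    QueryPerformanceCounter(&count);    /* current tick count */
    return (double)count.QuadPart * 1e6 / (double)freq.QuadPart;
}
#else
#include <time.h>
/* Fallback for non-Windows systems: POSIX monotonic clock. */
static double now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e6 + (double)ts.tv_nsec / 1e3;
}
#endif

/* Hypothetical stand-in for the function under test: sums x*y ints. */
static long nested_sum(const int *data, size_t x, size_t y)
{
    long sum = 0;
    for (size_t i = 0; i < x; i++)        /* outer loop */
        for (size_t j = 0; j < y; j++)    /* inner loop */
            sum += data[i * y + j];
    return sum;
}

/* Runs the kernel `reps` times and returns the fastest run in microseconds. */
static double best_time_us(const int *data, size_t x, size_t y,
                           int reps, long *result)
{
    assert(reps > 0);
    double best = -1.0;
    for (int r = 0; r < reps; r++) {
        double t0 = now_us();
        long s = nested_sum(data, x, y);
        double t1 = now_us();
        if (best < 0.0 || t1 - t0 < best)
            best = t1 - t0;
        *result = s;   /* store the result so the loops are not optimized away */
    }
    return best;
}
```

Calling `best_time_us` for (1200, 300) and (1200, 301) on the same buffer would show whether the jump is reproducible or an artifact of a single noisy run.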

TimP · 03-03-2009

Have you compared your results with typical tests for effective cache sizes, looked at memory usage, or investigated the event counters?

jimdempseyatthecove · 03-03-2009

Can you post the loops?

Two probable causes

1) you are unrolling the inner loop to the point where you exceed the size of the instruction cache.
2) The size of the data being manipulated exceeds a cache size (bumping into next cache level or RAM)

Jim Dempsey
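
One way to test the second cause is to time the same simple pass over buffers of increasing size and watch for steps in the time per element as the buffer outgrows each cache level. The sketch below is only an illustration: `touch_all` and `ns_per_element` are hypothetical names, the access pattern is a plain sequential read-modify-write, and `clock()` is used for portability rather than precision.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* One read-modify-write pass over n ints; returns a checksum so the work is kept. */
static unsigned long touch_all(int *buf, size_t n)
{
    unsigned long sum = 0;
    for (size_t i = 0; i < n; i++) {
        buf[i] += 1;
        sum += (unsigned long)buf[i];
    }
    return sum;
}

/* Average nanoseconds per element over `passes` passes on a buffer of `bytes` bytes.
   Returns a negative value if the allocation fails. */
static double ns_per_element(size_t bytes, int passes, unsigned long *checksum)
{
    assert(bytes >= sizeof(int) && passes > 0);
    size_t n = bytes / sizeof(int);
    int *buf = calloc(n, sizeof *buf);
    if (buf == NULL)
        return -1.0;

    unsigned long sum = touch_all(buf, n);   /* warm-up pass, not timed */
    clock_t t0 = clock();
    for (int p = 0; p < passes; p++)
        sum += touch_all(buf, n);
    clock_t t1 = clock();

    free(buf);
    *checksum = sum;
    return (double)(t1 - t0) / CLOCKS_PER_SEC * 1e9 / ((double)n * passes);
}
```

Sweeping `bytes` from a few KB to a few tens of MB and printing the result for each size is the usual way to use it. Hardware prefetching can blur the steps for a sequential pattern like this one, so a flat curve does not by itself rule out a cache capacity problem.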

srimks · 03-03-2009

Quoting - biplabraut

Dear all,
I have been facing this problem for two days. My code has two nested for loops (outer and inner) whose performance I measure using QueryPerformanceCounter. The moment I increase the iteration count of the inner or outer loop beyond a certain value, there is a sudden jump (of several hundred) in the result obtained from QueryPerformanceCounter, so the code performs much worse than expected. These loops contain only simple statements (no conditionals). Another, similar nested loop that does contain conditional statements does not show this behavior.
Any clues in this regard will be really helpful and appreciated.

With regards,
S. Biplab Raut

Outer-loop vectorization has traditionally been performed by interchanging the outer loop with the innermost loop and then vectorizing it in the innermost position. A more direct unroll-and-jam approach can vectorize an outer loop without loop interchange, which can be especially suitable for short-SIMD architectures.
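
To make the unroll-and-jam idea concrete, here is a small scalar sketch on a matrix-vector product. The kernel is a hypothetical example, not the poster's code: the outer loop is unrolled by two and the two copies of the inner loop are fused, so each `x[j]` is loaded once for two rows.

```c
#include <assert.h>
#include <stddef.h>

/* Reference version: y[i] = sum over j of a[i*cols + j] * x[j]. */
static void matvec_plain(const float *a, const float *x, float *y,
                         size_t rows, size_t cols)
{
    for (size_t i = 0; i < rows; i++) {
        float acc = 0.0f;
        for (size_t j = 0; j < cols; j++)
            acc += a[i * cols + j] * x[j];
        y[i] = acc;
    }
}

/* Unroll-and-jam: outer loop unrolled by 2, inner loop bodies fused ("jammed"). */
static void matvec_unroll_jam(const float *a, const float *x, float *y,
                              size_t rows, size_t cols)
{
    assert(a && x && y);
    size_t i = 0;
    for (; i + 1 < rows; i += 2) {
        float acc0 = 0.0f, acc1 = 0.0f;
        for (size_t j = 0; j < cols; j++) {
            float xj = x[j];                      /* shared by both rows */
            acc0 += a[i * cols + j] * xj;
            acc1 += a[(i + 1) * cols + j] * xj;
        }
        y[i] = acc0;
        y[i + 1] = acc1;
    }
    if (i < rows) {                               /* remainder row when rows is odd */
        float acc = 0.0f;
        for (size_t j = 0; j < cols; j++)
            acc += a[i * cols + j] * x[j];
        y[i] = acc;
    }
}
```

Each row is still summed in the same order as in the plain version, so both functions produce identical results; only the reuse of `x[j]` changes.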

Current optimizing compilers do not apply outer-loop vectorization in general. Why don't you focus only on the inner loop and check its benefits?

Could you share the code, or maybe a sample of it?

~BR

biplabraut · 03-04-2009

Quoting - tim18

Have you compared your results with typical tests for effective cache sizes, looked at memory usage, or investigated the event counters?

Hi, thank you for your reply. I have checked the cache sizes on my system: L1 is 2x32 KB for both code and data, and L2 is 6 MB. My function is plain C, not SIMD. The code has two for loops of the form for(){ for(){ } }. When the loop counters run to different limits in different test runs, for example the outer one to 1200 and the inner one to 300, the performance degrades many fold.
There is no problem in the SIMD equivalent.
Can you suggest something on this?

biplabraut · 03-04-2009

Quoting - jimdempseyatthecove

Can you post the loops?

Two probable causes

1) you are unrolling the inner loop to the point where you exceed the size of the instruction cache.
2) The size of the data being manipulated exceeds a cache size (bumping into next cache level or RAM)

Jim Dempsey

Thank you for your suggestions.
But the function is in C, and I am not unrolling as in the SIMD version.
The cache sizes on my system are: L1 2x32 KB for both code and data, and L2 6 MB. The two for loops are of the form for(){ for(){ } }. When the loop counters run to different limits in different test runs, for example the outer one to 1200 and the inner one to 300, the performance degrades many fold.
There is no problem in the SIMD equivalent.
The C function's performance degrades drastically once the counters (x, y) of the two loops (outer and inner) pass certain values. For example, when (x, y) is (1200, 300), the time measured by QPC is 570 us, but when (x, y) becomes (1200, 301), it degrades to 1890 us.

Awaiting your reply.

srimks · 03-04-2009

Quoting - biplabraut

Quoting - jimdempseyatthecove

Can you post the loops?

Two probable causes

1) you are unrolling the inner loop to the point where you exceed the size of the instruction cache.
2) The size of the data being manipulated exceeds a cache size (bumping into next cache level or RAM)

Jim Dempsey

Thank you for your suggestions.
But the function is in C, and I am not unrolling as in the SIMD version.
The cache sizes on my system are: L1 2x32 KB for both code and data, and L2 6 MB. The two for loops are of the form for(){ for(){ } }. When the loop counters run to different limits in different test runs, for example the outer one to 1200 and the inner one to 300, the performance degrades many fold.
There is no problem in the SIMD equivalent.

Awaiting your reply.

Can you simply apply "#pragma unroll(4)" at the beginning of the OUTER LOOP?

If the code is C, the current compiler will still perform some vectorization at its default SSE level. ICC v11.0 defaults to SSE2, so it will vectorize your application to at least the SSE2 level even if SSE2 is not mentioned on the command line.

~BR

biplabraut · 03-04-2009

Quoting - srimks

Can you simply apply "#pragma unroll(4)" at the beginning of the OUTER LOOP?

If the code is C, the current compiler will still perform some vectorization at its default SSE level. ICC v11.0 defaults to SSE2, so it will vectorize your application to at least the SSE2 level even if SSE2 is not mentioned on the command line.

~BR

Hi,
I tried #pragma unroll(4) and also #pragma unroll(1) before the outer for loop, but it didn't improve. The function's performance still degrades drastically once the counters (x, y) of the two loops (outer and inner) pass certain values. For example, when (x, y) is (1200, 300), the time measured by QPC is 570 us, but when (x, y) becomes (1200, 301), it degrades to 1890 us.

With Regards...

TimP · 03-04-2009

I can't see whether the likely suggestion of a data cache capacity issue was ever confirmed. If so, unroll-and-jam techniques are off the mark, until a cache blocking scheme is in use. It would take only about 2 sentences to tell us the size in bytes of your data array and of your caches, which we tried to persuade you to consider days ago.
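
The comparison being asked for is simple arithmetic, sketched below. The element size and the number of arrays touched by the loops were not given at this point in the thread, so both are parameters here; the cache sizes are the ones the poster reported, with the L1 figure read as 32 KB of data cache (an interpretation of "2x32 KB"). The function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Cache sizes as reported in this thread (L1 read as 32 KB of data cache). */
#define L1_DATA_BYTES ((size_t)32 * 1024)
#define L2_BYTES      ((size_t)6 * 1024 * 1024)

/* Bytes touched by a loop nest that visits x*y elements in each of `arrays` arrays. */
static size_t footprint_bytes(size_t x, size_t y, size_t elem_size, size_t arrays)
{
    assert(elem_size > 0 && arrays > 0);
    return x * y * elem_size * arrays;
}

/* 0 = fits in L1 data cache, 1 = fits in L2, 2 = spills to RAM. */
static int cache_level_needed(size_t bytes)
{
    if (bytes <= L1_DATA_BYTES)
        return 0;
    if (bytes <= L2_BYTES)
        return 1;
    return 2;
}
```

As an illustration only, assuming 4-byte elements and a single array, (1200, 300) touches 1,440,000 bytes and (1200, 301) touches 1,444,800 bytes. Both figures are far above 32 KB and below 6 MB, which is why the real element size and array count matter before drawing any conclusion.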

jimdempseyatthecove · 03-04-2009

Do the changes in x and y (loop iteration) affect indexing of data (i.e. using a larger data set)?

Can you copy and paste the code?

Jim

srimks · 03-04-2009

Quoting - biplabraut

Hi,
I tried #pragma unroll(4) and also #pragma unroll(1) before the outer for loop, but it didn't improve. The function's performance still degrades drastically once the counters (x, y) of the two loops (outer and inner) pass certain values. For example, when (x, y) is (1200, 300), the time measured by QPC is 570 us, but when (x, y) becomes (1200, 301), it degrades to 1890 us.

With Regards...

Could you tell me a few things:

(a) Which data types are used in the code, and is it a multi-file C package or a single file?
(b) What does your code basically do? Can you describe it in brief?
(c) Since it is in C, it must be using "struct"; could you check whether SoA (Structure of Arrays) is needed?
(d) Could you try "loop blocking" on both the OUTER and INNER LOOP together?
(e) Could you tell me the options used to build this file, I mean CFLAGS, CPPFLAGS, LDFLAGS, etc.?
(f) Could you try loop splitting?
(g) Could you share the statements inside the INNER LOOP?

~BR

biplabraut · 03-05-2009

Quoting - tim18

I can't see whether the likely suggestion of a data cache capacity issue was ever confirmed. If so, unroll-and-jam techniques are off the mark, until a cache blocking scheme is in use. It would take only about 2 sentences to tell us the size in bytes of your data array and of your caches, which we tried to persuade you to consider days ago.

It is a data cache capacity (L2 cache capacity) issue, although I could not get a good cache-miss figure for this function from VTune. The data blocks are large, more than 400-500 KB. I applied loop blocking to the inner and outer loops, but the performance improvement is not much. I now process 64x64 blocks of 4-byte data in the loops using loop blocking.
Can you explain a bit more what you meant in your comment?
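
For reference, a minimal sketch of loop blocking with the 64x64 tiles of 4-byte data mentioned above. The kernel is a hypothetical matrix transpose, chosen only because it is a simple loop nest where tiling changes the memory access pattern; the poster's actual loop body was never shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK 64   /* 64x64 tile of 4-byte elements = 16 KB per array */

/* Plain transpose: dst (cols x rows) receives the transpose of src (rows x cols). */
static void transpose_plain(const int32_t *src, int32_t *dst,
                            size_t rows, size_t cols)
{
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            dst[j * rows + i] = src[i * cols + j];
}

/* Blocked transpose: the same work, done one BLOCK x BLOCK tile at a time
   so that the lines touched by a tile can stay in cache while it is processed. */
static void transpose_blocked(const int32_t *src, int32_t *dst,
                              size_t rows, size_t cols)
{
    assert(src != dst);   /* out-of-place only */
    for (size_t ii = 0; ii < rows; ii += BLOCK) {
        size_t i_end = ii + BLOCK < rows ? ii + BLOCK : rows;
        for (size_t jj = 0; jj < cols; jj += BLOCK) {
            size_t j_end = jj + BLOCK < cols ? jj + BLOCK : cols;
            for (size_t i = ii; i < i_end; i++)
                for (size_t j = jj; j < j_end; j++)
                    dst[j * rows + i] = src[i * cols + j];
        }
    }
}
```

Note that a 64x64 tile of 4-byte elements is 16 KB, so one source tile plus one destination tile already adds up to 32 KB, the full L1 size reported in this thread. Under that assumption, a smaller tile such as 32x32 may be worth trying, though the best size depends on the real loop body.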

levicki · 03-08-2009

Without telling us the exact data size processed per INNER loop iteration there is no point in discussing this further, especially if you cannot create and post a reproducible test case. There are many variables that may affect performance and without knowing them every attempt at helping you is just a stab in the dark.
