Dear all,
I have been facing this problem for two days. My code has two nested for loops (an outer and an inner loop) whose performance I measure using QueryPerformanceCounter. The moment I raise the iteration count of the inner or outer loop beyond a certain value, there is a sudden jump (of several hundred) in the result obtained from QueryPerformanceCounter, so the measured performance of this piece of code is much worse than expected. These loops contain only simple statements (no conditionals). Another, similar nested loop that does contain conditional statements does not show this behavior.
Any clues in this regard will be really helpful and appreciated.
With regards,
S. Biplab Raut
12 Replies
Have you compared your results with typical tests for effective cache sizes, looked at memory usage, or investigated the event counters?
Can you post the loops?
Two probable causes
1) you are unrolling the inner loop to the point where you exceed the size of the instruction cache.
2) The size of the data being manipulated exceeds a cache size (bumping into next cache level or RAM)
Jim Dempsey
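One way to check the second cause is to estimate how many bytes the loop nest touches and compare that figure with each cache level. The sketch below is illustrative only: the thread never shows the data layout, so the assumption that every (outer, inner) iteration touches one distinct element, and the names `working_set_bytes` and `smallest_fitting_level`, are hypothetical. The cache sizes are the ones the original poster reports for his system (32 KB L1 data, 6 MB L2).

```c
#include <stddef.h>

/* Cache sizes reported by the original poster, in bytes. */
#define L1D_BYTES ((size_t)32 * 1024)
#define L2_BYTES  ((size_t)6 * 1024 * 1024)

/* Hypothetical model: one distinct element of elem_size bytes
 * is touched per (outer, inner) iteration pair. */
static size_t working_set_bytes(size_t outer, size_t inner, size_t elem_size)
{
    return outer * inner * elem_size;
}

/* Returns 1 if the footprint fits in L1 data cache, 2 if it fits in L2,
 * and 3 if it is larger than L2 and must be served from RAM. */
static int smallest_fitting_level(size_t bytes)
{
    if (bytes <= L1D_BYTES) return 1;
    if (bytes <= L2_BYTES)  return 2;
    return 3;
}
```

Under these assumptions, `working_set_bytes(1200, 300, 4)` is 1,440,000 bytes, which is well past L1 but inside L2. Whether the real code behaves this way depends on the actual element size and access pattern, which only the posted loops could confirm.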
Quoting - biplabraut
Dear all,
I have been facing this problem for two days. My code has two nested for loops (an outer and an inner loop) whose performance I measure using QueryPerformanceCounter. The moment I raise the iteration count of the inner or outer loop beyond a certain value, there is a sudden jump (of several hundred) in the result obtained from QueryPerformanceCounter, so the measured performance of this piece of code is much worse than expected. These loops contain only simple statements (no conditionals). Another, similar nested loop that does contain conditional statements does not show this behavior.
Any clues in this regard will be really helpful and appreciated.
With regards,
S. Biplab Raut
Current optimizing compilers do not, in general, apply outer-loop vectorization. Why don't you focus only on the inner loop and check its benefits?
Could you share the code, or maybe a sample of it?
~BR
Quoting - tim18
Have you compared your results with typical tests for effective cache sizes, looked at memory usage, or investigated the event counters?
There is no problem in the SIMD equivalent.
Can you suggest something on this?
Quoting - jimdempseyatthecove
Can you post the loops?
Two probable causes
1) you are unrolling the inner loop to the point where you exceed the size of the instruction cache.
2) The size of the data being manipulated exceeds a cache size (bumping into next cache level or RAM)
Jim Dempsey
But the function is in C, and I am not unrolling as in SIMD.
The cache sizes on my system are L1 2 x 32 KB (code and data) and L2 6 MB. The two for loops have the form for(){ for(){ } } in the code. When the loop counters are run up to different values in different test runs, for example the outer one to 1200 and the inner one to 300, the performance degrades many fold.
There is no problem in the SIMD equivalent.
The C function's performance degrades drastically once the counters (x, y) of the two loops (outer and inner) pass certain values. For example, when (x, y) is (1200, 300), the time measured by QPC is 570 us, but when (x, y) becomes (1200, 301), it degrades to 1890 us.
Awaiting your reply.
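To put these numbers in perspective, the reported times can be converted to an average cost per inner-loop iteration. This is only a rough calculation: it assumes the 570 us and 1890 us figures each cover one complete run of the loop nest, which the thread does not confirm, and the helper name `ns_per_iteration` is made up for illustration.

```c
/* Average time per inner-loop iteration in nanoseconds, assuming
 * total_us is the time for one full run of outer * inner iterations. */
static double ns_per_iteration(double total_us, long outer, long inner)
{
    return (total_us * 1000.0) / ((double)outer * (double)inner);
}

/* Ratio of two measured totals, e.g. 1890 us against 570 us. */
static double slowdown_factor(double slow_us, double fast_us)
{
    return slow_us / fast_us;
}
```

With the figures above, (1200, 300) works out to roughly 1.58 ns per iteration and (1200, 301) to roughly 5.23 ns, a slowdown of about 3.3x for an iteration count that grows by only about 0.3%. A jump of that shape points to a threshold effect rather than to the extra work itself.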
Quoting - biplabraut
Quoting - jimdempseyatthecove
Can you post the loops?
Two probable causes
1) you are unrolling the inner loop to the point where you exceed the size of the instruction cache.
2) The size of the data being manipulated exceeds a cache size (bumping into next cache level or RAM)
Jim Dempsey
But the function is in C, and I am not unrolling as in SIMD.
The cache sizes on my system are L1 2 x 32 KB (code and data) and L2 6 MB. The two for loops have the form for(){ for(){ } } in the code. When the loop counters are run up to different values in different test runs, for example the outer one to 1200 and the inner one to 300, the performance degrades many fold.
There is no problem in the SIMD equivalent.
Awaiting your reply.
If the code is C, the current compiler will perform at least minimal vectorization at its default SSE level. ICC v11.0 defaults to SSE2, so it will vectorize your application to at least the SSE2 level even if SSE2 is not specified on the command line.
~BR
Quoting - srimks
Can you simply apply "pragma unroll (4)" at the beginning of the OUTER LOOP?
If the code is C, the current compiler will perform at least minimal vectorization at its default SSE level. ICC v11.0 defaults to SSE2, so it will vectorize your application to at least the SSE2 level even if SSE2 is not specified on the command line.
~BR
I tried #pragma unroll(4) and also #pragma unroll(1) before the outer for loop, but it did not improve matters. The function's performance still degrades drastically once the counters (x, y) of the two loops (outer and inner) pass certain values. For example, when (x, y) is (1200, 300), the time measured by QPC is 570 us, but when (x, y) becomes (1200, 301), it degrades to 1890 us.
With Regards...
I can't see whether the likely suggestion of a data cache capacity issue was ever confirmed. If so, unroll-and-jam techniques are off the mark, until a cache blocking scheme is in use. It would take only about 2 sentences to tell us the size in bytes of your data array and of your caches, which we tried to persuade you to consider days ago.
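For readers unfamiliar with the term, a cache blocking (tiling) scheme restructures a loop nest so that it works on one small tile of the data at a time. The sketch below applies the idea to a matrix transpose, which is a stand-in chosen for illustration and not the poster's code, since that was never shown. The tile edge `BLOCK` and the function name `transpose_blocked` are assumptions; 32 x 32 floats is 4 KB per tile, so a source and a destination tile fit together in a 32 KB L1 data cache.

```c
#include <stddef.h>

#define BLOCK 32  /* hypothetical tile edge; tune for the target cache */

/* Transpose a rows x cols row-major matrix src into dst (cols x rows),
 * visiting the data one BLOCK x BLOCK tile at a time. */
static void transpose_blocked(const float *src, float *dst,
                              size_t rows, size_t cols)
{
    for (size_t ib = 0; ib < rows; ib += BLOCK) {
        for (size_t jb = 0; jb < cols; jb += BLOCK) {
            /* Clamp the tile so partial tiles at the edges are handled. */
            size_t i_end = (ib + BLOCK < rows) ? ib + BLOCK : rows;
            size_t j_end = (jb + BLOCK < cols) ? jb + BLOCK : cols;
            for (size_t i = ib; i < i_end; ++i)
                for (size_t j = jb; j < j_end; ++j)
                    dst[j * rows + i] = src[i * cols + j];
        }
    }
}
```

The result is the same as a plain two-loop transpose; only the order in which elements are visited changes, so that each tile is reused while it is still resident in cache.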
Do the changes in x and y (loop iteration) affect indexing of data (i.e. using a larger data set)?
Can you copy and paste the code?
Jim
Quoting - biplabraut
Hi,
I tried #pragma unroll(4) and also #pragma unroll(1) before the outer for loop, but it did not improve matters. The function's performance still degrades drastically once the counters (x, y) of the two loops (outer and inner) pass certain values. For example, when (x, y) is (1200, 300), the time measured by QPC is 570 us, but when (x, y) becomes (1200, 301), it degrades to 1890 us.
With Regards...
Could you tell us a few things:
(a) Which data types are used in the code, and is the code a package of multiple C files or a single file?
(b) What does your code basically do? Can you describe it briefly?
(c) Since it is in C, it presumably uses "struct"; could you check whether SoA (Structure of Arrays) is needed?
(d) Could you try "Loop-Blocking" on the OUTER and INNER LOOP together?
(e) Could you tell me the options used to build this file, I mean the CFLAGS, CPPFLAGS, LDFLAGS, etc.?
(f) Could you try loop splitting?
(g) Could you share the statements inside the INNER LOOP?
~BR
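As a brief illustration of point (c), the sketch below contrasts an Array of Structures with a Structure of Arrays for a made-up three-field point type. The types `point_aos` and `points_soa`, the element count `N`, and both functions are hypothetical; nothing in the thread says the poster's code looks like this.

```c
#include <stddef.h>

#define N 4  /* tiny hypothetical element count */

/* Array of Structures: consecutive x values are sizeof(struct point_aos)
 * bytes apart, so a loop over x also pulls y and z into the cache. */
struct point_aos { float x, y, z; };

/* Structure of Arrays: all x values are contiguous in memory. */
struct points_soa { float x[N]; float y[N]; float z[N]; };

/* Sum the x field of n points stored as AoS (strided access). */
static float sum_x_aos(const struct point_aos *p, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += p[i].x;
    return sum;
}

/* Sum the first n x values stored as SoA (unit-stride access); n <= N. */
static float sum_x_soa(const struct points_soa *p, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += p->x[i];
    return sum;
}
```

Both functions compute the same sum. The SoA version reads only the bytes it needs and has unit stride, which is generally easier on the cache and easier for a compiler to vectorize.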
Quoting - tim18
I can't see whether the likely suggestion of a data cache capacity issue was ever confirmed. If so, unroll-and-jam techniques are off the mark, until a cache blocking scheme is in use. It would take only about 2 sentences to tell us the size in bytes of your data array and of your caches, which we tried to persuade you to consider days ago.
Can you explain a bit more what you meant in your comment?
Without telling us the exact data size processed per INNER loop iteration there is no point in discussing this further, especially if you cannot create and post a reproducible test case. There are many variables that may affect performance and without knowing them every attempt at helping you is just a stab in the dark.
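A reproducible test case for this kind of question can be quite small. The following is a minimal sketch of what one might look like, not a reconstruction of the poster's code: the kernel body, the `int` element type, and the names `run_loop_nest` and `time_loop_nest` are all assumptions. It uses the standard `clock()` function instead of QueryPerformanceCounter so that it builds on any platform, at the cost of coarser resolution.

```c
#include <stdlib.h>
#include <time.h>

/* Hypothetical kernel: simple statements per inner iteration,
 * no conditionals. Returns a checksum so the work is observable. */
static long run_loop_nest(int *data, size_t outer, size_t inner)
{
    long sum = 0;
    for (size_t x = 0; x < outer; ++x) {
        for (size_t y = 0; y < inner; ++y) {
            data[x * inner + y] += 1;
            sum += data[x * inner + y];
        }
    }
    return sum;
}

/* Runs the loop nest once on a zero-initialized buffer and returns the
 * time reported by clock() in seconds, or -1.0 if allocation fails.
 * The checksum is stored through the checksum pointer. */
static double time_loop_nest(size_t outer, size_t inner, long *checksum)
{
    int *data = calloc(outer * inner, sizeof *data);
    if (data == NULL)
        return -1.0;
    clock_t start = clock();
    *checksum = run_loop_nest(data, outer, inner);
    clock_t stop = clock();
    free(data);
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Calling `time_loop_nest(1200, 300, &sum)` and `time_loop_nest(1200, 301, &sum)` several times each, and posting the element type, the byte size of the buffer (`outer * inner * sizeof(int)` here), and the compiler options along with the timings, would give others something concrete to run.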