I think I have encountered an issue similar to what Chuck DeSylva found (as described in Intel Visual Adrenaline, Issue No. 7, 2010, page 10, last paragraph):
According to DeSylva, Cryptic wanted Star Trek Online to be able to run at 25 frames per second, which is typical for MMOs. Cryptic also wanted to support an Über Shader.
They didn't want to have to load multiple shaders for different materials, so they had one huge shader, which they #ifdef'd out in sections for different lighting scenarios. But the shader wasn't getting compiled properly. I was able to use various experiments to determine that it was stalling out on the back end of our GPU pipeline. When we turned off the Über Shader, we basically doubled the performance. However, we found that the Über Shader was running fine; it was actually a problem in our compilation. So it was kind of an interesting case where we helped them to help us. It was a good engagement from that standpoint.
Can someone at Intel please give me Chuck DeSylva's contact so I can find out how they resolved this issue?
I have code that is very long, and we use #ifdefs to conditionally compile only parts of it within a block of if-else statements. What I have observed is that performance suddenly jumps 40% when I permanently comment out part of this code. This is weird, as the compiler's optimizations seem to stop working once the code becomes longer.
I would really appreciate it if someone from the Intel C++ compiler team could help me out here.
I don't have the organizational qualification you requested, but likely possibilities include exceeding one or more of the compiler's default inlining quotas, which may be raised via the /Qinline- family of options (at your own risk). I can't tell whether you mean to imply you are programming for a GPU (not a released compiler, which would be topical here).
Or do you mean cut the code out of the source file?
Also, you may be taxing your register usage and/or using your L2/L3 cache system inefficiently.
Register usage might be improved by reworking the loop nest order and/or moving code out of line (making it a function call). Inlining does not always improve performance, in particular where the inlined code (of a high-activity loop) does not fit within the instruction cache but the non-inlined code would.
Inefficient use of your L2/L3 cache system may be addressed by the strategy of how you pass the data through your shader filters:
grab all of data
    filter-1 on all data
    filter-2 on all data
    ...
    filter-n on all data
end grab all data
for (slice = 0; slice < nSlices; ++slice)
    grab slice of data
        filter-1 on slice of data
        filter-2 on slice of data
        ...
        filter-n on slice of data
    end grab slice of data
end for
And for either/both methods, are you using multi-threaded programming (OpenMP, Cilk++, TBB, other)? If so, how are you parallelizing the work: intra-filter or inter-filter? If inter-filter, have you explored parallel_pipeline techniques?