I think I have encountered an issue similar to what Chuck DeSylva found (as described in Intel Visual Adrenalin, Issue No. 7, 2010, Page 10, last para):
According to deSylva, Cryptic wanted Star Trek Online to
be able to run at 25 frames per second, which is typical for
MMOs. Cryptic also wanted to support an Über Shader.
They didn't want to have to load multiple shaders for
different materials, so they had one huge shader, which they
if'd out in sections for different lighting scenarios. But the
shader wasn't getting compiled properly. I was able to use
various experiments to determine that it was stalling out on
the back end of our GPU pipeline. When we turned off the
Über Shader, we basically doubled the performance. However,
we found that the Über Shader was running fine; it was
actually a problem in our compilation. So it was kind of an
interesting case where we helped them to help us. It was a
good engagement from that standpoint.
Can someone at Intel please give me Chuck DeSylva's contact so I can find out how they resolved this issue?
I have a very long piece of code, and we use #ifdefs to conditionally compile only parts of it within a block of if-else statements. What I have observed is that performance suddenly jumps 40% when I permanently comment out part of this code. This is odd: the compiler's optimizations seem to stop working once the code becomes longer.
I would really appreciate it if someone from the Intel C++ compiler team could help me out here.
Thanks!
2 Replies
I don't have the organizational qualification you requested, but likely possibilities include exceeding one or more of the compiler's default inlining quotas, which may be increased with the /Qinline- options (at your own risk).
I can't tell whether you mean that you are programming for a GPU (there is no released compiler for that which would be on topic here).
>>What I have observed is that the performance suddenly jumps 40% when I comment out some part of this code permanently.
Do you mean
#if 0
// commented out permanently
...
#endif
As opposed to
#if defined(MacroNotDefinedHere)
// commented out conditionally
...
#endif
Or do you mean cut the code out of the source file?
Also, you may be taxing your register usage and/or using your L2/L3 cache system inefficiently.
Register usage might be improved by reworking the loop nest order and/or moving code out of line (into a function call). Inlining does not always improve performance, in particular where the inlined code of a high-activity loop does not fit within the instruction cache but the non-inlined code would.
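As a rough illustration of moving code out of line, here is a minimal sketch. The NOINLINE macro and the function names shade_rare_case and shade_all are hypothetical, not taken from your code; the idea is only that a rarely taken branch becomes a real call so the body of the hot loop stays small. Whether this helps in your case has to be measured.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// NOINLINE is a hypothetical convenience macro for this sketch.
// Compilers that define _MSC_VER accept __declspec(noinline);
// GCC-compatible compilers accept __attribute__((noinline)).
#if defined(_MSC_VER)
#  define NOINLINE __declspec(noinline)
#else
#  define NOINLINE __attribute__((noinline))
#endif

// Rarely taken path, kept out of line so it does not bloat the hot loop.
NOINLINE float shade_rare_case(float x) {
    return x * 0.5f + 1.0f;
}

// Hot loop: the common case is computed in place, the rare case is a call.
float shade_all(const std::vector<float>& in) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] < 0.0f) {
            sum += shade_rare_case(in[i]);  // out-of-line call
        } else {
            sum += in[i] * 2.0f;            // common case
        }
    }
    return sum;
}
```

Comparing timings with and without NOINLINE on the large branches would show whether code size in the hot loop is what costs you the 40%.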
Inefficient use of your L2/L3 cache system may be addressed by the strategy you use to pass the data through your shader filters:
Method 1:
  grab all of data
    filter-1 on all data
    filter-2 on all data
    ...
    filter-n on all data
  end grab all data
Method 2:
  for (slice = 0; slice < nSlices; ++slice)
    grab slice of data
      filter-1 on slice of data
      filter-2 on slice of data
      ...
      filter-n on slice of data
    end grab slice of data
  end for
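The two methods above could look roughly like this in C++. This is only a sketch: filter1 and filter2 are made-up stand-ins for your shader stages, and it assumes each filter works element by element, so that filtering slice by slice gives the same result as filtering the whole buffer. Filters that read neighbouring elements would need overlapping slices.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in element-wise filters (hypothetical, for illustration only).
inline void filter1(float* p, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) p[i] *= 2.0f;
}
inline void filter2(float* p, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) p[i] += 1.0f;
}

// Method 1: each filter sweeps the whole buffer before the next one starts.
void run_whole(std::vector<float>& data) {
    filter1(data.data(), data.size());
    filter2(data.data(), data.size());
}

// Method 2: every filter runs on one slice before moving to the next slice,
// so a slice that fits in cache is reused by all filters.
void run_sliced(std::vector<float>& data, std::size_t sliceSize) {
    assert(sliceSize > 0);
    for (std::size_t begin = 0; begin < data.size(); begin += sliceSize) {
        const std::size_t n = std::min(sliceSize, data.size() - begin);
        filter1(data.data() + begin, n);
        filter2(data.data() + begin, n);
    }
}
```

The slice size would be chosen so that one slice plus any filter state fits in the cache level you are targeting; the best value is something to find by experiment.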
And for either/both methods, are you using multi-threaded programming (OpenMP, Cilk++, TBB, other)?
If so, how are you parallelizing the work: intra-filter or inter-filter?
If inter-filter, have you explored parallel_pipeline techniques?
Jim Dempsey
