I am working on an image processing problem that is similar in many ways to a stencil computation. When I compile with the "-g" option, my program runs about 30% faster than the naive version. However, if I compile with any optimization option such as "-O3", or even just without "-g", my code is considerably slower than the naive version (as much as two times slower for large images). Can anyone suggest where I should look for the solution?
I am using icpc as the compiler, and I have tried my code on many machines (Xeon, Opteron, Core i7, etc.) with similar performance everywhere. The images are converted into single-precision arrays using the CImg library, and then I operate on those arrays.
The reason my code should be fast is that I use 1) data-level blocking and 2) in-place storage, as opposed to the out-of-place storage of the naive version.
-g without any -O option implies -O0. -O2 and -O3 optimize for loop trip counts of at least 100; if your trip counts are small enough, it's possible the compiler makes the wrong assumptions when optimizing. -O1 is less likely to encounter such problems. You could also try -unroll0; I've seen it help even for fairly large trip counts. Profile-guided optimization (-prof-gen ... -prof-use) is intended to help the compiler make better assumptions for optimization. The 12.0 (XE 2011) compiler brings back #pragma loop count(10) as an alternative to PGO, to tell the compiler you want it to optimize for a target loop length of 10.