speed

Bale_D_ · ‎08-13-2014

I want to speed up my programs with the Intel C++ compiler. For some programs it improves performance a lot, but for others it cannot make them faster than Visual Studio C++ does. I want to know why. Could someone give me some ideas or advice?

Om_S_Intel · ‎08-13-2014

You can analyze your application with Intel VTune Amplifier XE to find out which code segments it spends its time in, and then try to vectorize and parallelize that code. VTune can also show you the architectural reasons why your application is running slowly.

Thanks,

Om

Sukruth_H_Intel · ‎08-13-2014

Hi Bale,

That's a pretty big question, because there are a lot of things that affect performance and a lot of methods to improve it. In brief, you may want to concentrate on vectorization and parallelization to get good performance out of your application using icc on Intel architecture.

Have a look at the compiler's user and reference guides, especially the "Key features" section, which covers the various methodologies:

https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/

Regards,

Sukruth H V

Bernard · ‎08-14-2014

@Bale

You can follow @om-sachn's advice and use VTune for in-depth code analysis; pay attention to the Front-End stall and Back-End stall results. Your code's performance can also depend on its interaction with the OS, so if you are on Windows I would advise analysing it with Xperf as well. From the programming point of view, you can help the compiler recognize compile-time constants by using the constexpr keyword, so that the calculation is performed at compile time, which speeds up execution at run time.
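A minimal sketch of the constexpr suggestion, assuming a C++11 compiler. The names ipow, kBlockSize and scaled are made up for illustration. Note that marking a function constexpr only permits compile-time evaluation; it is the constexpr variable (or a static_assert) that actually requires it.

```cpp
// Hypothetical helper: integer power, usable in constant expressions (C++11).
constexpr unsigned long long ipow(unsigned long long base, unsigned exp) {
    return exp == 0 ? 1ULL : base * ipow(base, exp - 1);
}

// A constexpr variable must be initialised by a constant expression,
// so this value is computed by the compiler rather than at run time.
constexpr unsigned long long kBlockSize = ipow(2, 10);
static_assert(kBlockSize == 1024, "ipow(2, 10) is evaluated at compile time");

// With a run-time argument the multiplication still happens at run time,
// but kBlockSize itself is already a known constant.
unsigned long long scaled(unsigned n) { return n * kBlockSize; }
```

Whether this gives a measurable speed-up depends on how much work really is constant; an optimizing compiler often folds simple expressions even without the keyword.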

TimP · ‎08-14-2014

If the question is about code which runs faster under MSVC++ than ICL, it likely pertains to non-vectorizable source code. For example, ICL frequently misses scalar replacement between loop iterations (where a value is stored to memory on one iteration and read back on the next). Then you need to define a local scalar copy as well.
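As a rough illustration of what manual scalar replacement looks like (my own sketch with hypothetical function names, not code taken from a compiler report), the first function reads back a[i - 1], which was stored on the previous iteration, while the second carries that value in a local variable instead:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Straightforward form: a[i - 1] was stored on the previous iteration
// and is read back from memory on this one.
void running_sum_memory(std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    for (std::size_t i = 1; i < a.size(); ++i)
        a[i] = a[i - 1] + b[i];
}

// Manual scalar replacement: the value carried between iterations
// lives in a local scalar, so no load of a[i - 1] is needed.
void running_sum_scalar(std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    if (a.empty()) return;
    float carry = a[0];
    for (std::size_t i = 1; i < a.size(); ++i) {
        carry += b[i];
        a[i] = carry;
    }
}
```

Both versions perform the same additions in the same order, so they should give identical results; only the memory traffic differs.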

Less often, there are vectorizable cases to be solved in a similar way:

      float tmp = a[1];
      for (i__ = 2; i__ <= i__2; ++i__)
          a[i__] = tmp + b[i__];

but the following case is non-vectorizable, and ICL won't optimize it unless you help by scalar-replacing the serial dependency yourself:

          float tmp = aa[1 + j*aa_dim1];
          for (i__ = 2; i__ <= i__3; ++i__)
              aa[i__ + j * aa_dim1] = tmp += aa[i__ + (j - 1) * aa_dim1];

-Qunroll2 may help as well (-Qunroll4 with 15.0 compiler). Default automatic unrolling selection is worse for non-vectorizable code.

In cases where ICL is too aggressive with vectorization (when __restrict is in use), you can override it with pragmas:

float *__restrict a, float *__restrict b, float *__restrict d__, float *__restrict c__;

#ifndef __MIC__
#pragma novector
#pragma unroll(4)
#endif
      for (i__ = 1; i__ <= i__2; ++i__) {
          a[i__] = b[k = j + 1] - d__[i__];
          j = k + 1;
          b[k] = a[i__] + c__[i__];
      }

Note that either vectorization or insufficient unrolling would account for the performance deficit compared with CL, but the Intel(r) Xeon Phi(tm) platform needs vectorization here, so it seems that Intel C++ doesn't adequately differentiate optimizations among platforms.
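For anyone who wants to try this, here is a self-contained sketch of that kind of loop. It is my reconstruction under assumptions: 0-based indexing, a hypothetical function name strided_update, and b accessed with stride 2 so that it must hold at least 2*n elements. The Intel pragmas are guarded so that other compilers simply skip them.

```cpp
#include <cstddef>

// Assumed reconstruction: a[i] depends on b at odd indices 1, 3, 5, ...
// and the same element of b is then overwritten.
// Precondition: a, c, d hold n elements; b holds at least 2 * n elements.
void strided_update(float* __restrict a, float* __restrict b,
                    const float* __restrict c, const float* __restrict d,
                    std::size_t n) {
    std::size_t j = 0;
    // Only the Intel compiler for the host gets the pragmas;
    // on Xeon Phi (__MIC__) vectorization is left enabled.
#if defined(__INTEL_COMPILER) && !defined(__MIC__)
#pragma novector
#pragma unroll(4)
#endif
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t k = j + 1;  // strided index into b
        a[i] = b[k] - d[i];
        j = k + 1;
        b[k] = a[i] + c[i];
    }
}
```

Timing this with and without the pragmas on your own hardware is the only reliable way to tell whether the override helps.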