ICC 12.1: Bug in vectorization/optimization support?!

thesaint87 · ‎01-19-2012

Hi,

Yesterday I posted an innocent question on Stack Overflow comparing how MSVC, ICC and Java handle a simple for-loop (http://stackoverflow.com/questions/8919834/c-vs-java-simple-loop-shows-frustrating-results). The results were:

Intel x86 (without vectorization): 3 seconds
MSVC x64: 5 seconds
Java x86/x64: 7 seconds
Intel x64 (with vectorization): 9.5 seconds
Intel x86 (with vectorization): 9.5 seconds
Intel x64 (without vectorization): 12 seconds
MSVC x86: 15 seconds (uhh)

The bottom line is that ICC seems to fail at vectorizing this simple loop, and the x64 build is also very slow.

Additionally, the auto-parallelization fails as well. If enabled, the code consumes about 70% CPU instead of 25%, but doesn't run any faster, and is even 3 times slower than the winning build (which runs on one thread!).

This is really strange!

Since I want to create high-performance programs and thought ICC would come in handy, I am a bit shocked by these results. Maybe someone can shed some light on this?

PS: I don't think posting the entire question here is a good idea ;). Just follow the link, where you can find the source code as well as disassemblies of the ICC and MSVC builds.

regards

chris

TimP · ‎01-19-2012

Anything can happen when you time sections of code whose result is unused. You should look at generated code and compiler reports to see what is happening, besides writing a benchmark which isn't full of dead code.
icc 12 expects you to try the #pragma reduction (+:.... for vectorization of sum reduction.
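
A minimal sketch of both points, assuming a compiler that understands the OpenMP 4.0 `simd` directive. That directive is a stand-in here: the exact ICC 12 spelling of the reduction pragma may differ, and a compiler without OpenMP SIMD support just ignores the line. The name `sum_array` is hypothetical. The sum is returned to the caller so that the loop cannot be thrown away as dead code.

```cpp
#include <cstddef>

// Hypothetical helper: sums n ints into a 64-bit accumulator.
// The pragma is the standard OpenMP form, used as a stand-in;
// a compiler that does not recognise it ignores it.
long long sum_array(const int* data, std::size_t n)
{
    long long sum = 0;
    #pragma omp simd reduction(+:sum)
    for (std::size_t i = 0; i < n; ++i)
        sum += data[i];
    return sum; // returned so the caller can print or check it
}
```

Printing or comparing the returned value in the caller is what keeps the timed loop alive.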

styc · ‎01-19-2012

Yet another example of why mixed-type vectorization should be avoided. Without PMOVSXDQ from SSE 4.1, vector sign extension is just super-awkward.
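
One way to keep the hot loop single-typed, sketched under the assumption that one pass over the array fits in an int (true for 1024 values below 1024): accumulate in int inside the inner loop and widen to long long once per pass. The name `sum_passes` is hypothetical, and whether a given compiler vectorizes this better is something to measure, not assume.

```cpp
// Hypothetical rewrite of the benchmark's loop nest.
// Assumes the sum of one pass over the array fits in an int.
long long sum_passes(const int* data, int n, int passes)
{
    long long total = 0;
    for (int p = 0; p < passes; ++p)
    {
        int partial = 0;            // same type as the elements
        for (int x = 0; x < n; ++x)
            partial += data[x];     // no int -> long long widening here
        total += partial;           // one widening per pass
    }
    return total;
}
```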

SergeyKostrov · ‎01-19-2012

It is not recommended to use STL templates when doing performance evaluations of different
C/C++ compilers because they have different implementations of STL.

Ideally, a test code has to be written in C and should be as generic as possible. This is because C++
operators, like '=', '+=', '*', '[]', etc., could be used indirectly. And of course all of them will have different
implementations depending on the C/C++ compiler.

In your test code C++ operators are not used since 'arrPtr' is used instead of 'arr'.

Did you test in Debug or Release configuration in all cases?

In your test code:
...
long long var = 0;
std::array<int, 1024> arr;
int *arrPtr = arr.data();
CHighPrecisionTimer timer;

for(int i = 0; i < 1024; i++)
    arrPtr[i] = i;

timer.Start();

for(int i = 0; i < 1024 * 1024 * 10; i++)
{
    for(int x = 0; x < 1024; x++)
    {
        var += arrPtr[x];
    }
}

timer.Stop();
...

there is a "hidden" cast from 'int' to 'long long' here:
...
var += arrPtr[x];
...
but I'm not sure that it affects performance results. I would try to compare with a type 'float' as well.

Java performance results are good but I would rather compare the performance of Java code with
the performance of .NET code, instead of C/C++ code.

Best regards,
Sergey

thesaint87 · ‎01-19-2012

@TimP:

If you look closer you would see that the result is used...

If you look closer at the link to stackoverflow you would see that there is plenty of compiler generated code posted...

You would see, that the benchmark contains absolutely no dead code...

Well, the last sentence is actually the only one that makes sense, I might try that later!

@Sergey:

There is no STL involved. The array gets cast to a pointer...

The hidden conversion is a point but should not cause any trouble since it is an increase of precision.

Why compare Java and C# if I am writing my software in C++ ^^. After all, this was meant to be just a private test, but it obviously yielded unexpected results caused by misbehaviour of the ICC optimizations... ICC 11 gets it right out of the box (validated by someone else on Stack Overflow who has it). So the quirky results will have their origin in all the fancy optimization upgrades that came with ICC 12 and don't seem to work so well?!

TimP · ‎01-19-2012

Yes, the difference in overhead between 32-bit to 64-bit long promotion on platforms which have a native 64-bit int and others which don't might be expected to affect the timing comparison.
It would be easier to simply add in a check of the result to make certain that one is actually produced and that the change in data types doesn't affect it.
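
Such a check could look like the sketch below. It assumes the array holds 0 to n-1 as in the posted test, so the expected total has a closed form. `expected_total` and `result_is_valid` are hypothetical names.

```cpp
// Expected value of the benchmark's sum when the array holds 0..n-1
// and is summed `passes` times. Computed in 64 bits to avoid overflow.
long long expected_total(long long n, long long passes)
{
    return passes * (n * (n - 1) / 2);
}

// Returns true when the measured sum matches the closed form.
bool result_is_valid(long long measured, long long n, long long passes)
{
    return measured == expected_total(n, passes);
}
```

For the posted loop counts the call would be `result_is_valid(var, 1024, 1024LL * 1024 * 10)`. The expected total is far beyond the range of a 32-bit int, so the accumulator type does matter for correctness.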

thesaint87 · ‎01-19-2012

@TimP: I mean the printf() of the result should be enough?!

Additionally, this is not about subtleties in the code itself. It's about the fact that other compilers, and obviously also ICC 11, get it right, but ICC 12 doesn't. So there is not much to argue about the code, I guess. Of course the creators of ICC 12 could defend themselves like that, but it is not really a great position to be in ;). If it fails to generate good code for this trivial example, how could it for much more complex loop semantics...

TimP · ‎01-19-2012

Yes, displaying the result with printf() prevents possible elimination of code (or catches it, if done incorrectly).

thesaint87 · ‎01-19-2012

Just stumbled upon another, even more disastrous example. With all optimizations maxed (except vectorization and auto-parallelization, which in my examples have proven to fail anyway so far), the Intel Compiler manages to be 90 times !! slower than MSVC 2010... I mean this is a little scary, don't you think?! I suppose even C# manages to be faster with lambda expressions... So why would one use such small lambdas in loops with countless iterations?! Simply because this way you can get rid of the god damn iterators, which I really hate in C++. But this requires the compiler to generate at least "good" code for lambdas, which MSVC does, but well, what about ICC?

#include <windows.h>
#include <tchar.h>
#include <stdio.h>
#include <vector>

template<class TValue>
struct ArrayList
{
private:
  std::vector<TValue> m_Entries;
public:

  template<class TCallback>
  void Foreach(TCallback inCallback)
  {
    for(int i = 0, size = m_Entries.size(); i < size; i++)
    {
      inCallback(i);
    }
  }

  void Add(TValue inValue)
  {
    m_Entries.push_back(inValue);
  }
};

int _tmain(int argc, _TCHAR* argv[])
{
  auto t = [&]() {};


  ArrayList<int> arr;
  int res = 0;

  for(int i = 0; i < 100; i++)
  {
    arr.Add(i);
  }

  long long freq, t1, t2;

  QueryPerformanceFrequency((LARGE_INTEGER*)&freq);
  QueryPerformanceCounter((LARGE_INTEGER*)&t1);

  for(int i = 0; i < 1000 * 1000 * 10; i++)
  {
    arr.Foreach([&](int v) {
      res += v;
    });
  }

  QueryPerformanceCounter((LARGE_INTEGER*)&t2);

  printf("Time: %lld\n", ((t2-t1) * 1000000) / freq);

  if(res == 4950)
    return -1;

  return 0;
}
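
For anyone without the Windows headers, the same measurement can be sketched portably. This is only a sketch, assuming a compiler that ships C++11 `<chrono>`. `for_each_value` and `time_foreach` are hypothetical names, and the accumulator is widened to long long because with the loop counts above the total (4950 per pass times 10,000,000 passes) does not fit in a 32-bit int.

```cpp
#include <chrono>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for ArrayList::Foreach: calls the callback
// once per stored value, without exposing iterators to the caller.
template <class TCallback>
void for_each_value(const std::vector<int>& values, TCallback callback)
{
    for (std::size_t i = 0, size = values.size(); i < size; i++)
        callback(values[i]);
}

// Fills a vector with 0..entries-1, sums it `repeats` times through a
// lambda, and returns {elapsed microseconds, accumulated sum}.
std::pair<long long, long long> time_foreach(int entries, int repeats)
{
    std::vector<int> values;
    for (int i = 0; i < entries; i++)
        values.push_back(i);

    long long res = 0; // 64-bit so the full-size run cannot overflow

    auto t1 = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; r++)
        for_each_value(values, [&](int v) { res += v; });
    auto t2 = std::chrono::steady_clock::now();

    long long us = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
    return std::make_pair(us, res);
}
```

Calling `time_foreach(100, 1000 * 1000 * 10)` mirrors the loop counts above. Printing both members of the returned pair keeps the result in use.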

SergeyKostrov · ‎01-19-2012

Quoting thesaint87

...
@Sergey:

There is no STL involved. The array gets cast to a pointer...

[SergeyK] Have you read my comments carefully? Please take a look at my comment
marked with '>>'

The hidden conversion is a point but should not cause any trouble since it is an increase of precision.

[SergeyK] What kind of precision could you get on integers?

...

>>...
>>In your test code C++ operators are not used since 'arrPtr' is used instead of 'arr'.
>>...

Best regards,
Sergey