Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Performance issue with icpc

Andrea_V_

I'm currently experiencing a performance problem with the icpc compiler from the composer_xe_2015 suite.
The very simple driver I used is attached below. It was compiled with the -O3 flag, the operating system is Suse 2011 SP1,
and the machine is a single two-socket Xeon X5675 node of an Intel cluster.
The driver takes nearly 33 seconds to run. If I remove the last two lines
  
  std::set<double> vect;
  vect.insert(2.0);

then it runs in 16 seconds. That seems strange. I also built the code with g++ 4.3.4, and that build runs in 16 seconds as well.
What am I missing? Reducing the optimization level does not bring any improvement.

Thanks.
Andrea

 


#include <cmath>
#include <iostream>
#include <time.h>
#include <vector>
#include <set>

using namespace std;


int main(int argc, char *argv[])
{
  //Matmul---------------------------------------------------------------------
  static int N = 300;
  vector< vector<double> > A(N), B(N), C(N);
  time_t start, end;
  
  for(int i=0; i < N; ++i)
  {
    A[i].resize(N);
    B[i].resize(N);
    C[i].resize(N);
  }
  
  for(int i=0; i < N; ++i)
  {
    for(int j=0; j < N; ++j)
    {
      A[i][j] = double(i*j) / (N*N*N);
      B[i][j] = double(i*j) / (N*N*N);
    }
  }
  
  cout << "Start "; time(&start); cout << endl;
  
  for(int z=0; z < 300; ++z)
  {
    for(int i=0; i < N; ++i)
    {
      for(int j=0; j < N; ++j)
      {
        C[i][j] = 0.0;

        for(int k=0; k < N; ++k)
        { C[i][j] += A[i][k] * B[k][j]; }
      }
    }
    
    for(int i=0; i < N; ++i)
    {
      for(int j=0; j < N; ++j)
      { A[i][j] = C[i][j]; }
    }
  }
  
  time(&end); cout << "done (" << difftime(end, start) << " s)" << endl << endl;
  
  double tot = 0.0;
  for(int i=0; i < N; ++i)
  {
    for(int j=0; j < N; ++j)
    { tot += A[i][j]; }
  }
    
  cout << "tot " << tot << endl;
  
  
  //Finder
  std::set<double> vect;
  vect.insert(2.0);  
}

Andrea_V_

I appreciate the disclaimer, but I am using an Intel Xeon CPU.

Shenghong_G_Intel

Hi Andrea,

I can reproduce the issue on an SNB (Sandy Bridge) machine with the latest 15.0 update 1 compiler. This is interesting: the std::set code looks unrelated, yet it affects the performance of the code above it.

I'll take a closer look and track it in our problem system.

Thanks,

Shenghong 

Shenghong_G_Intel

Hi Andrea,

Checking the generated assembly, I can see the following for the A[i][j] = C[i][j]; loop:

ICC with the std::set code (30s): it calls fast_memcpy.

ICC without the std::set code (16s): it uses movq (scalar instruction).

G++ with or without the std::set code (16s): it uses movsq (vector instruction).

It should be similar for the C[i][j] += A[i][k] * B[k][j]; loop.

At first I thought this single line changed the stack layout, which might affect the optimizer. However, if I replace the insert call with something like size(), the code runs faster. Also, if I define vect as a global variable (so it is not on the stack), the result is the same as with the local one. This suggests the issue is related to the "insert" function call itself.

I have no idea why this unrelated std::set line affects the vectorization of the loop above, but this is definitely a bug in ICC's optimizer, so I will leave further investigation and explanation to the dev team. I will update you if there is any news from them.

Thanks,

Shenghong
