I'm currently experiencing some performance problems using the icpc compiler in the composer_xe_2015 suite.
I attach below the very simple driver I have used. It has been compiled with the -O3 flag, the operating system is SUSE 2011 SP1,
and the computer is a single node of an Intel cluster with two-socket Xeon X5675.
The driver takes nearly 33 seconds to run; if I remove the last two lines
std::set<double> vect;
vect.insert(2.0);
then I get 16 seconds. That seems strange. I have also built the code with g++ 4.3.4 and I get 16 seconds there as well.
What am I missing? If I reduce the optimization level I do not get any improvement.
Thanks.
Andrea
#include <cmath>
#include <iostream>
#include <time.h>
#include <vector>
#include <set>
using namespace std;
int main(int argc, char *argv[])
{
    //Matmul---------------------------------------------------------------------
    static int N = 300;
    vector< vector<double> > A(N), B(N), C(N);
    time_t start, end;
    for(int i=0; i < N; ++i)
    {
        A[i].resize(N);
        B[i].resize(N);
        C[i].resize(N);
    }
    for(int i=0; i < N; ++i)
    {
        for(int j=0; j < N; ++j)
        {
            A[i][j] = ...; // right-hand side missing from the post
            B[i][j] = ...; // right-hand side missing from the post
        }
    }
    cout << "Start "; time(&start); cout << endl;
    for(int z=0; z < 300; ++z)
    {
        for(int i=0; i < N; ++i)
        {
            for(int j=0; j < N; ++j)
            {
                C[i][j] = 0.0;
                for(int k=0; k < N; ++k)
                { C[i][j] += A[i][k] * B[k][j]; }
            }
        }
        for(int i=0; i < N; ++i)
        {
            for(int j=0; j < N; ++j)
            { A[i][j] = C[i][j]; }
        }
    }
    time(&end); cout << "done (" << difftime(end, start) << " s)" << endl << endl;
    double tot = 0.0;
    for(int i=0; i < N; ++i)
    {
        for(int j=0; j < N; ++j)
        { tot += A[i][j]; }
    }
    cout << "tot " << tot << endl;
    //Finder
    std::set<double> vect;
    vect.insert(2.0);
}
I appreciate the disclaimer, but I am using an Intel Xeon CPU.
Hi Andrea,
I can reproduce the issue on an SNB machine with the latest 15.0 Update 1 compiler. This is interesting: the std::set code looks unrelated, yet it affects the performance of the code above.
I'll take a closer look and track it in our problem-tracking system.
Thanks,
Shenghong
Hi Andrea,
Checking the generated asm, I can see the following for the statement that assigns to A:
ICC with the std::set code (30s): it invokes fast_memcpy.
ICC without the std::set code (16s): it uses movq (scalar instruction).
G++ with/without the std::set code (16s): it uses movsq (vector instruction).
It should be similar for the assignment to C.
At first I thought this single line changed the stack layout, which might affect the optimizer. But if I replace the insert call with something like size(), the code is faster. Also, if I define vect as a global variable (so that it is not on the stack), the result is the same, which means this looks like an issue related to the "insert" function call.
I have no idea why this unrelated std::set line affects the vectorization of the loop above, but this is definitely a bug in ICC's optimization, so I will leave further investigation and explanation to the dev team. I will update you if there is any news from them.
Thanks,
Shenghong
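For anyone who wants to isolate the effect, a reduced test case along the following lines might be a starting point. This is only a sketch and not code from the thread: the function name run_case, the macro USE_SET_INSERT, and the fill values are all made up for illustration. It assumes the element-wise copy loop is the part whose code generation changes, and it keeps the std::set insert in the same function as that loop, as in the original driver. It is not guaranteed to reproduce the slowdown.

```cpp
#include <cassert>
#include <ctime>
#include <set>
#include <vector>

typedef std::vector< std::vector<double> > Matrix;

// Hypothetical reduced case: copies an n x n matrix `reps` times, stores the
// elapsed CPU time (seconds) in *seconds, and returns the sum of the copied
// elements. Build with -DUSE_SET_INSERT to add the "unrelated" std::set code.
double run_case(int n, int reps, double *seconds)
{
    assert(n > 0 && reps > 0 && seconds != 0);

    Matrix a(n, std::vector<double>(n, 0.0));
    Matrix c(n, std::vector<double>(n, 1.0)); // fill value chosen arbitrarily

    std::clock_t t0 = std::clock();
    for (int r = 0; r < reps; ++r)
    {
        // Plain element-wise copy: the loop shape a compiler may either keep
        // as scalar moves or replace with a call to a memcpy routine.
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                a[i][j] = c[i][j];
    }
    *seconds = double(std::clock() - t0) / CLOCKS_PER_SEC;

    // Consume the result so the copy cannot be discarded as unused.
    double tot = 0.0;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            tot += a[i][j];

#ifdef USE_SET_INSERT
    // Same two statements as at the end of the original driver.
    std::set<double> vect;
    vect.insert(2.0);
#endif

    return tot;
}
```

Calling run_case from a small main (for example with n = 300 and a large reps) and building it twice, once with -DUSE_SET_INSERT and once without, gives two binaries whose timings and generated assembly for the copy loop can be compared directly. std::clock is used here only because it is available with older toolchains such as g++ 4.3.4; it measures CPU time rather than wall-clock time.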
