Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Intel compiler doesn't parallelize the loop

gudasergey
Beginner

The Intel compiler (for Windows, Professional v. 11.1.035) doesn't parallelize this loop:

for (long i=2; i<n; i++) x[i] = x[i-2] + f(i);

while it does parallelize the more complicated loop:

for (long i=2; i<n; i++) x[i] = x[i-k] + f(i);

where k is a number that I enter from the keyboard.

f is a simple function that takes a long time to compute and has no side effects.

Why does the compiler parallelize the complex loop but not the simple one?

Dale_S_Intel
Employee
Quoting - gudasergey

The Intel compiler (for Windows, Professional v. 11.1.035) doesn't parallelize this loop:

for (long i=2; i<n; i++) x[i] = x[i-2] + f(i);

while it does parallelize the more complicated loop:

for (long i=2; i<n; i++) x[i] = x[i-k] + f(i);

where k is a number that I enter from the keyboard.

f is a simple function that takes a long time to compute and has no side effects.

Why does the compiler parallelize the complex loop but not the simple one?


As always, it would be helpful to have a complete compilable test case to make sure we're all on the same page. In particular, I'd be surprised if either loop were parallelized, because there is a cross-iteration dependence between x[i] and x[i-2]:

[cpp]>type bug.c
int f(long i)
{
     return i;
}

foo(long N, long k, double *x)
{
    long i;

    for (i=2; i<N; i++) {
        x[i] = x[i-2] + f(i);
    }
}

>icl -c -Qparallel bug.c -Qpar-report3
Intel C++ Compiler Professional for applications running on Intel 64, Version 11.1    Build 20091012 Package ID: w_cproc_p_11.1.051
Copyright (C) 1985-2009 Intel Corporation.  All rights reserved.

bug.c
   procedure: f
   procedure: f
   procedure: foo
   procedure: foo
bug.c(10): (col. 5) remark: loop was not parallelized: existence of parallel dependence.
bug.c(11): (col. 9) remark: parallel dependence: assumed FLOW dependence between x line 11 and x line 11.

>[/cpp]
I'd be curious to see a compilable test case where this loop was parallelized. If I change the '2' to a 'k' then it also fails to parallelize because of a number of possible dependences (could be flow or anti, depending on the sign of 'k').
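
To make the two dependence directions concrete, here is a minimal sketch (with f(i) replaced by a placeholder value, and assuming x has at least N elements and i-k stays in bounds) of the pattern the compiler has to reason about when the stride is a runtime value k:

[cpp]void recurrence(long N, long k, double *x)
{
    for (long i = 2; i < N; i++) {
        /* k > 0: x[i-k] was written by an earlier iteration  -> flow (read-after-write) dependence */
        /* k < 0: x[i-k] will be written by a later iteration -> anti (write-after-read) dependence */
        /* With k unknown at compile time, the auto-parallelizer must assume either is possible.    */
        x[i] = x[i - k] + (double)i;   /* (double)i stands in for f(i) here */
    }
}
[/cpp]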

Dale


gudasergey
Beginner
It is a well-known trick: "parallelization of recurrent loops".
Let me explain. For simplicity, consider the loop:
for (i=0; i<n; i++) x[i] = g(x[i-1], i);
where g is some function. The Intel compiler makes the substitution x[i] = g(x[i-1], i) = g(g(x[i-2], i-1), i) and considers two loops, one for even and one for odd i:
for (i=0; i<n; i+=2) x[i] = g(g(x[i-2], i-1), i);

for (i=1; i<n; i+=2) x[i] = g(g(x[i-2], i-1), i);

These loops are independent of each other, and the compiler can run them in parallel!

For the case of arbitrary k:
for (i=0; i<n; i++) x[i] = g(x[i-k], i);
we get k loops, one for each residue of i: i%k==0, i%k==1, i%k==2, ..., i%k==k-1.

The question is: why does the Intel compiler handle the complex loop with arbitrary k this way, but not the simple one with k=2?
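
For k=2 the same split can also be written by hand. A minimal OpenMP sketch of the idea, assuming x holds at least n elements, x[0] and x[1] are already initialized, and f is the expensive, side-effect-free function from the original post:

[cpp]double f(long i);   /* the expensive, side-effect-free function */

void split_recurrence(long n, double *x)
{
    /* The even-index chain and the odd-index chain of x[i] = x[i-2] + f(i)
       never touch each other's elements, so they can run on separate threads. */
    #pragma omp parallel sections
    {
        #pragma omp section
        for (long i = 2; i < n; i += 2)   /* even chain: x[2], x[4], ... */
            x[i] = x[i-2] + f(i);

        #pragma omp section
        for (long i = 3; i < n; i += 2)   /* odd chain: x[3], x[5], ... */
            x[i] = x[i-2] + f(i);
    }
}
[/cpp]

Built with /Qopenmp, this splits the work across two threads regardless of whether the auto-parallelizer recognizes the recurrence.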

gudasergey
Beginner
I ran this program:

#include <iostream>
using namespace std;
#include <cmath>
#include <ctime>
#include <cstdio>
const long n=30000, m=2000;

double f(long i)
{
    double s=0;
    for (long j=0; j<m; j++) s += j*sin((double)i*j)*exp(cos((double)j));
    return cos(s);
}
int main()
{
    clock_t start, finish;
    double duration;
    double s, x[n];
    long N; long k;
    cout << "Enter number k ";
    cin >> k;
    for (long i=0; i<n; i++) x[i] = sin((double)i + k - i*k*i);
    start = clock();
    for (long i=5; i<n; i++) x[i] = x[i-2] + f(i);
    finish = clock();
    s = 0;
    for (long i=0; i<n; i++) s += x[i];
    duration = (double)(finish - start) / CLOCKS_PER_SEC;
    printf("%2.10f seconds\nSum: %2.10f\n", duration, s);

    char ccc;
    std::cin >> ccc;
    return 0;
}
I measured the time and watched the processor load in Task Manager. In the above example the processor load was between 50% and 60%. But if you change x[i-2] to x[i-k] (and enter k=2), the processor load goes to 100% and the measured time is cut in half. (I have a dual-core processor.)
Dale_S_Intel
Employee
Quoting - gudasergey
I ran this program:

#include <iostream>
using namespace std;
#include <cmath>
#include <ctime>
#include <cstdio>
const long n=30000, m=2000;

double f(long i)
{
    double s=0;
    for (long j=0; j<m; j++) s += j*sin((double)i*j)*exp(cos((double)j));
    return cos(s);
}
int main()
{
    clock_t start, finish;
    double duration;
    double s, x[n];
    long N; long k;
    cout << "Enter number k ";
    cin >> k;
    for (long i=0; i<n; i++) x[i] = sin((double)i + k - i*k*i);
    start = clock();
    for (long i=5; i<n; i++) x[i] = x[i-2] + f(i);
    finish = clock();
    s = 0;
    for (long i=0; i<n; i++) s += x[i];
    duration = (double)(finish - start) / CLOCKS_PER_SEC;
    printf("%2.10f seconds\nSum: %2.10f\n", duration, s);

    char ccc;
    std::cin >> ccc;
    return 0;
}
I measured the time and watched the processor load in Task Manager. In the above example the processor load was between 50% and 60%. But if you change x[i-2] to x[i-k] (and enter k=2), the processor load goes to 100% and the measured time is cut in half. (I have a dual-core processor.)

What options did you use to compile this file? When I build it with "-parallel", while several loops are parallelized (including the one in f()), the loop in question (with either x[i-2] or x[i-k]) is not parallelized. Perhaps the actual question is why does it go faster when you change x[i-2] to x[i-k]? Or do you have reason to believe (e.g. the par-report output or an examination of the asm file) that it actually parallelized the loop in question?

Dale

gudasergey
Beginner

What options did you use to compile this file? When I build it with "-parallel", while several loops are parallelized (including the one in f()), the loop in question (with either x[i-2] or x[i-k]) is not parallelized. Perhaps the actual question is why does it go faster when you change x[i-2] to x[i-k]? Or do you have reason to believe (e.g. the par-report output or an examination of the asm file) that it actually parallelized the loop in question?

Dale

Compiler options:

/c /O2 /Og /Ot /Qip /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /EHsc /MT /GS /arch:SSE /fp:fast /FAs /Fa"Release/" /Fo"Release/" /W3 /nologo /Wp64 /Zi /Qopenmp /Qparallel

I examined the asm file. It is quite complex. To understand it better, I decided to change
x[i] = x[i-k] + f(i);
to
x[i] = log10(abs(x[i-k] + f(i)));
The result was the same: the above example is fast. But when I put x[i-2] instead of x[i-k], the program becomes twice as slow on my dual-core processor.

The asm files are seq_loop.asm for the slow program with x[i-2] and parallel_loop.asm for the one with x[i-k]. From the asm files I looked only at the interesting loop:
for (long i=5; i<n; i++) x[i] = log10(abs(x[i-k] + f(i)));
The compiler inlined the code of function f into this loop, so the asm files also contain the loop:
for (long j=0; j<m; j++) s += j*sin((double)i*j)*exp(cos((double)j));

As you can see from the asm code, in the program with x[i-2] neither the first nor the second loop is parallel.
But in the program with x[i-k], the compiler runs the loop
for (long j=0; j<m; j++) s += j*sin((double)i*j)*exp(cos((double)j));
in parallel (it is surrounded by "call ___kmpc_serialized_parallel" and "call ___kmpc_end_serialized_parallel").
This is surprising to me. I thought it ran the loop
for (long i=5; i<n; i++) x[i] = log10(abs(x[i-k] + f(i)));
in parallel. I've checked and discovered that the Intel compiler can't parallelize recurrent loops :-(
The new question: why does the compiler not run the loop
for (long j=0; j<m; j++) s += j*sin((double)i*j)*exp(cos((double)j));
in parallel when I put x[i-2] instead of x[i-k]?
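
Since /Qopenmp is already among the build options, one way to take the inner reduction out of the auto-parallelizer's hands is to mark it explicitly. A minimal sketch, assuming f keeps the same shape as in the test program:

[cpp]#include <math.h>

const long m = 2000;

double f(long i)
{
    double s = 0;
    /* Explicit OpenMP reduction: no dependence analysis is needed here; each
       thread accumulates a private partial sum that is combined at the end. */
    #pragma omp parallel for reduction(+:s)
    for (long j = 0; j < m; j++)
        s += j * sin((double)i * j) * exp(cos((double)j));
    return cos(s);
}
[/cpp]

Note that the ___kmpc_serialized_parallel calls seen in the asm usually mean a parallel region is entered while another parallel context is already active (or with nested parallelism disabled), so that inner region runs on a single thread.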

Dale_S_Intel
Employee
Quoting - gudasergey

Compiler options:

/c /O2 /Og /Ot /Qip /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /EHsc /MT /GS /arch:SSE /fp:fast /FAs /Fa"Release/" /Fo"Release/" /W3 /nologo /Wp64 /Zi /Qopenmp /Qparallel

I examined the asm file. It is quite complex. To understand it better, I decided to change
x[i] = x[i-k] + f(i);
to
x[i] = log10(abs(x[i-k] + f(i)));
The result was the same: the above example is fast. But when I put x[i-2] instead of x[i-k], the program becomes twice as slow on my dual-core processor.

The asm files are seq_loop.asm for the slow program with x[i-2] and parallel_loop.asm for the one with x[i-k]. From the asm files I looked only at the interesting loop:
for (long i=5; i<n; i++) x[i] = log10(abs(x[i-k] + f(i)));
The compiler inlined the code of function f into this loop, so the asm files also contain the loop:
for (long j=0; j<m; j++) s += j*sin((double)i*j)*exp(cos((double)j));

As you can see from the asm code, in the program with x[i-2] neither the first nor the second loop is parallel.
But in the program with x[i-k], the compiler runs the loop
for (long j=0; j<m; j++) s += j*sin((double)i*j)*exp(cos((double)j));
in parallel (it is surrounded by "call ___kmpc_serialized_parallel" and "call ___kmpc_end_serialized_parallel").
This is surprising to me. I thought it ran the loop
for (long i=5; i<n; i++) x[i] = log10(abs(x[i-k] + f(i)));
in parallel. I've checked and discovered that the Intel compiler can't parallelize recurrent loops :-(
The new question: why does the compiler not run the loop
for (long j=0; j<m; j++) s += j*sin((double)i*j)*exp(cos((double)j));
in parallel when I put x[i-2] instead of x[i-k]?

It is an interesting question. I'll look into it in a little more detail and see what I can figure out.

Dale
