Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Intel compiler doesn't parallelize the loop

gudasergey
Beginner

The Intel compiler (for Windows, Professional v. 11.1.035) doesn't parallelize this loop:

for (long i=2; i<n; i++) x[i] = x[i-2] + f(i);

while it parallelizes the more complicated loop:

for (long i=2; i<n; i++) x[i] = x[i-k] + f(i);

where k is a number that I enter from the keyboard.

f is a simple function that takes a long time to compute and has no side effects.

Why does the compiler parallelize the complex loop, but not the simple one?

Dale_S_Intel
Employee
Quoting - gudasergey

As always, it would be helpful to have a complete compilable test case to make sure we're all on the same page. In particular, I'd be surprised if either loop were parallelized, because there is a cross-iteration dependence between x[i] and x[i-2]:

[cpp]>type bug.c
int f(long i)
{
     return i;
}

foo(long N, long k, double *x)
{
    long i;

    for (i=2; i<N; i++) {
        x[i] = x[i-2] + f(i);
    }
}

>icl -c -Qparallel bug.c -Qpar-report3
Intel C++ Compiler Professional for applications running on Intel 64, Version 11.1    Build 20091012 Package ID: w_cproc_p_11.1.051
Copyright (C) 1985-2009 Intel Corporation.  All rights reserved.

bug.c
   procedure: f
   procedure: f
   procedure: foo
   procedure: foo
bug.c(10): (col. 5) remark: loop was not parallelized: existence of parallel dependence.
bug.c(11): (col. 9) remark: parallel dependence: assumed FLOW dependence between x line 11 and x line 11.

>[/cpp]
I'd be curious to see a compilable test case where this loop was parallelized. If I change the '2' to a 'k' then it also fails to parallelize because of a number of possible dependences (could be flow or anti, depending on the sign of 'k').
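
To illustrate (just a sketch, not part of the test case above), the sign of k determines which kind of loop-carried dependence the compiler has to assume:

[cpp]// Sketch only: why the sign of k matters for  x[i] = x[i-k] + ...
//   k > 0 : x[i-k] was written by an earlier iteration  -> FLOW (true) dependence
//   k < 0 : x[i-k] will be overwritten by a later iteration -> ANTI dependence
// With k unknown at compile time, the compiler must assume both are possible.
void demo(long N, long k, double *x)
{
    for (long i = 2; i < N; i++)
        x[i] = x[i - k] + 1.0;   // dependence direction depends on the sign of k
}[/cpp]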

Dale


gudasergey
Beginner
It is a well-known trick: "parallelization of recurrent loops".
Let me explain. For simplicity, suppose a loop:
for (i=0; i<n; i++) x[i] = g(x[i-1], i);
where g is some function. The Intel compiler makes the substitution x[i] = g(x[i-1], i) = g(g(x[i-2], i-1), i) and considers two loops, one for even i and one for odd i:
for (i=0; i<n; i+=2) x[i] = g(g(x[i-2], i-1), i);

for (i=1; i<n; i+=2) x[i] = g(g(x[i-2], i-1), i);

These loops are independent of each other, and the compiler can run them in parallel!
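
For illustration, the split could be written by hand roughly like this (only a sketch; the function and parameter names are made up, g is a placeholder, and x[0], x[1] are assumed to be initialized already):

[cpp]// Sketch only: the even/odd split of x[i] = g(x[i-1], i) written by hand
// with OpenMP sections. The compiler's internal transformation may differ.
void split_by_parity(long n, double *x, double (*g)(double, long))
{
    #pragma omp parallel sections
    {
        // even chain: reads and writes only even elements of x
        #pragma omp section
        for (long i = 2; i < n; i += 2)
            x[i] = g(g(x[i - 2], i - 1), i);

        // odd chain: reads and writes only odd elements of x
        #pragma omp section
        for (long i = 3; i < n; i += 2)
            x[i] = g(g(x[i - 2], i - 1), i);
    }
}[/cpp]

Each chain recomputes the opposite-parity value inside the inner call to g, so it does twice the work, but the two chains never touch each other's elements and can run on two threads.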

For the case of arbitrary k:
for (i=0; i<n; i++) x[i] = g(x[i-k], i);
we will get k independent chains: the iterations with i%k=0, i%k=1, i%k=2, ..., i%k=k-1.

The question is: why does the Intel compiler do this for the complex loop with arbitrary k, but not for the simple one with k=2?

gudasergey
Beginner
I ran this program:

#include <iostream>
using namespace std;
#include <cmath>
#include <ctime>
#include <cstdio>
const long n=30000, m=2000;

double f(long i)
{
    double s=0;
    for (long j=0; j<m; j++) s += j*sin((double)i*j)*exp(cos((double)j));
    return cos(s);
}

int main()
{
    clock_t start, finish;
    double duration;
    double s, x[n];
    long N; long k;
    cout<<"Enter number k ";
    cin>>k;
    for (long i=0; i<n; i++) x[i] = sin((double)i+k-i*k*i);
    start = clock();
    for (long i=5; i<n; i++) x[i] = x[i-2]+f(i);
    finish = clock();
    s=0;
    for (long i=0; i<n; i++) s += x[i];
    duration = (double)(finish - start) / CLOCKS_PER_SEC;
    printf("%2.10f seconds\nSum: %2.10f\n", duration, s);

    char ccc;
    std::cin>>ccc;
    return 0;
}
I measured the time and watched the processor load in Task Manager. In the above example the processor load was 50% to 60%. But if you change x[i-2] to x[i-k] (and enter k=2), the processor load becomes 100% and the measured time drops by half. (I have a dual-core processor.)
Dale_S_Intel
Employee
Quoting - gudasergey

What options did you use to compile this file? When I build it with "-parallel", several loops are parallelized (including the one in f()), but the loop in question (with either x[i-2] or x[i-k]) is not. Perhaps the actual question is why it goes faster when you change "x[i-2]" to "x[i-k]"? Or do you have reason to believe (e.g. par-report, examination of the asm file) that it actually parallelized the loop in question?
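
For reference, the per-loop report can be produced with the same switches used earlier in this thread (the file name here is just a placeholder):

[cpp]>icl /c /Qparallel /Qpar-report3 main.cpp[/cpp]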

Dale

gudasergey
Beginner

Compiler options:

/c /O2 /Og /Ot /Qip /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /EHsc /MT /GS /arch:SSE /fp:fast /FAs /Fa"Release/" /Fo"Release/" /W3 /nologo /Wp64 /Zi /Qopenmp /Qparallel

I examined the asm file. It is quite complex. To understand it better, I decided to change
x[i]=x[i-k]+f(i);
to
x[i]=log10(abs(x[i-k]+f(i)));
The result was the same: the x[i-k] version is fast, but when I put x[i-2] instead of x[i-k], the program becomes twice as slow on my dual-core processor.

The asm file for the slow program with x[i-2] is seq_loop.asm, and for the program with x[i-k] it is parallel_loop.asm. From the asm file I took only the interesting loop:
for (long i=5; i<n; i++) x[i]=log10(abs(x[i-k]+f(i)));
The compiler inlined the code of function f into this loop, so the asm files also contain the loop:
for (long j=0; j<m; j++) s+=j*sin((double)i*j)*exp(cos((double)j));

As you can see from the asm code, in the program with x[i-2] neither the first nor the second loop is parallel.
But in the program with x[i-k], the compiler runs the loop
for (long j=0; j<m; j++) s+=j*sin((double)i*j)*exp(cos((double)j));
in parallel (it is surrounded by "call ___kmpc_serialized_parallel" and "call ___kmpc_end_serialized_parallel").
This is surprising to me. I thought it ran the loop
for (long i=5; i<n; i++) x[i]=log10(abs(x[i-k]+f(i)));
in parallel. I've checked and discovered that the Intel compiler can't parallelize recurrent loops :-(
The new question: why does the compiler not run the loop
for (long j=0; j<m; j++) s+=j*sin((double)i*j)*exp(cos((double)j));
in parallel when I put x[i-2] instead of x[i-k]?
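
One possible workaround (just a sketch, assuming an explicit pragma is acceptable since /Qopenmp is already in the options): mark the inner reduction loop in f parallel by hand, so it no longer depends on what /Qparallel decides:

[cpp]// Sketch only: f() from the test program with an explicit OpenMP reduction,
// so the inner loop is parallelized regardless of the auto-parallelizer's decision.
#include <cmath>

const long m = 2000;              // same constant as in the test program

double f(long i)
{
    double s = 0;
    #pragma omp parallel for reduction(+:s)
    for (long j = 0; j < m; j++)
        s += j * sin((double)i * j) * exp(cos((double)j));
    return cos(s);
}[/cpp]

The outer recurrence still runs serially, but the expensive inner sum is spread across the cores.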

Dale_S_Intel
Employee
Quoting - gudasergey

It is an interesting question. I'll look into it in a little more detail and see what I can figure out.

Dale
