The Intel compiler (for Windows, Professional v. 11.1.035) doesn't parallelize this loop:
for (long i=2;i
while it parallelizes a more complicated loop:
for (long i=2;i
where k is a number that I enter from the keyboard.
f is a simple function without side effects that takes a long time to compute.
Why does the compiler parallelize the complex loop but not the simple one?
[cpp]>type bug.c
int f(long i) { return i; }

foo(long N, long k, double *x)
{
  long i;
  for (i=2;i
=x[i-2]+f(i);
  }
}

>icl -c -Qparallel bug.c -Qpar-report3
Intel C++ Compiler Professional for applications running on Intel 64, Version 11.1 Build 20091012 Package ID: w_cproc_p_11.1.051
Copyright (C) 1985-2009 Intel Corporation. All rights reserved.

bug.c
procedure: f
procedure: f
procedure: foo
procedure: foo
bug.c(10): (col. 5) remark: loop was not parallelized: existence of parallel dependence.
bug.c(11): (col. 9) remark: parallel dependence: assumed FLOW dependence between x line 11 and x line 11.
>[/cpp]
I'd be curious to see a compilable test case where this loop was parallelized. If I change the '2' to a 'k' then it also fails to parallelize because of a number of possible dependences (could be flow or anti, depending on the sign of 'k').
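To make the FLOW dependence in that report concrete, here is a minimal sketch (not taken from the thread). It assumes the loop body is the recurrence x[i]=x[i-2]+f(i) with f(i) returning i, as in bug.c; the function names `fill_forward` and `fill_reverse` are made up for illustration. Running the same statements in a different iteration order changes the result, which is exactly what a loop-carried dependence means and why the compiler will not split the iterations across threads as they stand.

```c
/* Forward order: iteration i reads x[i-2], which iteration i-2 has
   already written. That read-after-write is the flow dependence. */
void fill_forward(double *x, long n)
{
    for (long i = 2; i < n; i++)
        x[i] = x[i - 2] + (double)i;   /* f(i) == i, as in bug.c */
}

/* Same statements, iterations visited from last to first.
   Now x[i-2] is read before it has been updated, so the values differ. */
void fill_reverse(double *x, long n)
{
    for (long i = n - 1; i >= 2; i--)
        x[i] = x[i - 2] + (double)i;
}
```

Starting from an all-zero array of six elements, the forward version should leave x[4] equal to 6 while the reversed version leaves it equal to 4.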
for (i=1;i
These loops are independent, and the compiler runs them in parallel!
For the case of arbitrary k:
for (i=0;i
we get k independent loops over i, one for each remainder: i%k=0, i%k=1, i%k=2, i%k=3, ..., i%k=k-1
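The splitting described above can be sketched as follows. This is only an illustration under assumptions: `f_stub` is a hypothetical stand-in for f, the names `recurrence_seq` and `recurrence_chains` are invented, k is assumed positive, and x[0..k-1] are assumed to be initialized. Each chain touches only the elements with one remainder i%k, so the outer loop over chains carries no dependence; the OpenMP pragma is one possible way to run it in parallel and is simply ignored when OpenMP is not enabled.

```c
#include <assert.h>

/* Hypothetical stand-in for the poster's f: any function without side effects. */
static double f_stub(long i) { return (double)i; }

/* Reference version: the recurrence run sequentially. */
void recurrence_seq(double *x, long n, long k)
{
    assert(k > 0);
    for (long i = k; i < n; i++)
        x[i] = x[i - k] + f_stub(i);
}

/* Same recurrence split into k chains, one per remainder r = i % k.
   Inside a chain the order is still sequential; the chains are independent. */
void recurrence_chains(double *x, long n, long k)
{
    assert(k > 0);
    #pragma omp parallel for
    for (long r = 0; r < k; r++)
        for (long i = r + k; i < n; i += k)
            x[i] = x[i - k] + f_stub(i);
}
```

Both functions perform the same additions on each element in the same order, so they should produce identical arrays for the same input.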
The question is: why does the Intel compiler handle the complex loop with arbitrary k, but not the simple one with k=2?
Compiler options:
/c /O2 /Og /Ot /Qip /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /EHsc /MT /GS /arch:SSE /fp:fast /FAs /Fa"Release/" /Fo"Release/" /W3 /nologo /Wp64 /Zi /Qopenmp /Qparallel
I examined the asm file. It is quite complex. To understand it better, I decided to change
x[i]=x[i-k]+f(i);
to
x[i]=log10(abs(x[i-k]+f(i)));
The result was the same: the example above is fast. But when I put x[i-2] instead of x[i-k], the program becomes twice as slow on my dual-core processor.
The asm files are seq_loop.asm for the slow program with x[i-2] and parallel_loop.asm for the one with x[i-k]. From each asm file I took only the interesting loop:
for (long i=5;i
The compiler inlined the code of function f into this loop, so the asm files also contain the loop:
for (long j=0;j
As you can see from the asm code, in the program with x[i-2] neither the first nor the second loop is parallel.
But in the program with x[i-k], the compiler runs the loop
for (long j=0;j
in parallel (it is surrounded by "call ___kmpc_serialized_parallel" and "call ___kmpc_end_serialized_parallel")
This surprised me. I thought it ran the loop
for (long i=5;i
in parallel. I've checked and discovered that the Intel compiler can't parallelize loops with a recurrence :-(
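The structure described here, a sequential outer recurrence with a parallel loop inside f, might look roughly like the sketch below. The real body of f is not shown in the thread, so `f_inner`, its parameter m, and the summation inside it are purely hypothetical, and k is assumed positive. The point is only that the j loop is a reduction with no dependence on x, which is why it can run in parallel while the i loop cannot.

```c
/* Hypothetical f: the poster's real body is not shown in the thread.
   The j loop is an independent reduction, so it can be run in parallel. */
double f_inner(long i, long m)
{
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)
    for (long j = 0; j < m; j++)
        s += (double)((i + j) % 7);
    return s;
}

/* The outer i loop carries the recurrence on x and stays sequential. */
void recurrence_inner_parallel(double *x, long n, long k, long m)
{
    for (long i = k; i < n; i++)
        x[i] = x[i - k] + f_inner(i, m);
}
```

With OpenMP disabled the pragma is ignored and the code runs sequentially with the same result.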
The new question: why does the compiler not parallelize the loop
for (long j=0;j
when I put x[i-2] instead of x[i-k]?