Re: OpenMP & KAP preprocessor

denaro · ‎10-25-2004

Hi all,
I was wondering what happened to the KAP products since Intel acquired KAI. Does kf90 still exist for Intel CPUs?

I have my own CFD code that, thanks to my old KAP preprocessor running on a Compaq Alpha, is written with OpenMP directives. On the two-processor Alpha I get a very good speed-up (almost 1.8), but when I compile the code on a two-processor Xeon with the options:

-fpp -tpp7 -xN -axN -O3 -ipo -align -openmp

the parallel code runs slower than the sequential one! Should I conclude that the code generated by KAP on the Alpha does not meet the requirements of Intel CPUs?

This is my installation environment on the Xeon-based machine:

...........................
[les@venere provasor3d]$ uname -a
Linux venere 2.6.3-7mdk-p3-smp-64GB #1 SMP Wed Mar 17 15:34:39 CET 2004 i686 unknown unknown GNU/Linux
[les@venere provasor3d]$

...........................

[les@venere provasor3d]$ ifort -V
Intel Fortran Compiler for 32-bit applications, Version 8.0 Build 20031016Z Package ID: l_fc_p_8.0.035
Copyright (C) 1985-2003 Intel Corporation. All rights reserved.
...........................

Has anyone experienced something similar?
Thanks

Henry_G_Intel · ‎10-26-2004

Hello,
If your OpenMP code had a 1.8 speedup on a dual-Alpha, I would expect to see good speedup on a dual-Xeon. The -xN optimization could be interfering with parallelization. Try removing the -xN and -axN options (by the way, the -x and -ax options are mutually exclusive) and see if that helps parallelism at the expense of serial performance. Either way, I suggest contacting Intel Premier Support for tech support questions like this.

The Intel compilers print a message whenever a loop is vectorized or parallelized. What message does the compiler print for your source file when you compile with both -xN and -openmp?

The KAI KAP products are no longer for sale. However, the Intel compilers use the same OpenMP implementation as the old KAP products. In fact, the library is still called Guide.

Best regards,
Henry

denaro · ‎10-26-2004

Dear Henry,
thanks for the response. I will run all the tests without the options you indicated and let you know the results.

What is really strange is that ifort with the -parallel option does not recognize as parallelizable the same loops (written in the original serial Fortran code) that KAP automatically parallelizes on the Alpha. Therefore, I used the transformed sources (the *.cmp.f files) produced by KAP, which contain the OpenMP directives. I compiled them with the -openmp option, and ifort then recognizes the parallel regions. However, the execution times are greater than those of the sequential run.
Is it possible, with some ifort option, to obtain the transformed sources with OpenMP directives? I could then compare the two sources.
sincerely
Filippo

TimP · ‎10-26-2004

I'm curious about your notation that you are running the 64GB kernel. If you are needing that in order to run parallel, maybe the overhead of the PAE addressing is overcoming the benefit of parallelization. I have no experience with that, as usually we don't expect any chance of good 32-bit performance with more than a 4GB kernel.

Henry_G_Intel · ‎10-27-2004

Hi Filippo,

Unfortunately, there's no compiler option to output the transformed code from automatic parallelization.

Henry

denaro · ‎10-27-2004

Hi,
I see. I think Intel should add this option: everyone working on parallel code could then see the transformed source and possibly work on it.

However, I ran some more tests without the -xN option and the performance actually improved. The -O3 optimization does not influence the execution time. Still, the performance is far from what I get on the Alpha. Some of the difference is probably due to the 64-bit OS, but I expected the Xeon to do better.

Best Regards
Filippo

denaro · ‎10-28-2004

Dear all

after some work I have completed several tests of execution time on the two-processor Alpha and Xeon machines. I hope someone will be interested in reading the results and can give me some suggestions.

Rather than using my whole code, I concentrated the tests on a simple iterative solver for the linear system derived from the finite-difference discretization of a three-dimensional elliptic equation (I hope someone has experience with this). I attach the tar file with the Fortran sources and the .ini file, in case someone wants to repeat my experiments.
Again, the behaviour of the Intel compiler is somewhat strange: it cannot parallelize what I am sure can be parallelized, and what KAP actually does parallelize!
The reports of the tests follow, organized by machine (Xeon and Alpha).
I hope someone is interested in this topic.
With Regards
Filippo

#################### XEON ###################
_____________________________________________________________________________________

--- Execution times on Xeon two-processors.
The original sources (i.e. the *.f files) are compiled with ifort and auto-parallelizer:

[les@venere provasor3d]$ make
ifort -fpp -tpp7 -O3 -ipo -align -parallel -par_threshold0 -c -I/dati/provasor3d/ calcphi3d.f
ifort -I/dati/provasor3d/ -fpp -tpp7 -O3 -ipo -align -parallel -par_threshold0 provasor3d.f -o provasor3d calcphi3d.o
IPO: using IR for /home/les/tmp/ifortUPvFYg.o
IPO: using IR for calcphi3d.o
IPO: performing multi-file optimizations
provasor3d.f(127) : (col. 2) remark: LOOP WAS AUTO-PARALLELIZED.
provasor3d.f(127) : (col. 26) remark: LOOP WAS AUTO-PARALLELIZED.
[les@venere provasor3d]$

1) sequential run:
with nsc=1 in the file sor.ini, the sequential subroutine 2 is called.
ifort is unable to parallelize it (as is KAP, because of dependences), i.e. there are no
parallel directives.

[les@venere provasor3d]$ time -p provasor3d
nx , ny , nz = 40 40 40
n , m , l = 41 42 41
1 - SOR sequenziale con coef.matrice scalari;
2 - SOR sequenziale con coef.matrice vettori;
3 - SOR parallelo (i,j - b/r) (k- tutti) - coef.scalari
4 - SOR parallelo (i,j - b/r) (k- tutti) - coef.vettori
5 - SOR parallelo (i,j - b/r) (k- zebra) - coef.scalari
6 - SOR parallelo (i,j - b/r) (k- zebra) - coef.vettori
Si sceglie: 1
SOR sequenziale - coef. scalari x, z
Chiamata routine 2
N. iterazioni per convergenza ellittica: 4847
real 7.93
user 7.68
sys 0.24

2) parallel run:
with nsc=5 in the file sor.ini, subroutine 9 is called.
KAP was able to parallelize it (thanks to the white/black-red coloring), but ifort CANNOT!

[les@venere provasor3d]$ time -p provasor3d
nx , ny , nz = 40 40 40
n , m , l = 41 42 41
1 - SOR sequenziale con coef.matrice scalari;
2 - SOR sequenziale con coef.matrice vettori;
3 - SOR parallelo (i,j - b/r) (k- tutti) - coef.scalari
4 - SOR parallelo (i,j - b/r) (k- tutti) - coef.vettori
5 - SOR parallelo (i,j - b/r) (k- zebra) - coef.scalari
6 - SOR parallelo (i,j - b/r) (k- zebra) - coef.vettori
Si sceglie: 5
SOR parallelo (i,j - b/r) (k-zebra) -coef.scalari x,y
Chiamata routine 9
N. iterazioni per convergenza ellittica: 4845
real 9.54
user 9.30
sys 0.24

__________________________________________________________________________________________

--- Execution times on Xeon two-processors.
The sources generated from KAP (i.e. the *.cmp.f files) are compiled with ifort without
auto-parallelizer but with the -openmp option:

[les@venere provasor3d]$ make
ifort -fpp -tpp7 -O3 -ipo -align -openmp -openmp -c -I/dati/provasor3d/ calcphi3d.cmp.f
ifort -I/dati/provasor3d/ -fpp -tpp7 -O3 -ipo -align -openmp provasor3d.cmp.f -o provasor3d calcphi3d.cmp.o
IPO: using IR for /home/les/tmp/ifortMhqYPF.o
IPO: using IR for calcphi3d.cmp.o
IPO: performing multi-file optimizations
provasor3d.cmp.f(159) : (col. 6) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
provasor3d.cmp.f(156) : (col. 6) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
provasor3d.cmp.f(234) : (col. 6) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
provasor3d.cmp.f(231) : (col. 6) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
provasor3d.cmp.f(332) : (col. 6) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
provasor3d.cmp.f(331) : (col. 6) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
provasor3d.cmp.f(415) : (col. 6) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
provasor3d.cmp.f(413) : (col. 6) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
provasor3d.cmp.f(442) : (col. 6) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
provasor3d.cmp.f(440) : (col. 6) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
provasor3d.cmp.f(468) : (col. 6) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
provasor3d.cmp.f(466) : (col. 6) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
calcphi3d.cmp.f(4814) : (col. 6) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
calcphi3d.cmp.f(4808) : (col. 6) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
... truncated for the sake of brevity ...
... .... ....
calcphi3d.cmp.f(4045) : (col. 6) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
[les@venere provasor3d]$

3) same call as 1):
with nsc=1 in the file sor.ini, the sequential subroutine 2 is called

[les@venere provasor3d]$ time -p provasor3d
nx , ny , nz = 40 40 40
n , m , l = 41 42 41
1 - SOR sequenziale con coef.matrice scalari;
2 - SOR sequenziale con coef.matrice vettori;
3 - SOR parallelo (i,j - b/r) (k- tutti) - coef.scalari
4 - SOR parallelo (i,j - b/r) (k- tutti) - coef.vettori
5 - SOR parallelo (i,j - b/r) (k- zebra) - coef.scalari
6 - SOR parallelo (i,j - b/r) (k- zebra) - coef.vettori
Si sceglie: 1
SOR sequenziale - coef. scalari x, z
Chiamata routine 2
N. iterazioni per convergenza ellittica: 4847
real 7.14
user 7.07
sys 0.23

4) same call as 2):
with nsc=5 in the file sor.ini, subroutine 9 is called,
with the white/black-red coloring and OpenMP directives

[les@venere provasor3d]$ time -p provasor3d
nx , ny , nz = 40 40 40
n , m , l = 41 42 41
1 - SOR sequenziale con coef.matrice scalari;
2 - SOR sequenziale con coef.matrice vettori;
3 - SOR parallelo (i,j - b/r) (k- tutti) - coef.scalari
4 - SOR parallelo (i,j - b/r) (k- tutti) - coef.vettori
5 - SOR parallelo (i,j - b/r) (k- zebra) - coef.scalari
6 - SOR parallelo (i,j - b/r) (k- zebra) - coef.vettori
Si sceglie: 5
SOR parallelo (i,j - b/r) (k-zebra) -coef.scalari x,y
Chiamata routine 9
N. iterazioni per convergenza ellittica: 4845
real 6.43
user 12.20
sys 0.33

################# Alpha ###############
_____________________________________________________________________________________

--- Execution times on Alpha two-processors compiled with KAP and auto-parallelizer:

kf90 -I/utenti/denaro/provasor3d/ -fkapargs='-conc' -O5 -omp -fast -tune ev6 -arch host -assume nounderscore provasor3d.f -o provasor3d calcphi3d.o
KAP/Tru64_U_F90 4.4 k340504 20010517 28-Oct-2004 11:18:27
KAP/Tru64_U_F90 4.4 k340504 20010517 : 0 errors in file provasor3d.f

1) sequential run:
with nsc=1 in the file sor.ini, the sequential subroutine 2 is called.
KAP is unable to parallelize it (because of dependences), i.e. there are no
OpenMP directives.

ds20dia.ing.unina2.it> time provasor3d
nx , ny , nz = 40 40 40
n , m , l = 41 42 41
1 - SOR sequenziale con coef.matrice scalari;
2 - SOR sequenziale con coef.matrice vettori;
3 - SOR parallelo (i,j - b/r) (k- tutti) - coef.scalari
4 - SOR parallelo (i,j - b/r) (k- tutti) - coef.vettori
5 - SOR parallelo (i,j - b/r) (k- zebra) - coef.scalari
6 - SOR parallelo (i,j - b/r) (k- zebra) - coef.vettori
Si sceglie: 1
SOR sequenziale - coef. scalari x, z
Chiamata routine 2
N. iterazioni per convergenza ellittica: 4847

real 16.8
user 16.1
sys 0.7

2) parallel run:
with nsc=5 in the file sor.ini, the parallel subroutine 9 is called.
KAP is able to parallelize it (thanks to the white/black-red coloring), i.e. there are
OpenMP directives inserted in the source.

ds20dia.ing.unina2.it> time provasor3d
nx , ny , nz = 40 40 40
n , m , l = 41 42 41
1 - SOR sequenziale con coef.matrice scalari;
2 - SOR sequenziale con coef.matrice vettori;
3 - SOR parallelo (i,j - b/r) (k- tutti) - coef.scalari
4 - SOR parallelo (i,j - b/r) (k- tutti) - coef.vettori
5 - SOR parallelo (i,j - b/r) (k- zebra) - coef.scalari
6 - SOR parallelo (i,j - b/r) (k- zebra) - coef.vettori
Si sceglie: 5
SOR parallelo (i,j - b/r) (k-zebra) -coef.scalari x,y
Chiamata routine 9
N. iterazioni per convergenza ellittica: 4845

real 7.2
user 12.4
sys 0.7

__________________________________________________________________________________________

--- Execution times on Alpha two-processors compiled with KAP without auto-parallelizer:

kf90 -O5 -fast -tune ev6 -arch host -assume nounderscore -c -I/utenti/denaro/provasor3d/ calcphi3d.f
KAP/Tru64_U_F90 4.4 k340504 20010517 28-Oct-2004 11:46:54
KAP/Tru64_U_F90 4.4 k340504 20010517 : 0 errors in file calcphi3d.f
kf90 -I/utenti/denaro/provasor3d/ -O5 -fast -tune ev6 -arch host -assume nounderscore provasor3d.f -o provasor3d calcphi3d.o
KAP/Tru64_U_F90 4.4 k340504 20010517 28-Oct-2004 11:47:19
KAP/Tru64_U_F90 4.4 k340504 20010517 : 0 errors in file provasor3d.f

3) same call as 1):
with nsc=1 in the file sor.ini, subroutine 2 is called

ds20dia.ing.unina2.it> time provasor3d
nx , ny , nz = 40 40 40
n , m , l = 41 42 41
1 - SOR sequenziale con coef.matrice scalari;
2 - SOR sequenziale con coef.matrice vettori;
3 - SOR parallelo (i,j - b/r) (k- tutti) - coef.scalari
4 - SOR parallelo (i,j - b/r) (k- tutti) - coef.vettori
5 - SOR parallelo (i,j - b/r) (k- zebra) - coef.scalari
6 - SOR parallelo (i,j - b/r) (k- zebra) - coef.vettori
Si sceglie: 1
SOR sequenziale - coef. scalari x, z
Chiamata routine 2
N. iterazioni per convergenza ellittica: 4847

real 16.5
user 15.9
sys 0.6

4) same call as 2):
with nsc=5 in the file sor.ini, subroutine 9 is called,
with the white/black-red coloring but without parallel loops

ds20dia.ing.unina2.it> time provasor3d
nx , ny , nz = 40 40 40
n , m , l = 41 42 41
1 - SOR sequenziale con coef.matrice scalari;
2 - SOR sequenziale con coef.matrice vettori;
3 - SOR parallelo (i,j - b/r) (k- tutti) - coef.scalari
4 - SOR parallelo (i,j - b/r) (k- tutti) - coef.vettori
5 - SOR parallelo (i,j - b/r) (k- zebra) - coef.scalari
6 - SOR parallelo (i,j - b/r) (k- zebra) - coef.vettori
Si sceglie: 5
SOR parallelo (i,j - b/r) (k-zebra) -coef.scalari x,y
Chiamata routine 9
N. iterazioni per convergenza ellittica: 4845

real 10.7
user 10.1
sys 0.6
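
As a reading aid for the timings above, the wall-clock speed-ups they imply can be worked out with a few lines of Fortran. This is only a sketch: the numbers are the "real" times copied from the runs above, the variable names are made up for the illustration, and using the -parallel build of routine 9 on the Xeon as its serial baseline is an assumption (ifort left that routine serial in that build, but other parts of the program were auto-parallelized).

```fortran
program speedup_summary
  implicit none
  ! Wall-clock ("real") seconds copied from the runs reported above.
  real, parameter :: xeon_seq    = 7.14  ! Xeon, -openmp build, routine 2 (sequential)
  real, parameter :: xeon_r9_1t  = 9.54  ! Xeon, -parallel build, routine 9 left serial
  real, parameter :: xeon_r9_2t  = 6.43  ! Xeon, -openmp build, routine 9, two threads
  real, parameter :: alpha_seq   = 16.5  ! Alpha, no -omp, routine 2 (sequential)
  real, parameter :: alpha_r9_1t = 10.7  ! Alpha, no -omp, routine 9
  real, parameter :: alpha_r9_2t = 7.2   ! Alpha, -omp, routine 9, two threads

  print '(a)', 'Speed-up = serial wall-clock time / two-thread wall-clock time'
  ! Same algorithm (routine 9), one thread versus two threads.
  print '(a,f5.2)', 'Xeon,  routine 9 serial vs. routine 9 OpenMP: ', &
       xeon_r9_1t / xeon_r9_2t
  print '(a,f5.2)', 'Alpha, routine 9 serial vs. routine 9 OpenMP: ', &
       alpha_r9_1t / alpha_r9_2t
  ! Best sequential routine (routine 2) versus the parallel routine 9.
  print '(a,f5.2)', 'Xeon,  routine 2 serial vs. routine 9 OpenMP: ', &
       xeon_seq / xeon_r9_2t
  print '(a,f5.2)', 'Alpha, routine 2 serial vs. routine 9 OpenMP: ', &
       alpha_seq / alpha_r9_2t
end program speedup_summary
```

Read this way, routine 9 gains a similar factor from the second thread on both machines (roughly 1.5). The larger difference is in the serial baselines: on the Xeon the serial routine 9 is slower than routine 2, while on the Alpha it is faster.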

Henry_G_Intel · ‎10-28-2004

Hello Filippo,

The compiler will give a report showing which loops were parallelized and explanations why some loops could not be parallelized. Just use the -par_report option with the -parallel option.

I encourage you to submit a feature request to Premier Support saying that you would like an option to get the transformed code from automatic parallelization.

Best regards,

Henry

denaro · ‎10-28-2004

Dear Henry,
I already used the -par_report option, and I see that the loops are not considered candidates for parallelization. On the other hand, I wrote the code in such a way that the dependences are eliminated, and KAP indeed recognizes this and parallelizes the loops. I am unable to understand the problem.
This is the report for the subroutine that ifort cannot parallelize:

[les@venere ocean]$ ifort -c -O3 -parallel -par_report3 calcphi.f
procedure: calcphi_seq_brij_zebrak_scalxz_ndy_perxz
serial loop: line 42: not a parallel candidate due to missing zero-trip test
serial loop: line 70: not a parallel candidate due to missing zero-trip test
serial loop: line 91: not a parallel candidate due to missing zero-trip test
serial loop: line 112: not a parallel candidate due to missing zero-trip test
serial loop: line 145: not a parallel candidate due to missing zero-trip test
serial loop: line 275: not a parallel candidate due to missing zero-trip test
serial loop: line 304: not a parallel candidate due to missing zero-trip test
serial loop: line 326: not a parallel candidate due to missing zero-trip test
serial loop: line 348: not a parallel candidate due to missing zero-trip test
serial loop: line 356: not a parallel candidate due to missing zero-trip test
serial loop: line 517: not a parallel candidate due to missing zero-trip test
serial loop: line 356: not a parallel candidate due to missing zero-trip test
serial loop: line 356: not a parallel candidate due to missing zero-trip test
serial loop: line 590: not a parallel candidate due to missing zero-trip test
serial loop: line 33: not a parallel candidate due to the loop being lexically discontinuous
serial loop: line 150
anti data dependence assumed from line 160 to line 168, due to "er"
anti data dependence assumed from line 160 to line 178, due to "er"
output data dependence assumed from line 160 to line 168, due to "er"
output data dependence assumed from line 160 to line 178, due to "er"
flow data dependence assumed from line 160 to line 168, due to "er"
flow data dependence assumed from line 160 to line 178, due to "er"
anti data dependence assumed from line 168 to line 160, due to "er"
anti data dependence assumed from line 168 to line 168, due to "er"
anti data dependence assumed from line 168 to line 178, due to "er"
output data dependence assumed from line 168 to line 160, due to "er"
output data dependence assumed from line 168 to line 168, due to "er"
output data dependence assumed from line 168 to line 178, due to "er"
flow data dependence assumed from line 168 to line 160, due to "er"
flow data dependence assumed from line 168 to line 168, due to "er"
flow data dependence assumed from line 168 to line 178, due to "er"
anti data dependence assumed from line 178 to line 160, due to "er"
anti data dependence assumed from line 178 to line 168, due to "er"
output data dependence assumed from line 178 to line 160, due to "er"
output data dependence assumed from line 178 to line 168, due to "er"
flow data dependence assumed from line 178 to line 160, due to "er"
flow data dependence assumed from line 178 to line 168, due to "er"
serial loop: line 183
anti data dependence assumed from line 193 to line 193, due to "er"
anti data dependence assumed from line 193 to line 203, due to "er"
output data dependence assumed from line 193 to line 193, due to "er"
output data dependence assumed from line 193 to line 203, due to "er"
flow data dependence assumed from line 193 to line 193, due to "er"
flow data dependence assumed from line 193 to line 203, due to "er"
anti data dependence assumed from line 203 to line 193, due to "er"
output data dependence assumed from line 203 to line 193, due to "er"
flow data dependence assumed from line 203 to line 193, due to "er"
serial loop: line 207
anti data dependence assumed from line 217 to line 217, due to "er"
anti data dependence assumed from line 217 to line 227, due to "er"
output data dependence assumed from line 217 to line 217, due to "er"
output data dependence assumed from line 217 to line 227, due to "er"
flow data dependence assumed from line 217 to line 217, due to "er"
flow data dependence assumed from line 217 to line 227, due to "er"
anti data dependence assumed from line 227 to line 217, due to "er"
output data dependence assumed from line 227 to line 217, due to "er"
flow data dependence assumed from line 227 to line 217, due to "er"
serial loop: line 232
anti data dependence assumed from line 242 to line 250, due to "er"
anti data dependence assumed from line 242 to line 260, due to "er"
output data dependence assumed from line 242 to line 250, due to "er"
output data dependence assumed from line 242 to line 260, due to "er"
flow data dependence assumed from line 242 to line 250, due to "er"
flow data dependence assumed from line 242 to line 260, due to "er"
anti data dependence assumed from line 250 to line 242, due to "er"
anti data dependence assumed from line 250 to line 250, due to "er"
anti data dependence assumed from line 250 to line 260, due to "er"
output data dependence assumed from line 250 to line 242, due to "er"
output data dependence assumed from line 250 to line 250, due to "er"
output data dependence assumed from line 250 to line 260, due to "er"
flow data dependence assumed from line 250 to line 242, due to "er"
flow data dependence assumed from line 250 to line 250, due to "er"
flow data dependence assumed from line 250 to line 260, due to "er"
anti data dependence assumed from line 260 to line 242, due to "er"
anti data dependence assumed from line 260 to line 250, due to "er"
output data dependence assumed from line 260 to line 242, due to "er"
output data dependence assumed from line 260 to line 250, due to "er"
flow data dependence assumed from line 260 to line 242, due to "er"
flow data dependence assumed from line 260 to line 250, due to "er"
serial loop: line 392
anti data dependence assumed from line 402 to line 410, due to "er"
anti data dependence assumed from line 402 to line 420, due to "er"
output data dependence assumed from line 402 to line 410, due to "er"
output data dependence assumed from line 402 to line 420, due to "er"
flow data dependence assumed from line 402 to line 410, due to "er"
flow data dependence assumed from line 402 to line 420, due to "er"
anti data dependence assumed from line 410 to line 402, due to "er"
anti data dependence assumed from line 410 to line 410, due to "er"
anti data dependence assumed from line 410 to line 420, due to "er"
output data dependence assumed from line 410 to line 402, due to "er"
output data dependence assumed from line 410 to line 410, due to "er"
output data dependence assumed from line 410 to line 420, due to "er"
flow data dependence assumed from line 410 to line 402, due to "er"
flow data dependence assumed from line 410 to line 410, due to "er"
flow data dependence assumed from line 410 to line 420, due to "er"
anti data dependence assumed from line 420 to line 402, due to "er"
anti data dependence assumed from line 420 to line 410, due to "er"
output data dependence assumed from line 420 to line 402, due to "er"
output data dependence assumed from line 420 to line 410, due to "er"
flow data dependence assumed from line 420 to line 402, due to "er"
flow data dependence assumed from line 420 to line 410, due to "er"
serial loop: line 425
anti data dependence assumed from line 435 to line 435, due to "er"
anti data dependence assumed from line 435 to line 445, due to "er"
output data dependence assumed from line 435 to line 435, due to "er"
output data dependence assumed from line 435 to line 445, due to "er"
flow data dependence assumed from line 435 to line 435, due to "er"
flow data dependence assumed from line 435 to line 445, due to "er"
anti data dependence assumed from line 445 to line 435, due to "er"
output data dependence assumed from line 445 to line 435, due to "er"
flow data dependence assumed from line 445 to line 435, due to "er"
serial loop: line 449
anti data dependence assumed from line 459 to line 459, due to "er"
anti data dependence assumed from line 459 to line 469, due to "er"
output data dependence assumed from line 459 to line 459, due to "er"
output data dependence assumed from line 459 to line 469, due to "er"
flow data dependence assumed from line 459 to line 459, due to "er"
flow data dependence assumed from line 459 to line 469, due to "er"
anti data dependence assumed from line 469 to line 459, due to "er"
output data dependence assumed from line 469 to line 459, due to "er"
flow data dependence assumed from line 469 to line 459, due to "er"
serial loop: line 474
anti data dependence assumed from line 484 to line 492, due to "er"
anti data dependence assumed from line 484 to line 502, due to "er"
output data dependence assumed from line 484 to line 492, due to "er"
output data dependence assumed from line 484 to line 502, due to "er"
flow data dependence assumed from line 484 to line 492, due to "er"
flow data dependence assumed from line 484 to line 502, due to "er"
anti data dependence assumed from line 492 to line 484, due to "er"
anti data dependence assumed from line 492 to line 492, due to "er"
anti data dependence assumed from line 492 to line 502, due to "er"
output data dependence assumed from line 492 to line 484, due to "er"
output data dependence assumed from line 492 to line 492, due to "er"
output data dependence assumed from line 492 to line 502, due to "er"
flow data dependence assumed from line 492 to line 484, due to "er"
flow data dependence assumed from line 492 to line 492, due to "er"
flow data dependence assumed from line 492 to line 502, due to "er"
anti data dependence assumed from line 502 to line 484, due to "er"
anti data dependence assumed from line 502 to line 492, due to "er"
output data dependence assumed from line 502 to line 484, due to "er"
output data dependence assumed from line 502 to line 492, due to "er"
flow data dependence assumed from line 502 to line 484, due to "er"
flow data dependence assumed from line 502 to line 492, due to "er"

denaro · ‎10-28-2004

I forgot: when I instead use the Fortran subroutine parallelized by KAP with the OpenMP directives and compile with the -openmp option, the activity of the two CPUs confirms that the code runs with two threads, each on its own processor:

top - 21:51:55 up 29 days, 2:44, 2 users, load average: 1.31, 0.52, 0.42
Tasks: 62 total, 2 running, 60 sleeping, 0 stopped, 0 zombie
Cpu0 : 99.7% us, 0.3% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu1 : 100.0% us, 0.0% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu2 : 0.0% us, 0.0% sy, 0.0% ni, 100.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu3 : 0.0% us, 0.0% sy, 0.0% ni, 100.0% id, 0.0% wa, 0.0% hi, 0.0% si
Mem: 3106328k total, 2433712k used, 672616k free, 127052k buffers
Swap: 6193016k total, 10172k used, 6182844k free, 1664648k cached
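
The same information can be checked from inside the program. The sketch below is a stand-alone test program, not part of the solver, and it uses only the standard OpenMP runtime routines from the omp_lib module. It has to be built with the compiler's OpenMP option (-openmp for the ifort version used here).

```fortran
program show_threads
  use omp_lib   ! standard OpenMP runtime module; requires an OpenMP build
  implicit none
  integer :: nthreads

  nthreads = 1
  !$omp parallel
  ! Only one thread of the team records the team size.
  !$omp single
  nthreads = omp_get_num_threads()
  !$omp end single
  !$omp end parallel

  print '(a,i0)', 'processors seen by the OpenMP runtime: ', omp_get_num_procs()
  print '(a,i0)', 'threads used in the parallel region:   ', nthreads
end program show_threads
```

The standard environment variable OMP_NUM_THREADS sets the team size if it needs to be fixed explicitly. top lists four CPUs on this machine while only two are busy, so it may be worth confirming how many threads the runtime actually starts.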

bye
Filippo

ClayB · ‎11-03-2004

Filippo -

Do you get any better parallelization results from some of the loops if you put in the compiler directive to ignore dependencies? This would be...

Code:

!DEC$ IVDEP

See pages 14-23 to 14-25 of the Intel Fortran Language Reference for more details on the directive.

Since the messages you are getting from the compiler are assuming that there is a dependency (and both you and KAP know that this is not a true dependence), the above directive may be enough to cue the Intel compilers to that same fact.
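
For placement, here is a minimal sketch with a made-up loop (not taken from Filippo's source). The directive goes immediately before the DO loop it refers to. IVDEP is documented as a hint about assumed vector dependences, so whether it also changes what -parallel does with a given loop is something to confirm with -par_report.

```fortran
program ivdep_placement
  implicit none
  integer, parameter :: n = 8
  integer :: idx(n), i
  real :: a(n), b(n)

  ! idx is a permutation, so no two iterations touch the same element of a.
  ! The compiler cannot prove that and has to assume a dependence.
  do i = 1, n
     idx(i) = n + 1 - i
     a(i) = 0.0
     b(i) = real(i)
  end do

  ! The directive applies to the DO loop that immediately follows it.
  ! Compilers that do not know it treat the line as an ordinary comment.
!DEC$ IVDEP
  do i = 1, n
     a(idx(i)) = a(idx(i)) + b(i)
  end do

  ! Self-check: element j of a must now hold b(n+1-j).
  do i = 1, n
     if (a(i) /= b(n + 1 - i)) stop 'unexpected result'
  end do
  print '(8f5.1)', a
end program ivdep_placement
```

The directive is a promise from the programmer: if the loop really does carry a dependence, the compiler is allowed to generate wrong code.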

I realize that it will be a pain to put in directives by hand when a tool (KAP) can see through the dependence automatically. There will always be situations where this will be the case. Automatic detection of safe loops will never beat old-fashioned human inspection and knowledge.
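
As an illustration of doing it by hand, here is a sketch of a red-black sweep with an explicit OpenMP directive. It rests on an assumption: that "er" in the report is a scalar accumulating the residual over the loop, which is the kind of variable that has to be declared as a reduction. The 2D grid, the names and the relaxation factor are all invented for the example and are not Filippo's routine.

```fortran
program redblack_reduction
  implicit none
  integer, parameter :: n = 64
  integer, parameter :: dp = kind(1.0d0)
  real(dp) :: phi(0:n+1, 0:n+1), er, gs, omega
  integer :: i, j, color, it

  phi = 0.0_dp
  phi(:, n+1) = 1.0_dp      ! one boundary held at 1, the others at 0
  omega = 1.5_dp            ! over-relaxation factor (illustrative value)

  do it = 1, 200
     er = 0.0_dp
     do color = 0, 1
        ! Points of one color read only points of the other color, so the
        ! j loop carries no dependence apart from the running sum in er.
        !$omp parallel do private(i, gs) reduction(+:er)
        do j = 1, n
           do i = 1 + mod(j + color, 2), n, 2
              gs = 0.25_dp * (phi(i-1,j) + phi(i+1,j) + phi(i,j-1) + phi(i,j+1))
              er = er + abs(gs - phi(i,j))
              phi(i,j) = phi(i,j) + omega * (gs - phi(i,j))
           end do
        end do
        !$omp end parallel do
     end do
  end do

  print '(a,es10.3)', 'sum of |corrections| in the last sweep: ', er
end program redblack_reduction
```

The reduction clause gives each thread its own copy of er and adds the copies together at the end of the loop, which removes the anti, output and flow dependences that an analysis of a shared scalar would otherwise report.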

--clay

denaro · ‎11-03-2004

Dear Clay

many thanks for your suggestion. I will try it and look at the performance again.
As a matter of fact, I am used to reading the code transformed by KAP as a help in programming the original source. When KAP fails, I know that I have to work on that part of the code to eliminate the dependencies (when possible).
For example, I am quite sure that one of the problems degrading my performance on the Xeon is that KAP interchanges loops. I originally wrote the loops in k,j,i order for the 3D arrays to optimize cache use, but KAP swaps the j and k loops. This does not degrade performance on the Alpha, thanks to its larger L2 cache, but it may have some impact on the Xeon and affect parallel performance. Is that right?
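
To measure this in isolation, a small test along the following lines could be used. It is only a sketch with an invented array and a plain sum, not the solver: it runs the same loop body once in k,j,i nesting and once with the j and k loops swapped, and times both. An optimizing compiler may itself reorder such simple loops, so the timings are only indicative.

```fortran
program loop_order
  implicit none
  integer, parameter :: n = 96
  real, allocatable :: a(:,:,:)
  integer :: i, j, k, c0, c1, c2, rate
  real :: s_kji, s_jki

  allocate(a(n, n, n))
  a = 1.0

  call system_clock(c0, rate)
  s_kji = 0.0
  do k = 1, n          ! original nesting: k outermost, i innermost
     do j = 1, n
        do i = 1, n
           s_kji = s_kji + a(i, j, k)
        end do
     end do
  end do
  call system_clock(c1)
  s_jki = 0.0
  do j = 1, n          ! j and k swapped: the middle loop now jumps n*n elements
     do k = 1, n
        do i = 1, n
           s_jki = s_jki + a(i, j, k)
        end do
     end do
  end do
  call system_clock(c2)

  ! Both nestings visit the same elements, so the sums must agree.
  if (s_kji /= s_jki) stop 'sums differ'
  print '(a,f8.4,a)', 'k,j,i nesting: ', real(c1 - c0) / real(max(rate, 1)), ' s'
  print '(a,f8.4,a)', 'j,k,i nesting: ', real(c2 - c1) / real(max(rate, 1)), ' s'
end program loop_order
```

Fortran stores arrays with the first index varying fastest, so in both versions the innermost i loop still walks through contiguous memory. For a plain sum the difference may therefore be small; a stencil that also reads the j and k neighbours is more sensitive to the order of the outer loops.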

I think this way of working is the best way to combine the human factor with auto-parallelization software. Unfortunately, Intel acquired the KAP product and killed it.

best regards
Filippo