Re: Why the parallelization does not work?

xu_jocelyn · ‎01-05-2008

With IVF Compiler 10.1, A same program performs on IA32 architecture (Intel Pentium D CPU/ Ms Windows XP Home Edition) and Intel 64 architecture (Dual-Core AMD Opteron Processor 2214 / Ms Windows Server 2003 Enterprise x64 Edition SP2) respectively. Both projects are optimized with '/O3 /Og /QaxN /QxN /Qparallel' etc. Few loops in program main are auto-parallelized or vectorized.

Then, I add some '!DEC$ PARALLEL' directives before the appropriate loops, however, no otherinformation of 'auto-parallelized' appears, and the running efficiency does not improved in fact. Why the directive does not work?

Moreover, the dual-core or quad-core processors above seem not work, while the auto-parallelized program runs almost with only single core. Could the auto-parallelization fully use all cores of CPU? Or my settings have any mistakes? What is the best setting about above two systems?

Thanks!

TimP · ‎01-05-2008

You don't mention what results you got with -Qpar-threshold. Reducing the value from the default 100 generally produces more parallel loops, some with performance improvement, some not. Likewise, increasing the value set in -Qpar-report will give you the compilers comments about reasons for not parallelizing.
OpenMP directives generally give you better control than auto-parallelization. ifort generally places higher priority on vectorization than threading, since that is the best way to improve performance.
Without an example of what you are trying to do, I don't see how anyone could comment further.

xu_jocelyn · ‎01-06-2008

Thanks a lot.

According to your suggestion, after I have reducing the -Qpar-threshold value to 0, and increase -Qpar-report and -Qvec-report value to 3 and 5, respectively, the rich compilers comments show. There are many remarks: loop was not vectorized: unsupported loop structure. And one of the loops is as follow,

do ix=1,Nx
x(ix)=(ix-Nx/2)*dx
enddo

where, the Nx is a integer constant, dx is a real*8 constant, and x is a real*8 array. Why this kind of loop can not vectorized? Other remarks: loop was not vectorized: unsupported data type. Which kinds of data could vectorized?

There is some introduction of IVF compiler 10 about HPO (High Performance, Parallel Optimizer), which combines automatic vectorization, automatic parallelization and loop transformations into a single pass which is faster, more effective and more reliable than prior discrete phases. How could I use it? It seems not to be refered much in IVF Compiler Documentation...

About the final point in my former post,

'Moreover, the dual-core or quad-core processors above seem not work, while the auto-parallelized program runs almost with only single core. Could the auto-parallelization fully use all cores of CPU?... '.

Could all the cores be used synchronously when a parallelized program running,where theCPUutilization is not 25% or 50% but 100%?How to deal with it?

TimP · ‎01-06-2008

do ix=1,Nx
x(ix)=(ix-Nx/2)*dx
enddo
Vectorizability of this loop may depend on which architecture option you set. -QxT gives you the most chances for Core, not that I would recommend you alter your choice. Another, possibly better way (if you don't have accuracy issues here), is to change to:

double precision tmp
...
tmp = (1-Nx/2)*dx
!dir$ loop count(4000)
do ix=1,Nx
x(ix)=tmp
tmp=tmp+dx
enddo

Look up the loop count directive; it's purpose is to tell the compiler to optimize for specific ranges of loop count. If Nx, or size(x), are known at compile time, the compiler would take those as the default size for optimization. If the loop count is greater than 500000, cache bypass would be preferred. It might be better anyway, if this array isn't used soon after it is set. You could specify it directly by
!dir$ vector nontemporal
Presumably, the comment about vectorization would have been about mixed data types, or possibly "seems inefficient," either of which can be dealt with by these changes.

HPO is the default since ifort 10.

You still aren't clear on your final question. In case you mean "does a loop such as this benefit from threading, in addition to vectorization?" the answer might be "no, unless the loop count is extremely large, maybe not even then, on a single socket platform (on account of memory buss saturation)".
If you are interested in performance, you will need to perform detailed testing to find out whether aggressive parallelization by major reductions in par-threshold is helping.