Auto-parallelization - no effect

vladimir_t1 · 09-17-2006

Hi,

I have a very complex Fortran numerical simulation program. I run it on a system with a Core 2 Duo E6600, using Intel Fortran 9.1 (64-bit) on Windows XP 64-bit.

I see absolutely no difference in performance whether I compile the code with /Qparallel or without it. With /Qparallel, processor utilization is 100%, which means both cores are working, yet there is no speedup. Why is that?

Apart from that, I use the options /O3 /Qipo /QxT.

Thanks for any help or suggestions.

Vladimir

TimP · 09-17-2006

Any of the usual questions about OpenMP parallel performance might apply, and you have given no information to narrow it down. If your parallel loops are such that schedule(guided) or the like is needed, you would have to specify it with OpenMP directives, or wait for ifort 10.0 to do it within auto-parallelization.
With typical vectorized code, which depends on both CPU and bus performance, I expect about a 50% performance increase from parallelism on Core 2 Duo. It is possible for code with little cache locality to saturate the front-side bus with one thread. That could happen sooner with slow RAM and a fast CPU. Even some 2.13 GHz Core 2 Duo systems use DDR2-667, so fast RAM is evidently not considered excessive even for the slower CPUs.
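
To illustrate the schedule(guided) point, here is a minimal sketch, not taken from the original program. The program name, array names, and problem size are made up for illustration. It assumes a triangular loop nest, where later iterations do more work than earlier ones, which is one common case where the default even split of iterations leaves one thread idle while the other finishes.

```fortran
program guided_schedule
  implicit none
  integer, parameter :: n = 2000          ! hypothetical problem size
  double precision, allocatable :: a(:,:), colsum(:)
  integer :: i, j

  allocate(a(n, n), colsum(n))
  a = 1.0d0

  ! Triangular loop nest: column j costs j inner iterations, so an even
  ! split of j across two threads gives the second thread about 3/4 of
  ! the work. schedule(guided) hands out progressively smaller chunks,
  ! so a thread that finishes early picks up more.
  !$omp parallel do schedule(guided) private(i)
  do j = 1, n
     colsum(j) = 0.0d0
     do i = 1, j
        colsum(j) = colsum(j) + a(i, j)
     end do
  end do
  !$omp end parallel do

  print *, 'colsum(1) =', colsum(1), '  colsum(n) =', colsum(n)
end program guided_schedule
```

The directives are only honored when OpenMP is enabled at compile time (/Qopenmp with ifort on Windows, -fopenmp with gfortran); without it they are treated as comments and the loop runs serially.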

vladimir_t1 · 09-18-2006

Thank you. Unfortunately, I cannot modify the code of this simulation. It is too complex and I am not the author. So, how long will it be until ifort 10.0 is released?

Separately, I have found something interesting: the compiled code is fastest with the options /Qipo /QxW /O3.

Steven_L_Intel1 · 09-18-2006

There is no "magic bullet" here. Tim's mention of "10.0" is about work we're doing for a future release to improve auto-parallel for some types of applications, but there's no guarantee that your application would benefit and it's not going to appear until sometime next year.

You should turn on the optimization reports and see which loops are parallelized, which are not, and, for those that are not, why. The Optimizing Applications manual has a lot of helpful information.

In general, auto-parallel won't give you as good results as "directed parallelization" using OpenMP. This requires that you understand your application and properly add the right directives. Using Intel VTune to analyze the performance would also be enlightening.
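
As a rough sketch of what "directed parallelization" looks like in the simplest case, consider a dot product. The program and variable names are hypothetical and have nothing to do with the simulation in question. The point is that the programmer, not the compiler, asserts that the iterations are independent apart from the running sum, which the reduction clause handles.

```fortran
program directed_parallel
  implicit none
  integer, parameter :: n = 1000000       ! hypothetical vector length
  double precision, allocatable :: x(:), y(:)
  double precision :: dot
  integer :: i

  allocate(x(n), y(n))
  x = 1.0d0
  y = 2.0d0
  dot = 0.0d0

  ! The directive tells the compiler to split the loop across threads.
  ! reduction(+:dot) gives each thread a private partial sum and adds
  ! the partial sums together when the loop ends.
  !$omp parallel do reduction(+:dot)
  do i = 1, n
     dot = dot + x(i) * y(i)
  end do
  !$omp end parallel do

  print *, 'dot =', dot
end program directed_parallel
```

Build with OpenMP enabled (/Qopenmp for ifort on Windows, -fopenmp for gfortran). In a real application the hard part is not the directive itself but establishing that the loop is actually safe to run in parallel, which is why an understanding of the code is required.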

If I had to guess (and I do), it could be that your program is limited by memory bandwidth rather than CPU. There may be some simple things you can do in your code to help. VTune is very good at uncovering such issues.
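
One example of the kind of simple change that can reduce memory traffic is loop ordering. This is an assumption about what might apply, not a diagnosis of this particular program, and the names below are hypothetical. Fortran stores arrays column by column, so a loop nest should normally vary the first index in the innermost loop.

```fortran
program loop_order
  implicit none
  integer, parameter :: n = 2000          ! hypothetical array size
  double precision, allocatable :: a(:,:)
  double precision :: total
  integer :: i, j

  allocate(a(n, n))
  a = 1.0d0
  total = 0.0d0

  ! Cache-friendly order: i (the first index) is innermost, so
  ! consecutive iterations read adjacent memory and every cache line
  ! brought in over the bus is fully used.
  do j = 1, n
     do i = 1, n
        total = total + a(i, j)
     end do
  end do

  ! Swapping the two loops (j innermost) computes the same sum but
  ! jumps n elements between accesses, so most of each cache line
  ! fetched is wasted.

  print *, 'total =', total
end program loop_order
```

An optimizing compiler may interchange such loops on its own for a nest this simple, but in more complicated code it often cannot, and a profiler is the practical way to find where that is happening.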