Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Slower program on dual Xeon vs single Xeon

Reinaldo_Garcia
Beginner
I have a program that runs slower on a dual Xeon computer (2 x X5677, 3.47 GHz) than on a single Xeon computer (1 x X5677, 3.47 GHz).
Initially I ran the same exe on both computers with the following compiler options:

/nologo /O3 /Qparallel /Qopenmp /Qvec-report0 /Qzero /module:"x64\Release\\" /object:"x64\Release\\" /traceback /libs:static /threads /winapp /c

The execution time is 76.7 hrs on the dual Xeon and 56.7 hrs on the single Xeon.

Then I turned off the parallelization options like this:

/nologo /O3 /Qpar-threshold:0 /Qvec-threshold:0 /assume:nocc_omp /Qvec-report0 /Qzero /module:"x64\Release\\" /object:"x64\Release\\" /traceback /libs:static /threads /winapp /c

With these options the execution times are:

Dual Xeon: 72 hrs, single Xeon: 52 hrs.

In all tests the single processor is consistently faster than the dual processor, even when parallelization is turned off.

I understand that parallelization is not helping anyway, but the thing is, this is legacy code migrated from F77 that was not very efficiently programmed in the first place. We are trying to optimize it a bit, but the dual processor is hurting instead of helping.

Any ideas?

Thanks,

R//G


2 Replies
Martyn_C_Intel
Employee
Hi,
I think there's a misconception here. /Qpar-threshold:0 does not turn auto-parallelization off. It enables more parallelization, including loops for which the performance is not expected to improve (such as loops with low trip counts). To disable auto-parallelization, remove /Qparallel, or use /Qparallel-. Likewise, /Qvec-threshold:0 vectorizes loops that really shouldn't be vectorized. I recommend that you remove it. There's no reason to disable vectorization.

I wouldn't normally use both /Qparallel and /Qopenmp together. If you don't have OpenMP directives, don't use /Qopenmp. If you do, then don't use /Qparallel.

Then, you can compare a build without /Qparallel to a build with /Qparallel, and to a second build with /Qparallel /Qpar-threshold:99. Those are very long run times. Can you test on smaller workloads or data sets?
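To make the comparison concrete, the three builds could be set up roughly as follows. This is only a sketch: the source file name app.f90 and the executable names are hypothetical placeholders, and the remaining options should be taken from your existing project settings.

```
rem Build 1: baseline, no auto-parallelization
ifort /nologo /O3 /traceback /exe:app_serial.exe app.f90

rem Build 2: auto-parallelization with the default threshold
ifort /nologo /O3 /Qparallel /traceback /exe:app_par.exe app.f90

rem Build 3: auto-parallelize only loops the compiler is nearly certain will benefit
ifort /nologo /O3 /Qparallel /Qpar-threshold:99 /traceback /exe:app_par99.exe app.f90
```

Running all three on the same reduced data set, on both machines, should show whether auto-parallelization itself or something else is behind the slowdown.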
Whether /Qparallel helps at all will depend on how your code is structured. Unless you're fortunate enough to have one or two hot and simple kernels (such as matrix multiplications), there's usually a lot more potential in OpenMP if you are willing to spend the time, because you can thread at a higher level.
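As an illustration of threading at a higher level, here is a minimal sketch that puts the OpenMP directive on the outer loop, so each thread works on whole columns rather than on a short inner loop. The program, array and sizes are invented for the example and are not taken from your code; it would be built with /Qopenmp and without /Qparallel.

```fortran
program outer_loop_openmp
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 2000, m = 2000   ! hypothetical problem size
  real(dp), allocatable :: a(:,:)
  real(dp) :: total
  integer :: i, j

  allocate(a(n, m))
  total = 0.0_dp

  ! Thread the outer loop: iterations over j are independent, so each
  ! thread fills complete columns. The sum is combined with a reduction.
  !$omp parallel do private(i) reduction(+:total)
  do j = 1, m
     do i = 1, n
        a(i, j) = sqrt(real(i, dp)) * real(j, dp)
        total = total + a(i, j)
     end do
  end do
  !$omp end parallel do

  print *, 'sum =', total
  deallocate(a)
end program outer_loop_openmp
```

This only pays off when the outer iterations really are independent, which is often the hard part to establish in legacy F77 code with shared COMMON blocks.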
Have you tried to identify your hotspots?

Some useful (I hope) references:
http://software.intel.com/en-us/articles/threading-fortran-applications-for-parallel-performance-on-multi-core-systems/
http://software.intel.com/en-us/articles/automatic-parallelization-with-intel-compilers/
http://software.intel.com/en-us/articles/performance-tools-for-software-developers-auto-parallelization-and-qpar-threshold/

http://software.intel.com/en-us/articles/requirements-for-vectorizable-loops/
Reinaldo_Garcia
Beginner
Thanks Martyn,
We will try the compiler options you suggest and see the difference with a shorter run.
We tried OpenMP but it did not help. This code is not very well structured at all, and I think that is why parallelization fails to generate faster code. We have used OpenMP extensively in other, well-structured codes with great success.
I will report the results of the new tests as soon as they are done. I really appreciate your input.
R//G