Slower program on dual Xeon vs single Xeon

Reinaldo_Garcia · ‎02-02-2011

I have a program that runs slower on a dual Xeon computer (2 x X5677 3.47 GHz) than on a single Xeon (1 x Xeon X5677 3.47 GHz).

Initially I ran the same exe on both computers with the following compiler options:

/nologo /O3 /Qparallel /Qopenmp /Qvec-report0 /Qzero /module:"x64\Release\\" /object:"x64\Release\\" /traceback /libs:static /threads /winapp /c

The execution time is 76.7 hrs. on the dual Xeon and 56.7 hrs. on the single Xeon.

Then I turned off the parallelization options like this:

/nologo /O3 /Qpar-threshold:0 /Qvec-threshold:0 /assume:nocc_omp /Qvec-report0 /Qzero /module:"x64\Release\\" /object:"x64\Release\\" /traceback /libs:static /threads /winapp /c

Then the execution times are:

Dual Xeon: 72 hrs., single Xeon: 52 hrs.

In all tests the single processor is consistently faster than the dual processor, even when parallelization is turned off.

I understand the parallelization is not helping anyway, but the thing is, this is legacy code migrated from F77 that was not very efficiently programmed in the first place. We are trying to optimize it a bit, but the dual processor is hurting instead of helping.

Any ideas?

Thanks,

R//G

Martyn_C_Intel · ‎02-02-2011

Hi,
I think there's a misconception here. /Qpar-threshold:0 does not turn auto-parallelization off; it enables more of it, including loops for which no performance gain is expected (such as loops with low trip counts). To disable auto-parallelization, remove /Qparallel or use /Qparallel-. Likewise, /Qvec-threshold:0 vectorizes loops that really shouldn't be vectorized, so I recommend removing it. There's no reason to disable vectorization. I wouldn't normally use /Qparallel and /Qopenmp together: if you don't have OpenMP directives, don't use /Qopenmp; if you do, don't use /Qparallel.
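
To make the distinction concrete, here is a rough sketch of the three option sets described above. It uses only flags already mentioned in this thread; the remaining options from the original command line (/nologo, /Qzero, /module, and so on) are left out for brevity and would presumably stay unchanged.

```text
Serial baseline (auto-parallelization off, default vectorization):
    /O3

Auto-parallelization only (no OpenMP directives in the source):
    /O3 /Qparallel

OpenMP directives only (no auto-parallelization):
    /O3 /Qopenmp
```

Note that none of these includes /Qpar-threshold:0 or /Qvec-threshold:0.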

Then you can compare a build without /Qparallel to a build with /Qparallel, and to a second build with /Qparallel /Qpar-threshold:99. Those are very long run times. Can you test on smaller workloads or data sets?
Whether /Qparallel helps at all will depend on how your code is structured. Unless you're fortunate enough to have one or two hot and simple kernels (such as matrix multiplications), there's usually a lot more potential in OpenMP if you are willing to spend the time, because you can thread at a higher level.
Have you tried to identify your hotspots?
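
As a rough illustration of what threading at a higher level can mean, the sketch below places one OpenMP directive on an outer loop so that each thread processes whole columns, instead of relying on the compiler to parallelize the inner loop. The program, the array names, and the problem size are made up for illustration and are not taken from your code. It assumes a build with /Qopenmp so that the directive is honored; without that option the directive is treated as a comment and the loop runs serially.

```fortran
program outer_loop_omp
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 2000          ! illustrative size, not from the thread
  real(dp), allocatable :: a(:,:), colsum(:)
  integer :: i, j

  allocate(a(n,n), colsum(n))
  a = 1.0_dp

  ! Thread the OUTER loop: iterations over j are independent, so each
  ! thread gets a block of columns. j is private automatically as the
  ! parallel loop index; i is listed explicitly for clarity.
  !$omp parallel do private(i)
  do j = 1, n
     colsum(j) = 0.0_dp
     do i = 1, n
        colsum(j) = colsum(j) + a(i,j)    ! column-wise access is contiguous in Fortran
     end do
  end do
  !$omp end parallel do

  print *, 'total =', sum(colsum)
end program outer_loop_omp
```

A pattern like this only pays off if the outer iterations are truly independent and each one does enough work to outweigh the threading overhead, which is why finding the hotspots first matters.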

Some useful (I hope) references:
http://software.intel.com/en-us/articles/threading-fortran-applications-for-parallel-performance-on-multi-core-systems/
http://software.intel.com/en-us/articles/automatic-parallelization-with-intel-compilers/
http://software.intel.com/en-us/articles/performance-tools-for-software-developers-auto-parallelization-and-qpar-threshold/
http://software.intel.com/en-us/articles/requirements-for-vectorizable-loops/

Reinaldo_Garcia · ‎02-03-2011

Thanks Martyn:

We will try the compiler options you suggest and see the difference with a shorter run.

We tried OpenMP but it did not help. This code is not well structured at all, and I think that is why parallelization fails to generate faster code. We have used OpenMP extensively in other well-structured codes with great success.

I will report the results of the new tests as soon as they are done. I really appreciate your input.

R//G