
Parallel Performance on Xeon

bala
Beginner
Hello,

I would like to hear the Forum's opinion on a performance issue I am facing. The application I run is CPMD, a freely available ab initio molecular dynamics code. In the following I used CPMD 3.9.2, LAM/MPI 7.1.1, Intel MKL 8.0 (beta), and Intel Fortran Compiler 8.1.028. The operating system on the Nocona machines is Fedora Core 4 (x86_64); on the Pentium 4 machines it is 32-bit Mandrake 10.1. I have also tried a 32-bit OS on the Nocona, with the same performance as reported below.

Here are the performance benchmarks:

---------------------------------------------------------------------------
Case A:

Intel SP2 motherboard, dual Nocona @ 3.2 GHz (1 MB L2 cache), with
4 GB of DDR-333 memory.

Single CPU job : 29 MINUTES 48.84 SECONDS

Two CPU job (Parallel, using LAM-MPI): 26 MINUTES 48.41 SECONDS
(Within the same node)

Two CPU job (Parallel, using LAM-MPI) : 20 MINUTES 10.18 SECONDS
(Across two Xeon nodes)

----------------------------------------------------------------------------
CASE B:
Pentium 4 @ 3.2 GHz machines clustered over Gigabit Ethernet using LAM-MPI
(TP1 motherboard, 1 GB DDR-400 RAM, 1 MB L2 cache)

Single CPU job : 32 MINUTES 0.98 SECONDS
Two CPU job (Parallel, using LAM-MPI): 19 MINUTES 27.10 SECONDS
(Across Gigabit network)
----------------------------------------------------------------------------

All times are wall-clock times. Both motherboards have an 800 MHz front-side bus.
A single-CPU job takes about 400 MB of memory.

Why is the two-CPU parallel job within a single Xeon node (item #2 in Case A) so much slower than the two-CPU job across the Gigabit network?

Is it because the job is limited by memory bandwidth?

Many Thanks!

Cheers,
Bala
TimP
Honored Contributor III
The performance effects you quote look unusually large. Memory bus contention is one possibility. Have you considered whether you have high cache-miss rates or write-combining buffer evictions, and what might be done about them?
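As a rough back-of-envelope check (peak figures, not measurements): an 800 MHz front-side bus with a 64-bit data path tops out around 800e6 x 8 bytes = 6.4 GB/s. On a dual-Nocona board that peak is shared by both processors, while each single-socket Pentium 4 node has it to itself; the Nocona board also has DDR-333 memory versus DDR-400 on the Pentium 4 nodes. If CPMD's dominant phases are bandwidth-bound, two processes in one node would each see roughly half the memory bandwidth of a single process, which would be consistent with the small speedup you measured within the node.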
Where the 8.0 compiler's vectorizer optimizes loops with more than 4 assignments per loop for the Northwood architecture, the default in the 9.0 64-bit compiler has been changed to match Nocona, which handles up to 6 assignments per loop well. During compilation you should therefore see fewer "partial loop vectorization" reports.
When loops with large numbers of assignments are not vectorized, the compiler does not split ("distribute") them to match the number of write-combining buffers.
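To make the idea concrete, here is a hypothetical sketch (not code from CPMD) of what loop distribution does: the same six stores per iteration written as one loop, and then split by hand into two loops with fewer concurrent store streams each.

! Hypothetical illustration only -- not taken from CPMD.
subroutine fused_form(n, x, y, s, a, b, c, d, e, f)
  implicit none
  integer, intent(in)  :: n
  real(8), intent(in)  :: x(n), y(n), s
  real(8), intent(out) :: a(n), b(n), c(n), d(n), e(n), f(n)
  integer :: i
  ! Six distinct output streams in one loop; more than a Northwood-era
  ! core's write-combining buffers can track at once.
  do i = 1, n
     a(i) = x(i) + y(i)
     b(i) = x(i) - y(i)
     c(i) = x(i) * s
     d(i) = y(i) * s
     e(i) = x(i) * y(i)
     f(i) = x(i) + s
  end do
end subroutine fused_form

subroutine distributed_form(n, x, y, s, a, b, c, d, e, f)
  implicit none
  integer, intent(in)  :: n
  real(8), intent(in)  :: x(n), y(n), s
  real(8), intent(out) :: a(n), b(n), c(n), d(n), e(n), f(n)
  integer :: i
  ! The same work distributed into two loops with three store streams
  ! each, which is the kind of split the compiler reports as partial
  ! loop vectorization.
  do i = 1, n
     a(i) = x(i) + y(i)
     b(i) = x(i) - y(i)
     c(i) = x(i) * s
  end do
  do i = 1, n
     d(i) = y(i) * s
     e(i) = x(i) * y(i)
     f(i) = x(i) + s
  end do
end subroutine distributed_form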
bala
Beginner
Hi Tim,
Thanks for your reply. I upgraded the compiler to version 9 on the EM64T machines. However, the two-CPU performance is still the same as before. Please see:

http://www.theochem.ruhr-uni-bochum.de/~axel.kohlmeyer/cpmd-bench.html

There, search for the word "abysmal". The benchmarks reported there mirror my experience. Is there something I can do to salvage the situation? (I have 16 of these dual-Nocona boxes.)

Thanks,
Bala
TimP
Honored Contributor III
As you are working with the source code, you could get a fair idea of the SMP performance scaling by function using gprof profiling. I am somewhat surprised that the web site spends so much space on SMP scaling issues without giving any information about which types of operations are responsible for the limited scaling.
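For example, something along these lines (the -p profiling switch and the cpmd.x executable name are my assumptions; check your compiler documentation and build scripts):

   ifort -p ...                           # rebuild and relink CPMD with profiling enabled
   mpirun -np 2 ./cpmd.x input.in         # run as usual; each process writes a gmon.out
   gprof ./cpmd.x gmon.out > profile.txt  # per-function flat profile and call graph

Note that each process writes gmon.out into its working directory, so run the ranks in separate directories if you want one profile per process.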
I note that a few of the machines showing high SMP scaling run at a low clock speed relative to their memory bus. For example, the zx6000, for the last two years of its production, normally ran at 1.3 GHz on the same memory system as the early 900 MHz examples. I have one of those heat generators in my office.
Is the strategy you use for MKL functions giving good scaling there? According to the web site, it seems you should try running MKL in serial mode, leaving the parallelism to your MPI. Running two MKL threads per CPU with HyperThreading enabled, compared with one thread per CPU with HT disabled, would provide interesting data, particularly if you are able to collect performance data by function. If needed, the relevant environment variables can be adjusted.
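For example (cpmd.x and the input file name are placeholders, and as far as I recall OMP_NUM_THREADS is the threading knob MKL 8 honours):

   export OMP_NUM_THREADS=1           # keep MKL serial inside each MPI process
   mpirun -np 2 ./cpmd.x input.in > output.out

Depending on how LAM propagates the environment, you may need to set the variable on every node (for instance in your shell start-up files) rather than only on the node where you invoke mpirun; setting it to 2 instead would give you the MKL-threaded data point for comparison.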
The web site points out the possibility of scheduler difficulties. If you run with HT enabled and the scheduler is imperfect, it is quite possible to produce the effect you observe. The 2.6 kernel schedulers generally perform much better than 2.4, but I don't know of any comparisons between FC, SuSE, and RHEL4.
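If you are not sure what the kernel is seeing, a quick check (my suggestion, not from the web site) is:

   grep -c '^processor' /proc/cpuinfo            # logical CPUs the scheduler can use
   grep '^physical id' /proc/cpuinfo | sort -u   # distinct physical packages

On a dual-processor box, the first command reporting 4 would mean HyperThreading is still enabled as far as the kernel is concerned.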
bala
Beginner
Hi Tim,
Thanks. I have to leave the office now, so just a quick note: HT is already disabled in the BIOS.
I will check on the MKL issues you have raised tomorrow.

Thanks!

Best,
Bala
ClayB
New Contributor I
Bala -
I would have to agree with Tim that the problem might be memory contention. Moving the second process to a processor in a separate box relieves the contention within a box in exchange for some overhead in passing data between the two processes via MPI. Do you know how much data is passed, and how often MPI is used to share data between the processes?
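If not, one cheap way to get a rough number is to accumulate the time spent in the communication calls with MPI_WTIME. A minimal sketch of the pattern (hypothetical code, not CPMD's; the buffer size and the use of MPI_ALLREDUCE are just for illustration):

! Hypothetical sketch: measure how much wall-clock time goes into a
! communication call.
program comm_cost
  implicit none
  include 'mpif.h'
  integer, parameter :: n = 1000000
  double precision :: sendbuf(n), recvbuf(n), t0, comm_time
  integer :: ierr, step

  call MPI_INIT(ierr)
  sendbuf = 1.0d0
  comm_time = 0.0d0
  do step = 1, 10
     t0 = MPI_WTIME()
     call MPI_ALLREDUCE(sendbuf, recvbuf, n, MPI_DOUBLE_PRECISION, &
                        MPI_SUM, MPI_COMM_WORLD, ierr)
     comm_time = comm_time + (MPI_WTIME() - t0)
  end do
  print *, 'time in MPI_ALLREDUCE (s): ', comm_time
  call MPI_FINALIZE(ierr)
end program comm_cost

Bracketing CPMD's own communication routines the same way would tell you whether the intra-node run loses its time in computation (pointing at memory contention) or in MPI itself.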
If there are computationally intensive portions of the code that do more than just cycle through the data within a process, you might look at threading them to take advantage of the extra processor. Of course, this might just bring up the same memory contention issues that you were seeing with two processes on the same box. In either case, it would be interesting to know what the bus utilization is for the application.
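For the threading route, a minimal loop-level OpenMP sketch (hypothetical, not CPMD's code; the routine and array names are made up, and with ifort this needs the -openmp switch):

! Hypothetical sketch: let the second processor help inside one MPI
! process by threading a hot loop.
subroutine accumulate_density(n, psi, rho)
  implicit none
  integer, intent(in)    :: n
  real(8), intent(in)    :: psi(n)
  real(8), intent(inout) :: rho(n)
  integer :: i
!$omp parallel do
  do i = 1, n
     rho(i) = rho(i) + psi(i) * psi(i)
  end do
!$omp end parallel do
end subroutine accumulate_density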
--clay