Intel® Fortran Compiler

Intel Fortran MPI hyperthreading performance issue

Paluszynski__Radek

Hello everyone,

I have a Dell Precision Tower 7910 with dual Intel® Xeon® E5-2697 v4 processors (18C, 2.3GHz, 3.6GHz Turbo, 2400MHz, 45MB, 145W). I use it to run various programs written in MPI Fortran. Together the two processors have 36 physical cores, but with Hyper-Threading enabled the system reports 72 logical CPUs.

I've always achieved peak performance with my programs at exactly 36 processes. Any higher number of processes resulted in a dramatic slowdown. I took it as a sign that hyperthreading doesn't help, at least for my application.
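In case it is relevant, this is how I check where the ranks land. The launch line is just a sketch: I'm assuming Intel MPI here, and "my_program" stands in for my actual binary:

    # Print the rank-to-core pinning map at startup (Intel MPI)
    I_MPI_DEBUG=4 mpirun -np 36 ./my_program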

But recently a friend ran the same code on a similar workstation with dual Intel Xeon Gold 6148 processors (20C, 2.4GHz, 3.7GHz Turbo, 10.4GT/s 3UPI, 27MB cache, HT, 150W, DDR4-2666). The same program achieved a continuous speedup up to 80 processes!

This makes me wonder whether I need to change some option in my system, or install a different compiler version? I'm puzzled and would appreciate any advice on this issue! Below I'm attaching some details of the two systems (output of cpuinfo and the compiler versions).

 

My system:

=====  Processor composition  =====
Processor name    : Intel(R) Xeon(R)  E5-2697 v4
Packages(sockets) : 2
Cores             : 36
Processors(CPUs)  : 72
Cores per package : 18
Threads per core  : 2

ifort (IFORT) 19.0.0.117 20180804
Copyright (C) 1985-2018 Intel Corporation.  All rights reserved.

 

Friend's system:

Processor name    : Intel(R) Xeon(R) Gold 6148
Packages(sockets) : 2
Cores             : 40
Processors(CPUs)  : 80
Cores per package : 20
Threads per core  : 2

ifort (IFORT) 19.0.3.199 20190206
Copyright (C) 1985-2018 Intel Corporation.  All rights reserved.

 

Juergen_R_R
Valued Contributor I

Have you ruled out trivial causes, e.g. that with more than 36 processes you exceed physical memory and the application starts to swap? That said, I have also observed that running more processes than physical cores slows down execution on some systems.
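A quick way to rule out swapping is to watch memory while the job runs (standard Linux tools, nothing Intel-specific; just a sketch):

    free -h      # watch "available" memory shrink and swap usage grow
    vmstat 1     # non-zero si/so columns mean the machine is actively swapping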

jimdempseyatthecove
Honored Contributor III

This may be a case of architectural differences.

The E5 uses QPI with 2 links, whereas the Gold uses UPI with 3 links.
The E5 has 4 memory channels, whereas the Gold has 6 memory channels.

It is not clear whether all 3 UPI links are utilized on a 2S system (the 3rd link is intended for 4S configurations); however, UPI is the next-generation interconnect, and one would expect it to be faster.

The difference in the number of memory channels may be the choking point.

Was the application compiled differently for each system? (E5 with AVX2, Gold with AVX512)
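For example, the usual ifort target flags differ per machine. A sketch only; the file and program names below are placeholders, and your actual build lines may differ:

    # Broadwell (E5-2697 v4) build
    mpiifort -O3 -xCORE-AVX2 -o my_program my_program.f90
    # Skylake-SP (Gold 6148) build
    mpiifort -O3 -xCORE-AVX512 -o my_program my_program.f90

Note that -xHost picks the best instruction set of the build machine, so even an identical build line can yield different code on the two systems.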

Seeing that you are using MPI, are both systems using the same version of MPI?
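Both are quick to check on each machine (assuming Intel MPI):

    mpirun -V         # should print the Intel MPI library version
    ifort --version   # compiler version used for the build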

And finally: Which system was faster?

Jim Dempsey

 

Paluszynski__Radek

Hi Jim, thanks for your answer.

Yes, the program was compiled separately on the two machines, and the compiler and MPI versions are slightly different, but both are fairly recent.

I would say that for a given number of cores, the application performs similarly on both computers. The main difference is that beyond 36 processes my system begins to choke, while the other one keeps speeding up all the way to 80 processes.
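For what it's worth, the scaling numbers come from plain wall-clock timings of runs like these (a sketch; "my_program" is a placeholder for my actual binary):

    for n in 9 18 36 54 72; do
        /usr/bin/time -p mpirun -np $n ./my_program
    done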

Is there any parallel tuning application that could shed light on what's going on?

I would appreciate all help!

jimdempseyatthecove
Honored Contributor III

This might help:

https://software.intel.com/en-us/vtune-amplifier-help-mpi-code-analysis

Perform the analysis on both machines. I suggest, though, copying the executables (and redistributables) from your system to your friend's system (into separate folders) so that the two runtime environments are as similar as possible.
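On a single node, a hotspots collection across the ranks can be launched roughly like this. The command names are from the 2019-era VTune Amplifier, so check your installed version; "my_program" is a placeholder:

    mpirun -np 36 amplxe-cl -collect hotspots -r vtune_result -- ./my_program
    amplxe-gui vtune_result.$(hostname)   # the result directory gets a per-node suffix under MPI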

There was another post on this forum, which I haven't taken the time to locate, saying that different versions of MPI use different pathways for intra-node (same-node) messaging. This can potentially be a problem when a large amount of inter-process messaging occurs within a node.

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

** afterthought

You might first test on your own machine; maybe something will show up. Also see if you can get the MPI/Hydra redistributables from your friend's machine. If this is/was the problem, it will save you hours of your time.
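To confirm which MPI runtime each binary actually resolves at run time (a standard Linux check; the binary name is a placeholder):

    ldd ./my_program | grep -i mpi    # shows the libmpi*.so the loader actually picks up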

Jim Dempsey
