We have a CFD application compiled on Windows with Intel FORTRAN compiler version 220.127.116.11.
It runs on a Windows 10 workstation with two Xeon E5-2630 v3 processors @ 2.4 GHz, 8 physical cores each (16 in total), and 64 GB of RAM. Hyper-threading is not enabled.
The application is parallelized with OpenMP, mostly with static scheduling. It runs on all 16 cores and utilizes them at almost 100%, measured as CPU_time/(nprocs*clock_time). While it is running, we start a second instance of it, also on all 16 cores. The two instances (simulations) are completely independent of each other. Naturally, each then gets about 50% of the CPU power.
When the second simulation starts, the first one speeds up by a factor of 2 to 2.5. As soon as the second simulation stops, the first one goes back to its original speed.
We are struggling to understand this behavior. Neither run uses any special affinity settings; both use the defaults.
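For readers who want to reproduce the utilization figure quoted above, here is a minimal sketch of the CPU_time/(nprocs*clock_time) ratio. The function name and the timings are made up for illustration; they are not measurements from these runs.

```python
def cpu_utilization(cpu_time_s, clock_time_s, nprocs):
    """Fraction of available CPU capacity used: CPU_time / (nprocs * clock_time)."""
    if clock_time_s <= 0 or nprocs <= 0:
        raise ValueError("clock_time_s and nprocs must be positive")
    return cpu_time_s / (nprocs * clock_time_s)


# Hypothetical timings (seconds), chosen only to mirror the two situations
# described in the post: one instance alone, then two instances sharing.
alone = cpu_utilization(cpu_time_s=1590.0, clock_time_s=100.0, nprocs=16)
shared = cpu_utilization(cpu_time_s=800.0, clock_time_s=100.0, nprocs=16)

print(f"one instance alone:   {alone:.1%}")
print(f"two instances, each:  {shared:.1%}")
```

Keep in mind that this ratio counts a core that is stalled waiting on a memory load as busy, so a value near 100% says nothing about whether the threads are reading local or remote memory.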
Tim, could you please elaborate? Why does the cache size increase when the second simulation starts? More importantly, can the cache size be increased when only one simulation runs?
Also, would you call this behavior 'superlinear speedup'?
Without controlling process and memory affinity, it is extremely difficult to understand the behavior of parallel programs on any system (and it is generally not worth the effort to try).
It is actually not too difficult to come up with hypotheses that would result in the behavior you observe, but without control (or extensive instrumentation), there is no way to evaluate such hypotheses. Here is one:
Your first program uses all 16 cores, but instantiates all of its memory on socket 0. Its "normal" mode of operation is slow because all the threads running on socket 1 are accessing their data remotely. When the second program is started, the operating system packs all the threads of the first program into socket 0, and packs all the threads of the second program into socket 1. Now the first program is accessing local memory and runs much faster. 2x to 2.5x is not out of the question.
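One way to get the kind of control described above is to pin the OpenMP threads before launching each run. The sketch below sets the standard OpenMP 4.0 variables OMP_PLACES, OMP_PROC_BIND and OMP_DISPLAY_ENV in the child environment. It assumes the OpenMP runtime linked into the application honors them (the Intel runtime also has its own KMP_AFFINITY variable); the helper names and the executable name are hypothetical.

```python
import os
import subprocess


def pinned_env(places="cores", bind="spread", display=True):
    """Return a copy of the current environment with OpenMP affinity variables set."""
    env = dict(os.environ)
    env["OMP_PLACES"] = places        # one place per physical core
    env["OMP_PROC_BIND"] = bind       # "spread" or "close": threads stay where placed
    if display:
        env["OMP_DISPLAY_ENV"] = "true"  # ask the runtime to print its settings at startup
    return env


def run_pinned(command, **kwargs):
    """Launch `command` (a list of strings) with pinned OpenMP threads."""
    return subprocess.run(command, env=pinned_env(**kwargs), check=True)


# Usage (hypothetical executable and input file, not run here):
# run_pinned(["cfd_solver.exe", "case.inp"], bind="spread")
```

With threads pinned, repeating the one-instance and two-instance timings would at least show whether the speedup survives when the operating system can no longer move threads between sockets.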
Thank you, John. This makes sense. With memory-access-bound performance, which is the case for our application, it is easy to imagine how the reduction in available CPU power can be more than compensated by faster memory access. We are going to try VTune to analyze the situation.
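The trade-off in this hypothesis can be put into a toy model. It assumes purely memory-bound work, static scheduling (so the slowest threads set the run time), and a single made-up parameter, remote_penalty, for how much more a remote access costs than a local one once latency and contention are included. None of these numbers are measured; the sketch only shows how the arithmetic works out.

```python
def time_spread(work, n_cores, remote_penalty):
    """All data lives on one socket; threads run on every core of both sockets.

    Each core gets work / n_cores. The cores on the other socket pay
    `remote_penalty` per unit of work, and the run ends when they finish.
    """
    return (work / n_cores) * remote_penalty


def time_packed(work, n_local_cores):
    """All threads packed onto the socket that owns the data (local cost = 1)."""
    return work / n_local_cores


def speedup_when_packed(remote_penalty, total_cores=16):
    """Speedup from giving up half the cores in exchange for local memory."""
    work = 1.0  # arbitrary unit; it cancels out of the ratio
    spread = time_spread(work, total_cores, remote_penalty)
    packed = time_packed(work, total_cores // 2)
    return spread / packed


for penalty in (2.0, 4.0, 5.0):  # hypothetical values, not measurements
    print(f"remote_penalty={penalty}: speedup={speedup_when_packed(penalty)}")
```

In this model the speedup is simply remote_penalty / 2: a penalty of 2 is break-even, and the observed 2x to 2.5x would correspond to an effective penalty of 4 to 5. Whether the real machine behaves that way is what the VTune analysis would have to show.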