Intel® oneAPI Math Kernel Library

Strange performance behaviour when using MKL functions in multiple processes

Ton_Kostelijk
Beginner
432 Views
I have a question for you:

Are there any cross-process synchronisation primitives used in the Intel MKL 10.0.5 library?

I ask this because we are getting strange results when using MKL in concurrent processes.


For a customer we are investigating possible performance improvements of their Fortran code. This code will run on Intel Xeon-based blades (Xeon 5430 with 4 cores), with 2 Xeons per blade and a total of 14 blades (so a total of 112 cores).

The Fortran code consists of three major parts.

1) Their own code, which makes up about 20 percent of the wallclock execution time.
2) MKL functions (excluding FFTs), which also make up about 20 percent of the wallclock execution time.
3) MKL FFT calls, which make up about 60 percent of the wallclock execution time.

The initial implementation used the MKL library in its multithreaded form.

This implementation was not fast enough for the required application, so we were asked to look into ways of improving execution time (wallclock time).

Note: They also had an implementation that used a different (singlethreaded) FFT library, with the rest of the code the same (so still using MKL functions for everything except the FFTs). I'll call this one NO_MKL_FFT in the rest of the text.

During our investigations with VTune we discovered that the multithreaded version of MKL performed about 25% better (in wallclock time) than the NO_MKL_FFT version on our test Xeon (a Xeon 5420 at 2.5 GHz).

Since we wanted to know the speedup factor of the multithreaded MKL library over the singlethreaded one, we also ran some tests with the singlethreaded version.

To our surprise, the wallclock time for the singlethreaded MKL version was virtually the same as for the NO_MKL_FFT version.

The conclusion was that using the multithreaded version of MKL only gave us a speed boost of 25% when running on the quadcore Xeon. Since the majority of the runtime (80%) is spent in MKL functions, this was somewhat of a disappointment, especially since we can only optimise the remaining 20% ourselves, which would not gain us much.

To get a better idea of the possible performance gain we tried some tests using a different strategy.
Instead of running one application with internal multithreaded behaviour, we tried running 4 copies of the singlethreaded application concurrently (each process tied to a specific core via its affinity mask).

For this case we expected a speedup factor of between 3 and 4 (we expected some inter-process interference from cache and memory contention, so a full factor of 4 was not seen as realistic).

To our surprise, the gain was only a factor of 1.9: the wallclock time went from 25 seconds (single process) to 54 seconds (four processes), whereas running the four processes sequentially would have taken 4 * 25 = 100 seconds, and 100 / 54 ≈ 1.9.

While this is still significantly better than the 25% speedup of the multithreaded MKL build, it is nowhere near the factor of 3 to 4 we expected from this test.

As a separate test, to verify our assumptions about the possible speedup factor, we also performed this 4-process test with the NO_MKL_FFT build. To our surprise, this test gave a wallclock time of 30.3 seconds for the slowest process (27.4 seconds for the fastest, which happened to be the first process started).

This gives us a speedup factor of 3.3 for the NO_MKL_FFT case, much more in line with what we had expected for the singlethreaded MKL build!

We analysed both cases (and the single-process runs of the singlethreaded versions) with VTune and noticed that the MKL parts of the code were heavily subject to runtime increases when run multi-process. This was the case for both the MKL and NO_MKL_FFT builds.

In the case of the MKL build, the wallclock time of the generic MKL parts increased by a factor of 2.65 and the MKL FFTs by a factor of 1.83, while the own code increased by a factor of 1.66 (as determined for one of the four running processes).

Weighting each part by its contribution, this boils down to an average slowdown of a factor of 2.11, which matches our wallclock time discrepancy.

In the case of NO_MKL_FFT, we see similar behaviour for the own code and the other MKL code, but the FFT code (from the different library) runs in exactly the same wallclock time!

So the slowdown of 15% is caused by the customer's own code and the other MKL functions, not by the FFTs!

We can't explain the difference in behaviour (and the extreme slowdown of the MKL library) from a cache or memory point of view, so we suspect we are dealing with an internal synchronisation delay caused by calling the same functions from different processes.

Can anybody shed light on this issue?

0 Kudos
5 Replies
TimP
Honored Contributor III
432 Views
I don't see any mention of your affinity investigations. If you start more than one process on a node, threaded MKL will not by default take appropriate distinct groups of cores for its processes. If you rely only on the facilities provided with MKL, you would have to set KMP_AFFINITY separately per process so as to take non-conflicting core affinities.
Intel MPI has environment variables, which you must set yourself, that partition the cores on each node appropriately so that OpenMP will not conflict among processes.
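For instance, launching each copy with its own non-conflicting setting might look like the following sketch (Windows cmd syntax; the executable names and proclist core numbers are illustrative, and the Intel OpenMP runtime reads KMP_AFFINITY at startup):

```shell
:: Pin each copy's OpenMP threads to a distinct core before launching it.
:: proclist values must be adjusted to your system's core numbering.
set KMP_AFFINITY=granularity=fine,proclist=[0],explicit
start /b app.exe

set KMP_AFFINITY=granularity=fine,proclist=[1],explicit
start /b app.exe
```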
0 Kudos
Ton_Kostelijk
Beginner
432 Views
Quoting - tim18
I don't see any mention of your affinity investigations. If you start more than one process on a node, threaded MKL will not by default take appropriate distinct groups of cores for its processes. If you rely only on the facilities provided with MKL, you would have to set KMP_AFFINITY separately per process so as to take non-conflicting core affinities.
Intel MPI has environment variables, which you must set yourself, that partition the cores on each node appropriately so that OpenMP will not conflict among processes.


Maybe I didn't make myself quite clear.

We used the sequential version of the MKL library (at link level, using mkl_intel_c.lib, mkl_sequential.lib and mkl_core.lib) and ran separate processes, each with its affinity set to a particular core (first process locked to core 1, second process locked to core 2, etc.).
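(For reference, with MKL 10.x's layered libraries the sequential link line would look something like the sketch below; the libraries are the ones named above, but the compiler driver and source name are assumptions:)

```shell
:: Link the Fortran application against single-threaded (sequential) MKL.
:: mkl_sequential.lib replaces the threading layer, so no OpenMP runtime
:: is pulled in.
ifort app.f90 mkl_intel_c.lib mkl_sequential.lib mkl_core.lib
```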

We checked with VTune that the affinity was indeed locked to the specified core, and it was.

And still we saw these strange delays when running the processes this way, as if the processes were somehow waiting on each other's use of the MKL functions before proceeding.

I've even tried time-shifted execution of the processes (adding a delay of 10 seconds between the start of each process); the first process to start and the last process to finish then have shorter execution times than the middle processes. The first and last processes only partially overlap with the other three, while the middle ones fully overlap with three other processes.

The maximum wallclock time went down from the original 54 seconds to 40 seconds (for the slowest process; 35 seconds for the fastest).

The most likely explanation we can think of is that some sort of low-level synchronisation is taking place in MKL (using cross-process synchronisation primitives, for instance mutexes).


0 Kudos
Ton_Kostelijk
Beginner
432 Views

We have done some more testing, and the cause of this issue seems to be the zcopy function.

We've created a small test program that does nothing but copy data from one array to another using zcopy (the MKL version of zcopy). When running this program in 4 processes at the same time on the Xeon, we see the execution time rise from 35 seconds (single process) to 135 seconds (4 processes simultaneously).

This is an almost fourfold increase in execution time.

We've also done a test with a program containing only FFT code, and there we see an increase of a factor of 2 in execution time for four simultaneous processes compared to a single process.

Checking with VTune shows that, in the single-process case, the work in the executable is split 50/50 between FFT calls and zcopy calls (probably made by the FFT calls).

If the zcopies block each other, you would expect an increase of a factor of 2: since they are only active 50% of the execution time, two processes can execute their zcopies end to end without a significant slowdown, but four cannot.

We've verified this by running only two processes simultaneously; their runtime is then almost equal to that of a single process.

This seems to indicate that at least the zcopy function is blocking multi-process use of the MKL library. Maybe there are more functions that exhibit the same behaviour.

0 Kudos
TimP
Honored Contributor III
432 Views
When you link with mkl_sequential, the OpenMP calls in MKL should be stubbed out. I understand there may still be thread library calls, but those should not have performance implications.
As your performance interaction question appears to be related to memory access, my next question would be whether you have arranged your affinities so as to avoid cache line conflicts between cores. I'm guessing that you have a system on which the BIOS numbers the cores in an alternating fashion. Then, for example, if a cache line is shared between core 0 and core 1 on different sockets, you incur hit-modified stalls as the line is modified on one socket and must be updated on the other. If you are running a recent Linux,
/usr/sbin/irqbalance -debug
should show you which numbered cores share the same cache.
0 Kudos
Ton_Kostelijk
Beginner
432 Views
Quoting - tim18
When you link with mkl_sequential, the OpenMP calls in MKL should be stubbed out. I understand there may still be thread library calls, but those should not have performance implications.
As your performance interaction question appears to be related to memory access, my next question would be whether you have arranged your affinities so as to avoid cache line conflicts between cores. I'm guessing that you have a system on which the BIOS numbers the cores in an alternating fashion. Then, for example, if a cache line is shared between core 0 and core 1 on different sockets, you incur hit-modified stalls as the line is modified on one socket and must be updated on the other. If you are running a recent Linux,
/usr/sbin/irqbalance -debug
should show you which numbered cores share the same cache.

Tim,

We are running on Windows XP and we are using separate processes, not one process with 4 threads. The cache performance values we measure with VTune are all within reasonable limits, so we don't suspect cache thrashing to be the issue.

It really seems to be a synchronisation blocking effect.

We've also reported this issue to Premier Support and haven't heard back from them in 6 days (beyond the acknowledgement that they were looking into it).

So this is either a serious issue or we are last in line >B')


0 Kudos