MKL FFT library performance varies from run to run by almost 100%

hello_world · ‎07-02-2013

Hi there,

I'm trying to use the MKL 1D FFT library: I call a batch of 1M single-precision FFTs, each of size 1K.

If I run only the library call, the performance is very steady and very fast, about 0.3 seconds on my machine.

However, if I include the library call in my multi-threaded application, the time for the call varies from 0.3 to 0.6 seconds, with 0.5 seconds occurring most often.

Has anyone else experienced this? Or am I making a mistake, and is there a way to get good, steady performance?

Thanks in advance!

TimP · ‎07-02-2013

If you are running on a platform without a single unified cache, it might be particularly important to make a suitable setting of KMP_AFFINITY, assuming no other jobs are running.

If you call the threaded MKL from a thread which is not an OpenMP thread, you run into two problems: you may over-subscribe the hardware threads, and OpenMP will not be aware of your own threads. It is then your responsibility to control the number of threads in MKL as well as in your application.

I hope these possibilities convince you that some specifics are needed.

hello_world · ‎07-02-2013

While the batched FFT is running, no other jobs are running. The FFT is called from the main thread, and after that there is some parallel work using pthreads.

I tried to modify the KMP_AFFINITY setting according to:

http://software.intel.com/en-us/articles/using-kmp-affinity-to-create-openmp-thread-mapping-to-os-proc-ids

setenv KMP_AFFINITY "verbose,granularity=fine,proclist=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],explicit"

My machine has two Xeon E5-2690 processors.

I also tried proclist=[0,2,4,6,8,10,12,14,1,3,5,7,9,11,13,15], but the performance still varies from 0.3 sec to 0.6 sec for both settings.

Could you please give some hint about a "suitable" setting? Hyperthreading has already been disabled on my machine. Thanks! :-)

Ying_H_Intel · ‎07-02-2013

Hi hello world,

Are you linking the threaded MKL or the sequential MKL? If sequential, the affinity setting is not needed.

Several factors can matter, such as memory alignment, the FFT usage model, KMP_AFFINITY, and how the timing is done, as described in the article http://software.intel.com/en-us/articles/tuning-the-intel-mkl-dft-functions-performance-on-intel-xeon-phi-coprocessors . It is written for the Xeon Phi coprocessor, but some of the tips apply here too. For example, how do you allocate the 1M data sets? Is each 1K data set aligned?

You mentioned that the performance is stable if you run only the library call, but varies once the call is included in your application. Could you please show your usage model and timing model, or provide a simple test code?
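A simple timing model for such a test could look like the sketch below. It repeats a workload several times and records the minimum, maximum and mean wall-clock time, which makes run-to-run variation visible. The names `time_workload`, `timing_stats` and `dummy_workload` are hypothetical; the dummy loop only stands in for the batched FFT call so that the sketch runs without MKL.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: in the real application this would wrap the
   batched MKL FFT call. */
typedef void (*workload_fn)(void *arg);

typedef struct {
    double min_s;   /* fastest run, seconds */
    double max_s;   /* slowest run, seconds */
    double mean_s;  /* average over all runs, seconds */
} timing_stats;

/* Wall-clock time in seconds from a monotonic clock. */
double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Run the workload `runs` times and collect min, max and mean elapsed time. */
timing_stats time_workload(workload_fn fn, void *arg, int runs)
{
    assert(fn != NULL && runs > 0);
    timing_stats st = { 0.0, 0.0, 0.0 };
    double sum = 0.0;
    for (int i = 0; i < runs; ++i) {
        double t0 = now_seconds();
        fn(arg);
        double dt = now_seconds() - t0;
        if (i == 0 || dt < st.min_s) st.min_s = dt;
        if (i == 0 || dt > st.max_s) st.max_s = dt;
        sum += dt;
    }
    st.mean_s = sum / runs;
    return st;
}

/* Placeholder workload (an assumption, not the FFT itself). */
void dummy_workload(void *arg)
{
    volatile double *acc = (volatile double *)arg;
    for (int i = 0; i < 1000000; ++i)
        *acc += i * 0.5;
}
```

Calling `time_workload(dummy_workload, &acc, 20)` once in a standalone program and once inside the full application, and comparing `min_s` with `max_s`, would show whether the spread comes from the surrounding code.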

Best Regards,

Ying

TimP · ‎07-02-2013

Your KMP_AFFINITY settings make some sense if you have disabled HyperThreading, but you would need to set OMP_NUM_THREADS so that you don't exceed 16 threads counting your simultaneously active pthreads. If you pinned your pthreads to specific cores, I believe you would want to omit those from the proclist. If you didn't pin the pthreads, you have a chance that the OpenMP mechanism will pin them to different cores from the MKL threads, and you could simply use KMP_AFFINITY=compact.

The hints Ying gave about aligning the buffers could be significant.
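Pinning the application's own threads, as mentioned above, could be sketched as follows. This assumes Linux with glibc; the helper names are hypothetical. Each pthread would call `pin_self_to_cpu` at the top of its start routine, because `sched_setaffinity` with a pid of 0 applies to the calling thread (`pthread_setaffinity_np` is the pthread-level equivalent).

```c
#define _GNU_SOURCE
#include <sched.h>

/* Return the lowest CPU index the calling thread may run on, or -1 on error. */
int first_allowed_cpu(void)
{
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        if (CPU_ISSET(cpu, &allowed))
            return cpu;
    return -1;
}

/* Pin the calling thread to a single CPU. Returns 0 on success, -1 on failure. */
int pin_self_to_cpu(int cpu)
{
    cpu_set_t one;
    if (cpu < 0 || cpu >= CPU_SETSIZE)
        return -1;
    CPU_ZERO(&one);
    CPU_SET(cpu, &one);
    return sched_setaffinity(0, sizeof(one), &one) == 0 ? 0 : -1;
}

/* Number of CPUs the calling thread is currently allowed to run on. */
int allowed_cpu_count(void)
{
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;
    return CPU_COUNT(&allowed);
}
```

The cores used this way are the ones that would then be left out of the KMP_AFFINITY proclist.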

hello_world · ‎07-03-2013

Thanks for your reply! I had a look at Ying's post, and it seems that two things may have affected the performance: 1) in my original code the memory was 16-byte aligned rather than 64-byte aligned; 2) I used g++ rather than icpc.

The combination of 64-byte memory alignment and icpc -mkl -openmp gives steadier library performance than g++.

Using icpc gives 0.3-0.4 secs, but using g++ gives 0.3-0.6 secs.

However, specifying KMP_AFFINITY seems to degrade the pthread part of my code significantly.

My code is something like:

a) batched 1D Forward FFT using threaded MKL

b) pthread work

c) batched 1D inverse FFT using threaded MKL

1) If I don't set OMP_NUM_THREADS to the full number of cores, the MKL FFT doesn't run at full speed.

2) If I set OMP_NUM_THREADS to the full number of cores, my intermediate pthread work is significantly slowed down.
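One possible explanation for point 2, not confirmed anywhere in this thread: the Intel OpenMP runtime keeps its worker threads spinning for a short time after a parallel region ends (controlled by KMP_BLOCKTIME, 200 ms by default), and those spinning threads can compete with the pthreads in step b). The sketch below sets the relevant environment variables from inside the program. It has to run before the first MKL or OpenMP call, because the runtime reads them when it initializes. The function name `configure_openmp_env` is hypothetical, and whether this helps in this application is untested.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>

/* Set OpenMP runtime variables before the runtime starts.
   num_threads: thread count as a string, e.g. "16" for the 16 cores
   mentioned in this thread.
   KMP_BLOCKTIME=0 asks idle Intel OpenMP workers to sleep immediately
   instead of spinning after a parallel region.
   Returns 0 on success, -1 on failure. */
int configure_openmp_env(const char *num_threads)
{
    assert(num_threads != NULL);
    if (setenv("OMP_NUM_THREADS", num_threads, 1) != 0)
        return -1;
    if (setenv("KMP_BLOCKTIME", "0", 1) != 0)
        return -1;
    return 0;
}
```

The same effect can be had from the shell with setenv, in the same way as KMP_AFFINITY earlier in the thread.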

hello_world · ‎07-03-2013

Hi Ying,

I tried the optimizations from the link you posted (where applicable). It looks like the performance is steady and reasonably good if the memory is 64-byte aligned. I used to have it 16-byte aligned. Thanks!
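For reference, a 64-byte aligned allocation could be sketched like this. It assumes single-precision complex data (two floats per point) and uses POSIX `posix_memalign`; the helper names are hypothetical. MKL's own `mkl_malloc(size, 64)` with `mkl_free` would be an alternative when linking against MKL.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define FFT_ALIGNMENT 64  /* bytes */

/* Allocate room for `howmany` single-precision complex transforms of
   length n, with the base address aligned to 64 bytes.
   Returns NULL on failure; release with free(). */
float *alloc_fft_batch(size_t n, size_t howmany)
{
    assert(n > 0 && howmany > 0);
    void *p = NULL;
    size_t bytes = n * howmany * 2 * sizeof(float);
    if (posix_memalign(&p, FFT_ALIGNMENT, bytes) != 0)
        return NULL;
    return (float *)p;
}

/* Nonzero if p is a multiple of `alignment` bytes. */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}
```

With n = 1024, each transform occupies 8192 bytes, a multiple of 64, so every transform in the batch starts on a 64-byte boundary as long as the base pointer does.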

Next I'm going to see how Tip 5 (using huge memory pages) affects the performance. :-)

Best Regards,

Jing

Ying_H_Intel · ‎08-06-2013

Hi Jing,

any result?

I tried it on Xeon Phi. Since the OS already supported transparent huge pages, the performance changed only very slightly.
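On Linux, transparent huge pages can also be requested per buffer, as in the sketch below. It assumes a kernel built with transparent huge page support; the call is only a hint, and whether it is honoured depends on the setting in /sys/kernel/mm/transparent_hugepage/enabled. The helper names are hypothetical.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Map `bytes` of anonymous memory and ask the kernel to back it with
   transparent huge pages. *hint_ok is set to 1 if the kernel accepted the
   hint and 0 otherwise. Returns NULL if the mapping itself fails. */
void *alloc_thp_hint(size_t bytes, int *hint_ok)
{
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
#ifdef MADV_HUGEPAGE
    int rc = madvise(p, bytes, MADV_HUGEPAGE);
    if (hint_ok) *hint_ok = (rc == 0);
#else
    if (hint_ok) *hint_ok = 0;
#endif
    return p;
}

/* Release a buffer obtained from alloc_thp_hint. */
void free_thp(void *p, size_t bytes)
{
    if (p != NULL)
        munmap(p, bytes);
}
```

Memory returned by `mmap` is page aligned, so it also satisfies the 64-byte alignment discussed earlier.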

Best Regards,

Ying

hello_world · ‎08-06-2013

Hi Ying,

I tried to use huge pages, but that needs root privileges and may affect the performance of other parts of my code.

So I tried other ways to get steady, good performance: icpc -mkl -openmp -Os gave me pretty good performance, so I'm sticking with it. :-)

Thanks for your help!!:-)

Best,

Jing

SergeyKostrov · ‎08-06-2013

>>...However, if I include the library call in my application, which is multi-threaded, the performance of the library
>>call would vary from 0.3-0.6 seconds with 0.5 seconds occurring most often...

Please verify whether Virtual Memory is used when execution slows. Also, it is not clear what stack size is set for OpenMP threads in your environment.

hello_world · ‎08-07-2013

Hi Sergey,

could you elaborate a little bit about the two points you made? Thanks!!

Best,

Jing

SergeyKostrov · ‎08-07-2013

>>>>Please verify if Virtual Memory is used when execution slows. Also, it is Not clear what stack size value is set for
>>>>OpenMP threads in your environment...
>>
>>...could you elaborate a little bit about the two points you made?

1. Virtual Memory (VM) settings can be verified in the System applet of Control Panel on Windows. On Linux, a similar configuration utility needs to be used to verify VM settings.

2. At runtime, the stack size for OpenMP threads can be checked as follows:

...
#include <stdio.h>
#include <stdlib.h>
...
/* getenv returns NULL when the variable is not set */
const char *omp_stack = getenv( "OMP_STACKSIZE" );
printf( "OMP_STACKSIZE=%s\n", omp_stack ? omp_stack : "(not set)" );
...
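For point 1 on Linux, one way to check from inside the program whether paging is involved is to sample the process's major page fault count before and after the FFT call. This is a sketch using POSIX `getrusage`; the function name `major_faults` is hypothetical.

```c
#include <sys/resource.h>

/* Major page faults (faults that required I/O, such as a swap-in) charged
   to this process so far, or -1 on error. */
long major_faults(void)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_majflt;
}
```

If the count grows across a slow run but not across a fast one, paging is a likely contributor.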

SergeyKostrov · ‎08-09-2013

This is a follow-up to point 2 of my previous post. KMP_STACKSIZE can be checked in the same way:

...
/* getenv returns NULL when the variable is not set */
const char *kmp_stack = getenv( "KMP_STACKSIZE" );
printf( "KMP_STACKSIZE=%s\n", kmp_stack ? kmp_stack : "(not set)" );
...