Hi there,
I'm trying to use the MKL 1D FFT library; for example, I run a batch of 1M FFTs, each of length 1K, in MKL single precision.
If I just run the library call on its own, the performance is very steady and very fast, say 0.3 seconds on my machine.
However, if I include the library call in my application, which is multi-threaded, the time for the library call varies from 0.3 to 0.6 seconds, with 0.5 seconds occurring most often.
I was wondering whether anyone else has experienced this, or whether I am making a mistake, and whether there is a way to achieve steady, good performance?
Thanks in advance!
If you are running on a platform without a single unified cache, it may be particularly important to choose a suitable KMP_AFFINITY setting, assuming no other jobs are running.
If you call the threaded MKL from a thread that is not an OpenMP thread, you risk oversubscribing the hardware threads, and your threading will not be recognized by OpenMP. In that case it is your responsibility to control the number of threads in MKL as well as in your application.
I hope these possibilities convince you that some specifics are needed.
TimP (Intel) wrote:
If you are running on a platform without a single unified cache, it may be particularly important to choose a suitable KMP_AFFINITY setting, assuming no other jobs are running.
If you call the threaded MKL from a thread that is not an OpenMP thread, you risk oversubscribing the hardware threads, and your threading will not be recognized by OpenMP. In that case it is your responsibility to control the number of threads in MKL as well as in your application.
I hope these possibilities convince you that some specifics are needed.
While the batched FFT is running, no other jobs are running. The FFT is called from the main thread, but after that there is some parallel work using Pthreads.
I tried to modify the KMP_AFFINITY setting as follows:
setenv KMP_AFFINITY "verbose,granularity=fine,proclist=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],explicit"
My machine has two Xeon E5-2690 processors (8 cores each).
I also tried proclist=[0,2,4,6,8,10,12,14,1,3,5,7,9,11,13,15], but the performance still varies from 0.3 to 0.6 seconds with both settings.
Could you please give a hint about what a "suitable" setting would be? Hyper-Threading is already disabled on my machine. Thanks! :-)
Hi hello world,
Are you linking the threaded MKL or the sequential MKL? If sequential, then KMP_AFFINITY is not needed.
Several factors can affect performance, such as memory alignment, the FFT usage model, KMP_AFFINITY, and timing methodology, as described in the article http://software.intel.com/en-us/articles/tuning-the-intel-mkl-dft-functions-performance-on-intel-xeon-phi-coprocessors . It is written for the Xeon Phi coprocessor, but some of the tips still apply. For example, how do you allocate the 1M data set? Is each 1K transform aligned?
You mentioned that if you just run the library call the performance is stable, but that when you include it in your application the performance varies. Could you please show the usage model and timing model, or provide a simple test code?
Best Regards,
Ying
Your KMP_AFFINITY settings make some sense if you have disabled HyperThreading, but you would need to set OMP_NUM_THREADS so that you don't exceed 16 threads counting your simultaneously active pthreads. If you pinned your pthreads to specific cores, I believe you would want to omit those from the proclist. If you didn't pin the pthreads, you have a chance that the OpenMP mechanism will pin them to different cores from the MKL threads, and you could simply use KMP_AFFINITY=compact.
The hints Ying gave about aligning the buffers could be significant.
TimP (Intel) wrote:
Your KMP_AFFINITY settings make some sense if you have disabled HyperThreading, but you would need to set OMP_NUM_THREADS so that you don't exceed 16 threads counting your simultaneously active pthreads. If you pinned your pthreads to specific cores, I believe you would want to omit those from the proclist. If you didn't pin the pthreads, you have a chance that the OpenMP mechanism will pin them to different cores from the MKL threads, and you could simply use KMP_AFFINITY=compact.
The hints Ying gave about aligning the buffers could be significant.
Thanks for your reply! I had a look at Ying's post, and it seems there are two things that may have affected performance: 1) in my original code the memory was 16-byte aligned rather than 64-byte aligned; 2) I used g++ rather than icpc.
The combination of 64-byte memory alignment and icpc -mkl -openmp gives steadier library performance than g++: icpc gives 0.3-0.4 seconds, while g++ gives 0.3-0.6 seconds.
However, specifying KMP_AFFINITY seems to degrade my pthread code significantly.
My code is something like:
a) batched 1D Forward FFT using threaded MKL
b) pthread work
c) batched 1D inverse FFT using threaded MKL
1) If I don't make OMP_NUM_THREADS cover all the cores, the MKL FFT doesn't run at full speed.
2) If I do set OMP_NUM_THREADS to all the cores, my intermediate pthread work is significantly slowed down.
Ying H (Intel) wrote:
Hi hello world,
Are you linking the threaded MKL or the sequential MKL? If sequential, then KMP_AFFINITY is not needed.
Several factors can affect performance, such as memory alignment, the FFT usage model, KMP_AFFINITY, and timing methodology, as described in the article http://software.intel.com/en-us/articles/tuning-the-intel-mkl-dft-functions-performance-on-intel-xeon-phi-coprocessors . It is written for the Xeon Phi coprocessor, but some of the tips still apply. For example, how do you allocate the 1M data set? Is each 1K transform aligned?
You mentioned that if you just run the library call the performance is stable, but that when you include it in your application the performance varies. Could you please show the usage model and timing model, or provide a simple test code?
Best Regards,
Ying
Hi Ying,
I tried the optimizations from the link you posted (where applicable). It looks like if the memory is 64-byte aligned, the performance is steady and reasonably good; I previously had it aligned to 16 bytes. Thanks!
I'm now trying to see how Tip 5, using huge memory pages, affects the performance. :-)
Best Regards,
Jing
Hi Jing,
Any results?
I had tried it on Xeon Phi. Since the OS already supports transparent huge pages, the performance changed only slightly.
Best Regards,
Ying
Hi Ying,
I tried to use huge pages, but that requires root privileges and may affect the performance of other parts of my code.
So I tried other ways to get steady, good performance: icpc -mkl -openmp -Os gave me pretty good performance, and I'll just stick with that. :-)
Thanks for your help!!:-)
Best,
Jing
Ying H (Intel) wrote:
Hi Jing,
Any results?
I had tried it on Xeon Phi. Since the OS already supports transparent huge pages, the performance changed only slightly.
Best Regards,
Ying
Hi Sergey,
Could you elaborate a little on the two points you made? Thanks!!
Best,
Jing
Sergey Kostrov wrote:
>>...However, if I include the library call in my application, which is multi-threaded, the performance of the library
>>call would vary from 0.3-0.6 seconds with 0.5 seconds occurring most often...
Please verify whether virtual memory (paging) is in use when execution slows down. Also, it is not clear what stack size is set for OpenMP threads in your environment.