Forum: Intel® oneAPI Math Kernel Library

MKL fftw3 thread safety

MGRAV — Mon, 05 Mar 2018 09:05:28 GMT

Is the MKL FFTW3 wrapper completely thread-safe?
I suppose that it respects at least FFTW3's own thread safety. That means basically everything except plan creation.

Does the MKL interface make plan creation thread-safe? If we need to create a plan for each thread on KNL with a lock around the plan creation, it will take ages!


Ying_H_Intel — Mon, 12 Mar 2018 01:42:00 GMT

Hi Mathieu,

The answer is a little complex. Let's analyse the situation:

First factor: the FFTW website makes some statements about thread safety. The fftw_execute function is thread-safe; for everything else it says:

All other routines (e.g. the planner) should only be called from one thread at a time. So, for example, you can wrap a semaphore lock around any calls to the planner; even more simply, you can just create all of your plans from one thread. We do not think this should be an important restriction (FFTW is designed for the situation where the only performance-sensitive code is the actual execution of the transform), and the benefits of shared data between plans are great.

The FFTW planner is intended to be called from a single thread. If you really must call it from multiple threads, you are expected to grab whatever lock makes sense for your application, with the understanding that you may be holding that lock for a long time, which is undesirable.

In other words, as long as the FFTW planner is called from a single thread, there is no thread-safety problem.
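The "create all of your plans from one thread" advice quoted above can be sketched as follows. This is only an illustration: demo_plan, demo_make_plan and demo_execute are hypothetical stand-ins, not FFTW or MKL functions. A real program would store an fftw_plan, create it with a planner call such as fftw_plan_dft_1d, and run it with fftw_execute. The pragma only takes effect when the code is compiled with OpenMP enabled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an FFT plan (a real program would hold an fftw_plan). */
typedef struct { size_t n; int executions; } demo_plan;

/* Stand-in for the planner: must only run on one thread at a time. */
static demo_plan demo_make_plan(size_t n)
{
    demo_plan p = { n, 0 };
    return p;
}

/* Stand-in for fftw_execute: touches only the plan it is given. */
static void demo_execute(demo_plan *p)
{
    p->executions += 1;
}

/* Create every plan from the calling thread, then execute them in parallel. */
static void plan_then_execute(demo_plan *plans, const size_t *sizes, int count)
{
    assert(plans != NULL && sizes != NULL && count >= 0);

    /* Planning phase: strictly sequential, so no lock is needed. */
    for (int i = 0; i < count; ++i)
        plans[i] = demo_make_plan(sizes[i]);

    /* Execution phase: each iteration works on its own plan. */
    #pragma omp parallel for
    for (int i = 0; i < count; ++i)
        demo_execute(&plans[i]);
}
```

The trade-off is that the thread creating a plan is not the thread that later uses it, which may matter when memory placement depends on a first-touch policy.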

Second factor: the MKL user guide says that Intel MKL is thread-safe, which means that all Intel MKL functions (except the deprecated LAPACK routine ?lacon) work correctly during simultaneous execution by multiple threads. In particular, any chunk of threaded Intel MKL code provides access for multiple threads to the same shared data, while permitting only one thread at any given time to access a shared piece of data. Therefore, you can call Intel MKL from multiple threads and not worry about the function instances interfering with each other.

For the FFTW wrapper, we haven't changed the functionality of the planner part: if the FFTW plans are created sequentially, then there is no thread-safety issue.

So the question may be how you implement your multithreading. Could you please describe your FFTW usage scenario?

For example, if it is MPI, then there is no thread-safety problem. You also mentioned "If we need to create a plan for each thread of KNL": how many threads do you run at the same time, and how do you link MKL?

Best Regards,

Ying


MGRAV — Mon, 12 Mar 2018 14:46:32 GMT

Hi Ying,

Thanks for your answer and all this information, with special thanks for pointing out to me that "destroy_plan" is not thread-safe either.

Currently, the multithreading is implemented with OpenMP. Under some specific conditions (when all the data have the same size) I can use fftw_execute_dft in each thread with a single plan. But in the general case, each thread has its own plan, because the size of the data can be different for each thread. Currently, I use an OMP critical section, so basically a lock, around the 3 plans that I need for each thread.

The creation of each plan is tied to its memory allocation and initialization (with a first-touch policy) and is done by the thread that will use it later.
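The locking scheme described here can be sketched as below. It is a minimal illustration, not the poster's actual code: stub_plan and stub_planner are hypothetical stand-ins for an fftw_plan and a planner call, and the critical-section name fft_planner is arbitrary. The pragmas are ignored when OpenMP is not enabled, in which case the loop simply runs sequentially.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a plan (a real program would store an fftw_plan). */
typedef struct { size_t n; } stub_plan;

/* Shared planner state; only safe because callers serialize access. */
static int planner_calls = 0;

/* Stand-in for a planner call; deliberately not thread-safe on its own. */
static stub_plan stub_planner(size_t n)
{
    planner_calls += 1;
    stub_plan p = { n };
    return p;
}

/* Each loop iteration creates its own plan inside a named critical section. */
static void create_plans_locked(stub_plan *plans, const size_t *sizes, int count)
{
    assert(plans != NULL && sizes != NULL && count >= 0);

    #pragma omp parallel for
    for (int i = 0; i < count; ++i) {
        stub_plan local;

        /* Only one thread at a time may be inside the planner. */
        #pragma omp critical (fft_planner)
        {
            local = stub_planner(sizes[i]);
        }

        plans[i] = local;   /* outside the lock: each slot has a single writer */
    }
}
```

Because every plan creation waits on the same lock, the total planning time grows with the number of threads, which is the cost being discussed in this thread.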

The number of threads depends on the size of the problem, which can be nD; data sizes start from 5k and go up to 30M, or even more. I have several levels of parallelization: the bigger the data, the more I parallelize the FFT so as not to exceed MCDRAM capacity.

Tests show that for a small 2D image such as 256*256, 128 threads are slightly more efficient than 64. However, the increased number of plans (created in the critical section) destroys all the benefit.

I am not sure I understand this well:

"For FFTW wrapper, we haven't changed the functionality of FFTW wrapper planner part, if the FFTW plan was implemented in sequential,then there is no thread-safe issue."

I suppose that the FFTW wrapper uses DftiCreateDescriptor in the background, no? I suppose that DftiCreateDescriptor is thread-safe, no?


Ying_H_Intel — Tue, 13 Mar 2018 08:53:32 GMT

Hi Mathieu,

Right, the FFTW wrapper uses DftiCreateDescriptor in the background. That part should be thread-safe.

You mentioned that under some specific conditions you can use fftw_execute_dft in each thread with a single plan (when all the data have the same size), but that in the general case each thread has its own plan because the data size can differ per thread. If so, why is the lock needed? It is supposed to be OK to use them in parallel.

Moreover, how do you link MKL (sequential or parallel), and which OpenMP runtime do you use (for example Intel's OpenMP implementation or another one)?

Best Regards,

Ying


MGRAV — Tue, 13 Mar 2018 14:27:18 GMT

Hi Ying,

The execution is always thread-safe according to the FFTW documentation.

My only thread-safety issue is the creation of plans.

I link with the default MKL, so the parallel version, and I use parallelized FFTs too.

Best,

Mathieu


Dmitry_Z_Intel1 — Tue, 24 Apr 2018 04:00:46 GMT

Hi Mathieu!

>Is mkl fftw3 wrapper completely thread safe ?
Generally speaking - NO. But there are cases when plan creation will work correctly.

First of all, a pointer to the plan should be defined for each thread individually. In other words, fftw_plan should be defined INSIDE a custom OMP loop, otherwise the behavior is undefined. This is a requirement.

The only shared object during fftw plan creation in Intel(R) MKL FFTW3 wrappers is a special structure defined in fftw_version.c file. The critical variable there is nthreads - the number of threads used during plan computation. Both thread-safety and functional correctness depend on the value of this variable.

First case: the user wants to run each plan with one thread only (the sequential case). This should work. We recommend linking the application with the mkl_sequential library to avoid possible side effects.

Second case: the user wants to run each plan with a constant number of several threads (the nested OMP case). This case may work, with limitations. To activate several threads for a compute section in a custom OMP loop, the user needs to link the application with the mkl_xxx_thread library and call the "mkl_set_num_threads_local(numThreads)" function. Otherwise, the behavior should be the same as in the first case. numThreads should be the same for all threads. It is desirable to avoid over-subscription, which is why:
(user's # of threads for OMP loop) * (# of Intel(R) MKL threads) <= (# of available machine threads);
With this configuration, the user is also required to set the environment variable "OMP_NESTED=true". Otherwise, functional correctness is not guaranteed.
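The over-subscription constraint above is simple integer arithmetic. As a rough sketch, the hypothetical helper below (max_inner_threads is an illustrative name, not an MKL function) computes the largest per-plan thread count that satisfies the inequality for a given number of outer OMP threads. The result would be the value passed as numThreads to mkl_set_num_threads_local in every outer thread.

```c
#include <assert.h>

/* Largest inner (MKL) thread count such that
   outer_threads * inner <= hw_threads.
   Returns 1 when outer_threads alone already exceeds hw_threads;
   in that situation the machine is over-subscribed regardless. */
static int max_inner_threads(int hw_threads, int outer_threads)
{
    assert(hw_threads > 0 && outer_threads > 0);
    int inner = hw_threads / outer_threads;
    return inner > 0 ? inner : 1;
}
```

For instance, assuming a machine with 256 hardware threads (an illustrative figure, not one given in this thread), 64 outer threads leave room for 4 MKL threads per plan, and 128 outer threads leave room for 2.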

Third case: the user wants to run each plan with a different number of threads (the complicated nested OMP case). This case doesn't work, because the actual number of threads will be overwritten each time a thread creates a new plan. The behavior of this configuration is undefined.

On request, the Intel(R) MKL FFT team may provide examples in one of the next releases that describe both supported cases and how to write them.

Thank you.


MGRAV — Mon, 30 Apr 2018 10:56:01 GMT

Hi Dmitry,

>Generally speaking - NO. But there are cases when plan creation will work correctly.

I like this type of answer !

>First of all, a pointer-to-the-plan should be defined for each thread personally. In other words, fftw_plan should be defined INSIDE a custom OMP loop, otherwise the behavior is undefined. This is a requirement.

What does that mean exactly? That I cannot create a plan with another thread? Memory is memory, no? Or do I just need a different plan for each thread, while still being able to create all of them in the main thread?
Currently, I create the plans and use them in two different OMP sections; I don't see why that would be an issue.

>First case: user wants to run each plan with one thread only (sequential) case. This should work. We recommend to link an application with mkl_sequential library to avoid possible side effects.

Isn't the performance similar between mkl_sequential and mkl_parallel when using a single thread?

Should I call mkl_set_num_threads_local() in each thread, or can I set it in the main thread before the OMP section and have the value affect each thread?
Interestingly, mkl_set_num_threads() works fine for me!

Thanks a lot for your help,

Mathieu

Ying_H_Intel — Thu, 03 May 2018 08:12:21 GMT

Hi Mathieu,

>What does mean exactly, I cannot create a plan with another thread ? Memory is memory, no ? Or just I need to have a different plan for each thread, but I can create all of them in the main thread.

[Ying] Yes, it should be OK for you to create a plan with another thread. Dmitry just wanted to emphasize that each thread needs its own plan.

>Performance aren't similar between mkl_sequential and mkl_parallel using a single thread ?

>Should I use mkl_set_num_threads_local() in each thread, or can I set it in the main thread before the OMP section, and have the value affecting each thread ?

[Ying] The performance should be similar with 1 thread.

mkl_set_num_threads_local does the trick; it should be OK. I am not sure what number you are setting, but please refer to the MKL developer guide:

CAUTION:
If your application is threaded with OpenMP* and parallelization of Intel MKL is based on nested
OpenMP parallelism, different OpenMP parallel regions reuse OpenMP threads. Therefore a thread-local
setting in one OpenMP parallel region may continue to affect not only the master thread after the
parallel region ends, but also subsequent parallel regions. To avoid performance implications of this
side effect, reset the thread-local number of threads before leaving the OpenMP parallel region (see
Examples for how to do it).

This example shows how to avoid the side effect of a thread-local number of threads by reverting to the
global setting:

    #include "omp.h"
    #include "mkl.h"
    …
    mkl_set_num_threads(16);
    my_compute_using_mkl();     // Intel MKL functions use up to 16 threads
    #pragma omp parallel num_threads(2)
    {
        if (0 == omp_get_thread_num())
            mkl_set_num_threads_local(4);
        else
            mkl_set_num_threads_local(12);
        my_compute_using_mkl(); // Intel MKL functions use up to 4 threads on thread 0
                                // and up to 12 threads on thread 1
    }
    my_compute_using_mkl();     // Intel MKL functions use up to 4 threads (!)
    mkl_set_num_threads_local( 0 ); // make master thread use global setting
    my_compute_using_mkl();     // Intel MKL functions use up to 16 threads

This example shows how to avoid the side effect of a thread-local number of threads by saving and restoring
the existing setting:

    #include "mkl.h"
    void my_compute( int nt )
    {
        int save = mkl_set_num_threads_local( nt ); // save the Intel MKL number of threads
        my_compute_using_mkl(); // Intel MKL functions use up to nt threads on this thread
        mkl_set_num_threads_local( save ); // restore the Intel MKL number of threads
    }

Best Regards,
Ying