Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

Poor speed up with multithreaded MKL DFT functions

jul_bod
Beginner
1,394 Views
Hi everybody,

In the CFD code I am writing at the moment, I want to use the MKL DFT functions from multiple threads. To improve performance I don't want to use MKL's internal multithreading; instead I simply divide an array of N lines into n blocks for n processors, so each processor performs N/n FFTs (one per line) on its subarray, rather than one processor performing all N. I am working on multicore architectures (dual cores and dual dual-cores).

Unfortunately the resulting speed up is really poor (~3 times faster with 4 processors) and I really need good scalability. Using VTune, it turns out that a function called "mkl_serv_lock" is called a lot in the multithreaded case. From what I have read in this forum and on Google, I am wondering if this is due to "false sharing": the big array has to be shared, since FFTs have to be done in multiple directions. Do you think I am right? Is there a solution to get an almost perfect speed-up ratio?
10 Replies
TimP
Honored Contributor III

In many CFD applications, a speed up of 3 on 4 threads would be considered excellent. If you actually had false sharing, your results would likely be much worse. As long as each thread is storing data at least 2 cache lines away from data in use by other threads, you don't actually have false sharing. You could be in a less severe but analogous situation if a thread writes into, and thereby invalidates, a cache line which will soon be needed in another cache.

The most severe such situation, close to actual false sharing, would be where hardware prefetch is hindered by another thread writing into a cache line and invalidating it after it has been prefetched. Avoiding that might require threads to stay 4 or more cache lines apart, so that hardware prefetch remains an advantage rather than a hindrance.

Are you using KMP_AFFINITY (for Intel OpenMP; for GNU OpenMP there is GOMP_CPU_AFFINITY)? If you don't use some method to keep your threads pinned to the right cores, you can easily lose 10% performance. I think the affinity schemes are more effective on Linux than on Windows; if you can use a Linux release such as EL4.4, CentOS 4.4 or newer, it may be worthwhile given the goals you have set.
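For example, a typical way to set the affinity variables Tim mentions (the specific values and the binary name are illustrative, not from the original post):

```shell
# Intel OpenMP runtime: pin threads to consecutive cores
export KMP_AFFINITY=granularity=fine,compact,verbose
# GNU OpenMP (libgomp): the corresponding variable takes an explicit core list
export GOMP_CPU_AFFINITY="0 1 2 3"
export OMP_NUM_THREADS=4
./my_fft_test   # hypothetical test binary
```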

With VTune, you could collect bus events and cache misses to see if those are implicated in limiting your scaling. Intel Thread Profiler might also shed light on it. If you have balanced the work among the threads and applied affinity, but still see time spent waiting for the laggard thread to complete, it could be that the slower threads are spending extra time moving data which is dirty in one cache when it is required in another. In the more intractable cases, OpenMP dynamic scheduling with a tuned chunk size may help overcome such delays, but you might have no chance of reaching your goal in such a case.

jul_bod
Beginner
Thanks a lot for your rapid answer.

I made new tests today on a 4 dual-core Opteron machine. I set the affinity mask and indeed it improves performance a bit. Unfortunately there is no speed up any more when using more than 4 processors. As I'm doing DNS computations, scalability is critical, since the grids are huge and so is the computational time (I need at least a linear speed-up ratio).
That's why I want to use OpenMP in an "MPI-like" mode (n threads from the beginning to the end).

From VTune I sampled some L2 cache lock events. It appears that there are far more (100-1000 times) events in the multithreaded case. Those locked L2 accesses are always in the "exclusive" state.

I am not a specialist at all in computer architecture and I don't really know how to interpret this. Can you help me?

Thanks a lot!

PS: my code is really simple:
- a 128^3 grid;
- 128^2/n_thread FFTs in one direction (length 128) for each thread.



Intel_C_Intel
Employee

Your use of the MKL FFT is fine. When you set up the calls to the FFT functions, what number of threads do you use? Since you are threading yourself and you don't want MKL to thread the FFTs (which should, in general, give you more performance), the number of threads should be set to 1. If that is the case and you are still getting the locks, we would like to know about it.
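For reference, a minimal way to cap MKL's internal threading as Bruce suggests, using MKL's documented environment variable (or the equivalent service routine):

```shell
# Keep MKL serial inside each user thread; the user code does the threading
export MKL_NUM_THREADS=1
# equivalently, call mkl_set_num_threads(1) in the code before the first FFT
```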

Bruce

jul_bod
Beginner

Hi,


I think I have finally tried all possible methods to thread my problem. Unfortunately, even in the simplest case, the speed up is not good enough: I only reach a speed-up ratio of 5.5 on 8 processors with relatively big cases. That might not look bad, except when your simulation could run for a month on 32 processors.

I'm still seeing some locks when looking with VTune, but I can't say whether the number of occurrences is significant or not.

I invite anyone who wants to help to look at the source code of my simple test, the script and the compilation options I was using at the following link.

Thanks a lot!
Intel_C_Intel
Employee

In general, 5.5/8 on FFTs is quite good. Memory reuse is problematic for the FFT, since there are O(n log n) computations on n data points. Reuse is thus on the order of log(n), which is quite small. Compare that to matrix multiply, where there are O(n^3) operations on O(n^2) data references (the amount of data moved into cache), for a reuse factor of n.

Bruce

jul_bod
Beginner
Maybe I am wrong, but I think you are talking about scaling a single FFT. I am just trying to scale multiple FFTs, i.e. executing n independent jobs on n processors over a shared array, and I can't see why the scaling is not almost perfect.

Could it be due to the Opteron architecture?
Intel_C_Intel
Employee

I understand your environment, but my statement still holds. Since the number of operations is ~5*n*log2(n), the reuse of data is small, which limits the amount of speedup you can get. Running independent FFTs on each thread eliminates the issues of cooperative parallelization of the code, but does nothing to improve the ratio of arithmetic operations per data point.

Bruce

jul_bod
Beginner
OK, I think you're right. I found the NASA NPB benchmarks on the internet, and indeed FT, which is almost exactly my problem, does not scale well due to memory bandwidth limitations.
So if I understand correctly, there are too many memory accesses in this problem because of the small amount of data reuse, as you said.
The problem is then closed... Thanks anyway for helping me.
randoob1
Beginner

When reuse of data is large, how do multiple threads improve the ratio of arithmetic operations per data point? Do the threads somehow share the data in cache?

MDavi
Beginner
Another question:
According to the MKL manual, FFTs can be used in 3 ways:
1. You do not create threads and let the FFT do it.
2. You create threads yourself, but the number of threads using each FFT descriptor is one.
3. You want one FFT descriptor to be used by several user threads.
In the latter case you need to set DFTI_NUMBER_OF_USER_THREADS to the actual number of threads the FFT will be used with.
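For context, case 3 looks roughly like the following, assuming the standard DFTI C interface (`make_shared_descriptor` is a hypothetical wrapper name; the key point is that DFTI_NUMBER_OF_USER_THREADS must be set before committing the descriptor):

```c
#include "mkl_dfti.h"

/* Sketch: configure one 1-D complex descriptor for use by several user
 * threads. nthreads must match the number of threads that will actually
 * call DftiComputeForward/Backward on this descriptor. */
DFTI_DESCRIPTOR_HANDLE make_shared_descriptor(MKL_LONG length, MKL_LONG nthreads)
{
    DFTI_DESCRIPTOR_HANDLE desc = NULL;
    MKL_LONG status;
    status = DftiCreateDescriptor(&desc, DFTI_DOUBLE, DFTI_COMPLEX, 1, length);
    if (status != DFTI_NO_ERROR) return NULL;
    /* Must precede DftiCommitDescriptor */
    status = DftiSetValue(desc, DFTI_NUMBER_OF_USER_THREADS, nthreads);
    if (status != DFTI_NO_ERROR) { DftiFreeDescriptor(&desc); return NULL; }
    status = DftiCommitDescriptor(desc);
    if (status != DFTI_NO_ERROR) { DftiFreeDescriptor(&desc); return NULL; }
    return desc;
}
```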

However, my module is a plug-in for a 3rd-party host application, which creates the threads for me and calls my FFT function. I have no way of knowing in advance how many threads this host app will run my FFT on. Trying to set this attribute to an arbitrary number like 2 or 8 crashes MKL on an 8-core machine when the application quits.

If I set the number of user threads to 1, which is the default, performance degrades severely.

Any idea?

Thanks,

Itai