I'm trying to parallelize the computation of multiple FFTs by using OpenMP to divide the vectors among threads. I used the link advisor to link with the sequential library and all the other libraries; this is what I read I should do when parallelizing descriptor creation and FFT computation at this link. The performance I am seeing is not what I expected: with two threads I see a speedup of 1.83, and with four threads a speedup of 2.6.
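For reference, here is a minimal sketch of the pattern described: sequential MKL linked in, with OpenMP supplying the parallelism and each thread creating and using its own DFTI descriptor. My code is Fortran, but the C DFTI interface is analogous; the vector count of 4 is just a placeholder.

```c
#include <stdio.h>
#include <omp.h>
#include <mkl_dfti.h>   /* MKL DFTI C interface */

int main(void)
{
    const MKL_LONG n = 614000;  /* transform length from my tests */
    const int nvec = 4;         /* one vector per thread (placeholder) */

    /* One buffer per vector, interleaved double-complex data. */
    MKL_Complex16 *data[4];
    for (int i = 0; i < nvec; ++i)
        data[i] = (MKL_Complex16 *)mkl_malloc(n * sizeof(MKL_Complex16), 64);

    /* Each thread creates, commits, uses and frees its OWN descriptor.
       With mkl_sequential linked, each FFT runs single-threaded and
       OpenMP provides the parallelism across vectors. */
    #pragma omp parallel for
    for (int i = 0; i < nvec; ++i) {
        DFTI_DESCRIPTOR_HANDLE h = NULL;
        DftiCreateDescriptor(&h, DFTI_DOUBLE, DFTI_COMPLEX, 1, n);
        DftiCommitDescriptor(h);
        DftiComputeForward(h, data[i]);   /* in-place forward FFT */
        DftiFreeDescriptor(&h);
    }

    for (int i = 0; i < nvec; ++i)
        mkl_free(data[i]);
    return 0;
}
```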
I also inserted some code to time different parts of the program, including code to compute the average time for one FFT. Using four threads actually causes the per-FFT computation time to increase on average. With one thread on a vector size of 614000, the average time to compute the transform is 0.647 seconds. With 2 threads it is not too bad at 0.699 seconds, but with 4 threads the average time to compute one transform is 0.983 seconds.
I'm using the PGI Fortran compiler, and these are the libraries that I link with: mkl_intel_ilp64.lib mkl_sequential.lib mkl_core.lib and mkl_solver_ilp64_sequential.lib
I don't know whether it has something to do with the libraries I'm linking, or whether this speedup is normal and I'm being naive. I am a beginner, so I would appreciate any help I can get. I can provide the timing info and some code snippets if needed.
One explanation could be that 4 (?) vectors of 614000 points plus internal MKL buffers don't fit in the last-level cache on your system, so the MKL FFTs have to access main memory, which is much slower than accessing the cache.
If that is the case, you may see better speedups with shorter vectors.
Since the size of 16 vectors of 614000 points considerably exceeds the size of the last-level cache, the MKL performance that you see is affected by the interconnect between the CPU and the main memory.
The Xeon E5530 CPU can use 2 QPI links to the main memory.
When there are only 2 threads, each of them uses its own QPI link, so both computation and memory access are sped up ~2x compared to the sequential case.
When there are 4 threads, each link is shared by 2 threads: only computation is sped up ~2x compared to the case with 2 threads, while memory access takes the same time as with 2 threads.
This is why the speedup for 4 threads is less than one may expect for your FFTs.
So at first the speedup is close to 4, but then it starts to decrease, which I figured is because of cache misses. However, there is a big drop in speedup at 1M and 2M samples, and then the speedup increases again. I worked out the amount of memory the program uses for each data size: at 1M data samples total I'm using 32 MB of memory, and 64 MB at 2M samples. I was wondering why the speedup is so bad at those two particular data set sizes. At 512K samples I am using 16 MB, so at that point I would already be out of the cache -- but why is there such a decrease at 1 and 2 million, and why does the speedup increase again with larger data sets?
I used a performance profiler to see about cache misses and such, but the profiler does not show a dramatic increase in cache misses for those two data sizes.
Thank you for taking the time to answer my questions. I am a beginner at this topic so any help is greatly appreciated.
The rest of your data just seems to correlate with the number of floating point operations per point performed by MKL.
In particular, the jump from 1.6x at 2M points to 2.4x at 4M points corresponds to the fact that at 4M points the lengths of your DFTs start to be divisible by 16 -- I take your M as 1000000.
The number of cache misses is rather the consequence of piping more and more data through the cache -- it isn't the root cause of the behavior that you see.