Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

openmp FFT performance

hwilliams11
Beginner

Hi,

I'm trying to parallelize the computation of multiple FFTs by using OpenMP to divide the vectors among threads. I used the link advisor to link with the sequential library and all other libraries, which is what I read I should do when parallelizing descriptor creation and FFT computation at this link. The performance I am seeing is not what I expected: with two threads I see a speedup of 1.83, and with 4 threads a speedup of 2.6.

I also inserted some code to time different parts of the program, including the average time to compute one FFT. Using four threads actually causes that per-FFT time to increase. With one thread on a vector size of 614000, the average time to compute the transform is 0.647 seconds. With 2 threads, it is not too bad at 0.699 seconds. But with 4 threads, the average time to compute one transform is 0.983 seconds.
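The per-FFT times and the overall speedups quoted above can be checked against each other. The following is a minimal sketch, assuming each thread works on one FFT at a time and that the quoted averages are per-thread wall-clock times; the function names are illustrative and not part of MKL or the poster's program.

```python
# Average time to compute one FFT (seconds), as reported in the post,
# keyed by the number of OpenMP threads.
avg_fft_time = {1: 0.647, 2: 0.699, 4: 0.983}

def implied_speedup(times, threads):
    """Aggregate FFT throughput with `threads` threads relative to one thread.

    Assumption: every thread computes one FFT at a time, so the aggregate
    rate is threads / (time per FFT).
    """
    baseline_rate = 1.0 / times[1]             # FFTs per second, one thread
    parallel_rate = threads / times[threads]   # FFTs per second, all threads
    return parallel_rate / baseline_rate

def parallel_efficiency(times, threads):
    """Speedup divided by thread count (1.0 would be perfect scaling)."""
    return implied_speedup(times, threads) / threads

for n in sorted(avg_fft_time):
    print(n, round(implied_speedup(avg_fft_time, n), 2),
          round(parallel_efficiency(avg_fft_time, n), 2))
```

Under these assumptions the per-FFT times imply speedups of roughly 1.85 and 2.63, which is close to the measured 1.83 and 2.6. A slower individual FFT with 4 threads is therefore consistent with the overall speedup, not a separate problem.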

I'm using the PGI Fortran compiler, and these are the libraries that I link with: mkl_intel_ilp64.lib mkl_sequential.lib mkl_core.lib and mkl_solver_ilp64_sequential.lib

I don't know if it has something to do with the libraries that I'm using or what, or maybe this speedup is normal and I'm being naive. I don't really know. I am a beginner, so I would appreciate any help that I could get. I could provide the timing info and some code snippets if needed.

Thanks!

5 Replies
Evgueni_P_Intel
Employee
There are many unknowns in your post: the CPU and number of CPUs in your system, the total number of vectors, precision, domain, input/output placement, and the MKL version (some update to 10.2?).
One explanation could be that 4 (?) vectors of 614000 points plus internal MKL buffers don't fit in the last-level cache on your system, so the MKL FFTs have to access main memory, which is much slower than accessing the cache.
If that is the case, you may see better speedups with shorter vectors.
hwilliams11
Beginner
Sorry. I'm using a quad-core Intel Xeon E5530 at 2.39 GHz. I'm running 16 vectors of 614000 points across 1, 2, and 4 threads. I'm using double precision, and these are out-of-place transforms. The MKL version is 10.2 update 1.
Evgueni_P_Intel
Employee

Since the size of 16 vectors of 614000 points considerably exceeds the size of the last-level cache, the MKL performance that you see is affected by the interconnect between the CPU and main memory.
The Xeon E5530 CPU can use 2 QPI links to the main memory.

When there are only 2 threads, each of them uses its own QPI link.
So both computation and memory access are sped up ~2x compared to the sequential case.

When there are 4 threads, each link is shared by 2 threads, so only computation is sped up ~2x compared to the case with 2 threads, while memory access takes the same time as with 2 threads.
This is why the speedup for 4 threads is less than one may expect for your FFTs.
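The explanation above can be turned into a toy timing model. This is only a sketch under strong assumptions: run time is split into a compute part that scales with the thread count and a memory part that scales with the number of links in use (at most 2), with no overlap between the two. The function names and the fitting step are illustrative, not something MKL provides.

```python
def modeled_time(threads, compute_fraction, memory_links=2):
    """Normalized run time (1.0 for one thread) under the toy model.

    Compute time shrinks with the number of threads; memory time shrinks
    only with the number of links actually used, capped at `memory_links`.
    """
    memory_fraction = 1.0 - compute_fraction
    return (compute_fraction / threads
            + memory_fraction / min(threads, memory_links))

def modeled_speedup(threads, compute_fraction, memory_links=2):
    return 1.0 / modeled_time(threads, compute_fraction, memory_links)

def compute_fraction_from_4_thread_speedup(speedup_4):
    """Solve 1/S = c/4 + (1 - c)/2 for c (4 threads, 2 links)."""
    return 2.0 - 4.0 / speedup_4

# Fit the model to the 4-thread speedup of 2.6 reported in the first post.
c = compute_fraction_from_4_thread_speedup(2.6)
print(round(c, 3), round(modeled_speedup(2, c), 2), round(modeled_speedup(4, c), 2))
```

With the measured 4-thread speedup of 2.6, this model attributes a little under half of the single-thread time to computation. It also predicts a 2-thread speedup of exactly 2, whereas 1.83 was measured, so it should be read as a rough illustration of the argument and not as a fit to the hardware.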

hwilliams11
Beginner
Thanks for that info. I meant to say that I'm using 16 vectors of 61400 points. But I did understand your point that the number of points plays a big role. I've tried the program with different data sizes, from 64,000 total points up to 32 million total points, to see if my results were consistent. Below I've pasted the speedup when using 4 threads.

Size    Speedup-4
64K     3.7091546
128K    3.6893184
256K    3.4644231
512K    3.1200787
1M      1.5317248
2M      1.6484851
4M      2.2831603
8M      2.2191637
16M     2.3427168
32M     2.2812101



So at first the speedup is close to 4, but then it starts to decrease. I figured this is because of cache misses. But there is a big decrease in speedup at 1M and 2M, and then the speedup increases again. I worked out the amount of memory that the program uses for each data size: at 1M data samples total, I'm using 32MB of memory, and 64MB at 2M samples. I was just wondering why the speedup is so bad at those two particular data set sizes. At 512K samples I am using 16MB of memory, so at that point I would already be out of the cache. Why is there such a decrease at 1 and 2 million, and why does the speedup increase again with larger datasets?
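The memory figures quoted above work out to 32 bytes per sample, which would match a double-precision complex input plus a separate output array for an out-of-place transform. The sketch below tabulates the footprint for every size in the table under that inferred assumption. It also assumes K = 1000 and M = 1000000, assumes an 8 MB last-level cache for the E5530, and ignores any internal MKL buffers.

```python
# Assumption: 32 bytes per sample, inferred from "32MB at 1M samples"
# (e.g. 16-byte complex double input + 16-byte complex double output).
BYTES_PER_SAMPLE = 32
# Assumption: 8 MB last-level cache on the Xeon E5530.
LAST_LEVEL_CACHE_BYTES = 8 * 1024 * 1024

# Total sample counts from the table, taking K = 1000 and M = 1000000.
total_samples = {
    "64K": 64_000, "128K": 128_000, "256K": 256_000, "512K": 512_000,
    "1M": 1_000_000, "2M": 2_000_000, "4M": 4_000_000, "8M": 8_000_000,
    "16M": 16_000_000, "32M": 32_000_000,
}

def footprint_bytes(label):
    """Bytes of input plus output data for the given table row."""
    return total_samples[label] * BYTES_PER_SAMPLE

def fits_in_cache(label):
    return footprint_bytes(label) <= LAST_LEVEL_CACHE_BYTES

for label in total_samples:
    mb = footprint_bytes(label) / 1_000_000
    print(f"{label:>4}  {mb:8.1f} MB  fits in cache: {fits_in_cache(label)}")
```

Under these assumptions the data stops fitting between 256K and 512K total samples, so the cache boundary alone does not single out the 1M and 2M rows.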

I used a performance profiler to see about cache misses and such, but the profiler does not show a dramatic increase in cache misses for those two data sizes.

Size    Cache Miss
64K     0.000
128K    0.000
256K    0.000
512K    0.199
1M      0.330
2M      0.378
4M      0.439
8M      0.458
16M     0.428
32M     0.499


Thank you for taking the time to answer my questions. I am a beginner at this topic so any help is greatly appreciated.

Evgueni_P_Intel
Employee
The drop from 3.12x at 512K points to 1.5x at 1M points mainly corresponds to the cache boundary (8 MB for the E5530 CPU).

The rest of your data just seems to correlate with the number of floating point operations per point performed by MKL.

In particular, the jump from 1.6x at 2M points to 2.4x at 4M points corresponds to the fact that at 4M points the lengths of your DFTs start to be divisible by 16 (I take your M as 1000000).
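The divisibility argument can be checked directly. This is a small sketch, assuming the total point counts from the table are split evenly across 16 vectors and that K means 1000 as well as M meaning 1000000; the K = 1000 reading is an assumption not stated in the thread.

```python
NUM_VECTORS = 16  # from the thread: 16 vectors per run

# Total points per run, assuming K = 1000 and M = 1000000.
total_points = {
    "64K": 64_000, "128K": 128_000, "256K": 256_000, "512K": 512_000,
    "1M": 1_000_000, "2M": 2_000_000, "4M": 4_000_000, "8M": 8_000_000,
    "16M": 16_000_000, "32M": 32_000_000,
}

def dft_length(label):
    """Length of each individual DFT when the points are split evenly."""
    return total_points[label] // NUM_VECTORS

def length_divisible_by_16(label):
    return dft_length(label) % 16 == 0

for label in total_points:
    print(f"{label:>4}  length {dft_length(label):>9}  "
          f"divisible by 16: {length_divisible_by_16(label)}")
```

With these assumptions, the only rows whose DFT length is not divisible by 16 are 1M (62500) and 2M (125000), which are the two rows with the lowest speedup in the table.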

The number of cache misses is rather the consequence of piping more and more data through the cache -- it isn't the root cause of the behavior that you see.