I'm trying to parallelize the computation of multiple FFTs by using OpenMP to divide the vectors among threads. I used the link advisor to link with the sequential library and all the other libraries; this is what I read I should do when parallelizing descriptor creation and FFT computation at this link. The performance I am seeing is not what I expected: with two threads I see a speedup of 1.83, and with four threads a speedup of 2.6.
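For reference, here is a minimal sketch of the pattern described: sequential MKL linked in, with OpenMP supplying the parallelism and each thread creating and using its own DFTI descriptor. My code is Fortran, but the C DFTI interface is analogous; the vector count of 4 is just a placeholder.

```c
#include <stdio.h>
#include <omp.h>
#include <mkl_dfti.h>   /* MKL DFTI C interface */

int main(void)
{
    const MKL_LONG n = 614000;  /* transform length from my tests */
    const int nvec = 4;         /* one vector per thread (placeholder) */

    /* One buffer per vector, interleaved double-complex data. */
    MKL_Complex16 *data[4];
    for (int i = 0; i < nvec; ++i)
        data[i] = (MKL_Complex16 *)mkl_malloc(n * sizeof(MKL_Complex16), 64);

    /* Each thread creates, commits, uses and frees its OWN descriptor.
       With mkl_sequential linked, each FFT runs single-threaded and
       OpenMP provides the parallelism across vectors. */
    #pragma omp parallel for
    for (int i = 0; i < nvec; ++i) {
        DFTI_DESCRIPTOR_HANDLE h = NULL;
        DftiCreateDescriptor(&h, DFTI_DOUBLE, DFTI_COMPLEX, 1, n);
        DftiCommitDescriptor(h);
        DftiComputeForward(h, data[i]);   /* in-place forward FFT */
        DftiFreeDescriptor(&h);
    }

    for (int i = 0; i < nvec; ++i)
        mkl_free(data[i]);
    return 0;
}
```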
I also inserted some code to time different parts of the program, including code to compute the average time for one FFT. Using four threads actually causes the per-FFT computation time to increase on average. With one thread on a vector size of 614000, the average time to compute the transform is 0.647 seconds. With 2 threads it is not too bad at 0.699 seconds, but with 4 threads the average time to compute one transform is 0.983 seconds.
I'm using the PGI Fortran compiler, and these are the libraries that I link with: mkl_intel_ilp64.lib mkl_sequential.lib mkl_core.lib and mkl_solver_ilp64_sequential.lib
I don't know whether it has something to do with the libraries I'm linking, or whether this speedup is normal and I'm being naive. I am a beginner, so I would appreciate any help I can get. I can provide the timing info and some code snippets if needed.
One explanation could be that 4 (?) vectors of 614000 points plus internal MKL buffers don't fit in the last-level cache on your system, so the MKL FFTs have to access main memory, which is much slower than accessing the cache.
If that is the case, you may see better speedups with shorter vectors.
Since the size of 16 vectors of 614000 points considerably exceeds the size of the last-level cache, the MKL performance that you see is affected by the interconnect between the CPU and the main memory.
The Xeon E5530 CPU can use 2 QPI links to the main memory.
When there are only 2 threads, each of them uses its own QPI link, so both computation and memory access are sped up ~2x compared to the sequential case.
When there are 4 threads, each link is shared by 2 threads: only computation is sped up ~2x compared to the case with 2 threads, while memory access takes the same time as with 2 threads.
This is why the speedup for 4 threads is less than one may expect for your FFTs.
So at first the speedup is close to 4, but then it starts to decrease, which I figured is because of cache misses. However, there is a big drop in speedup at 1M and 2M samples, and then the speedup increases again. I worked out the amount of memory the program uses for each data size: at 1M data samples total I'm using 32 MB of memory, and 64 MB at 2M samples. I was wondering why the speedup is so bad at those two particular data set sizes. At 512K samples I am using 16 MB, so at that point I would already be out of the cache -- but why is there such a decrease at 1 and 2 million, and why does the speedup increase again with larger data sets?
I used a performance profiler to see about cache misses and such, but the profiler does not show a dramatic increase in cache misses for those two data sizes.
Thank you for taking the time to answer my questions. I am a beginner at this topic so any help is greatly appreciated.
The rest of your data just seems to correlate with the number of floating point operations per point performed by MKL.
In particular, the jump from 1.6x at 2M points to 2.4x at 4M points corresponds to the fact that at 4M points the lengths of your DFTs start to be divisible by 16 -- I take your M as 1000000.
The number of cache misses is rather the consequence of piping more and more data through the cache -- it isn't the root cause of the behavior that you see.