I am working on a 3D numerical integrator for a non-linear PDE using the parallel FFT library included in MKL.
My arrays consist of 2^30 data points, which is far larger than the cache. As a result, ~50% of cache references are misses, so a large share of the execution time is spent purely on memory access.
Is there a clever way to deal with this? Is a 50% cache-miss rate expected for an array this large?
Any help would be much appreciated.
Hi Dilan B.,
A cache-miss rate of 50% is OK for large out-of-place FFTs. Did you try in-place 3D transforms?
For most data points of large 3D transforms, the miss-hit pattern is MHHHMH for in-place transforms and MMHHMH for out-of-place transforms, that is, cache-miss rates of 33% and 50%, respectively. Though real figures may be higher, switching to in-place transforms may improve performance.
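The 33% and 50% figures follow directly from counting the misses in each pattern. A minimal sketch of that arithmetic, where the helper name `miss_rate` is made up for illustration and is not part of MKL:

```python
def miss_rate(pattern: str) -> float:
    """Return the fraction of accesses marked 'M' (miss) in a string of 'M'/'H' marks."""
    if not pattern or set(pattern) - {"M", "H"}:
        raise ValueError("pattern must be a non-empty string of 'M' and 'H'")
    return pattern.count("M") / len(pattern)


# Patterns quoted above: one mark per access to a data point.
in_place = miss_rate("MHHHMH")      # 2 misses out of 6 accesses
out_of_place = miss_rate("MMHHMH")  # 3 misses out of 6 accesses

print(f"in-place:     {in_place:.0%}")
print(f"out-of-place: {out_of_place:.0%}")
```

These are idealized per-point rates; a profiler counts every cache reference in the process, so measured values can differ.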
Thanks for the prompt reply.
I tried in-place transforms, and they improved the cache-miss rate by approximately 5% compared to out-of-place transforms. I still find the performance underwhelming compared to a solver using FFTW3 that I wrote in the past, and I am completely stumped on how, or whether, I can increase performance further.
I have also noticed that certain runs can have a cache-miss rate as high as 65% without any change to the parameters in my source code.
Thanks for your help,
FFT performance may depend on the layout of the dataset in the memory, threading runtime settings, etc.
To speed up the investigation, please post a reproducer here or send it privately.
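While preparing a reproducer, it may help to fix the threading runtime settings mentioned above so that runs are comparable. The sketch below builds an environment with a fixed thread count and thread pinning. It assumes the solver is linked against the Intel OpenMP runtime; the helper name `threading_env`, the thread count, and the `./solver` binary are hypothetical, and the affinity string is only one possible choice.

```python
import os


def threading_env(num_threads: int, affinity: str = "granularity=fine,compact") -> dict:
    """Copy the current environment and pin down MKL/OpenMP threading settings."""
    if num_threads < 1:
        raise ValueError("num_threads must be at least 1")
    env = dict(os.environ)
    env["MKL_NUM_THREADS"] = str(num_threads)  # thread count used by MKL
    env["OMP_NUM_THREADS"] = str(num_threads)  # thread count for OpenMP regions
    env["MKL_DYNAMIC"] = "FALSE"               # keep MKL from lowering the thread count
    env["KMP_AFFINITY"] = affinity             # Intel OpenMP thread pinning (assumed runtime)
    return env


env = threading_env(8)  # 8 is an arbitrary example value
# Hypothetical launch of the solver with these settings:
# import subprocess; subprocess.run(["./solver"], env=env, check=True)
```

If the cache-miss rate still varies between runs with these settings fixed, that narrows the cause to something other than thread count or placement, such as memory layout.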