Have you already tried the Apple code on Intel graphics? What inspired your question? Have you already seen better FFT performance on Gen via OpenCL than with IPP on the CPU, or is that just your expectation? If you have not tried it yet, could you do so and report your results here?
Thanks a lot for your feedback.
First, a few words about the motivation for the question:
We try to avoid extra graphics cards for our entry-level products, as they add product lifecycle and field maintenance costs. So a workstation with on-die graphics is attractive, all the more so as the graphics is only needed during system maintenance or debugging.
With the advent of the HD4000 and its OpenCL support, one can use the GPU as a coprocessor: I was able to balance the workload between a 2-core CPU and the HD4000 at a ratio of 1:3, i.e. I gained the performance equivalent of 6 additional cores for this case. Note that power consumption is not an issue at all in our case.
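The arithmetic behind that 1:3 figure can be sketched as follows. This is only an illustration: the function names are hypothetical, and it assumes that both devices finish their share in the same wall-clock time and that the work scales linearly with the number of CPU cores.

```python
def gpu_core_equivalent(cpu_cores, cpu_share, gpu_share):
    """CPU cores that would be needed to do the GPU's share in the same time.

    Assumes linear scaling across CPU cores (an idealization).
    """
    return cpu_cores * gpu_share / cpu_share


def split_work(n_items, cpu_share, gpu_share):
    """Split n_items between CPU and GPU according to the given ratio."""
    cpu_items = n_items * cpu_share // (cpu_share + gpu_share)
    return cpu_items, n_items - cpu_items


# 2 CPU cores take 1 part, the GPU takes 3 parts of the work.
print(gpu_core_equivalent(2, 1, 3))
print(split_work(1000, 1, 3))
```

With a 2-core CPU and a 1:3 split, the GPU's share corresponds to 2 * 3 = 6 cores, which matches the figure quoted above.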
Insight: either use an 8-core CPU with a simple internal GPU, or use a smaller multi-core CPU with a strong internal GPU. The decision is driven by the Intel libraries available for those dies, by the price of each alternative, and by the roadmap of your products.
Actually, there is a second way to calculate my algorithm, using FFTs. Buying this basic tool for Intel's GPU from a third party, or building it on my own, creates extra effort, costs, and dependencies in the software processes and lifecycle management.
A test with the code from Apple is upcoming. Will it show the same ratio for my case?
Best regards, Stephan.
We are working on an FFT for the GPU (Gen), but currently our single-threaded CPU version (IPP, 2D) is faster than the GPU one up to 2D FFT order 10, the maximum that can fit into one GPU surface. According to your message above you have managed to get 1:3, so which FFT implementation do you use on the CPU? Is it fast enough and optimized? Have you tried the IPP one?
The 1:3 ratio comes from an algorithm that is texture-based but does not use FFTs. It interpolates many samples to write one output.
The second approach to the calculation, which uses at least 2k complex 1D FFTs of length 2k each, has not been tested on the GPU yet. The algorithm's result itself still has issues, and I have to solve those first.
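To make the shape of that workload concrete, here is a minimal reference sketch of a batch of independent 1D complex FFTs. It is a plain recursive radix-2 Cooley-Tukey transform written for readability, not the IPP routine or any in-house library, and the names `fft` and `batched_fft` are hypothetical. A small batch stands in for the 2k x 2k case.

```python
import cmath


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])  # transform of even-indexed samples
    odd = fft(x[1::2])   # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def batched_fft(rows):
    """Apply the 1D FFT to every row; the rows are independent of each other."""
    return [fft(row) for row in rows]


# Small stand-in batch: 4 rows of length 8 instead of 2k rows of length 2k.
batch = [[complex(r + c, 0) for c in range(8)] for r in range(4)]
spectra = batched_fft(batch)
print(len(spectra), len(spectra[0]))
```

Because the rows do not depend on each other, such a batch can be distributed across CPU cores or GPU work-groups row by row.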
From my understanding, an FFT is balanced between read and write effort, and to reach full performance the data has to be cached completely in L1/L2.
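A rough back-of-the-envelope check of that caching requirement is sketched below. The cache sizes are assumptions (32 KiB L1 data and 256 KiB L2 per core, typical for CPUs of the HD4000 generation), the helper names are hypothetical, and twiddle-factor tables and out-of-place buffers are ignored.

```python
L1D_BYTES = 32 * 1024   # assumed per-core L1 data cache size
L2_BYTES = 256 * 1024   # assumed per-core L2 cache size


def fft_bytes(n_points, bytes_per_component=4):
    """Size of one complex buffer: real and imaginary part per point."""
    return n_points * 2 * bytes_per_component


def smallest_fitting_level(n_bytes):
    """Name the smallest assumed cache level that holds n_bytes."""
    if n_bytes <= L1D_BYTES:
        return "L1"
    if n_bytes <= L2_BYTES:
        return "L2"
    return "beyond L2"


one_row = fft_bytes(2048)             # one 2k-point single-precision FFT
whole_set = fft_bytes(2048 * 2048)    # all 2k rows of 2k points

print(one_row, smallest_fitting_level(one_row))
print(whole_set, smallest_fitting_level(whole_set))
```

Under these assumptions a single 2k-point single-precision transform (16 KiB) stays in L1, while the complete 2k x 2k data set (32 MiB) does not fit in L2, so the rows have to be streamed through the cache one after another.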
MATLAB utilizes IPP, and we have our own FFT library, too. Both perform very well.
Hence the 1:3 ratio does not hold for all cases, but just for that special algorithm using textures, because the 2D locality in a texture creates a lot of cache hits and data reuse, and hence reduces transfers and latencies.
From your measurements and shared experience I conclude that, given the current state of the art, I should focus on a multi-core CPU for the FFT-based approach. This is already helpful advice for me.
Best regards, Stephan