Hello,
I am looking for ways to increase the performance of my 4096x4096 complex-to-complex DFT on MIC using the MKL DFT. On the following website, I found this attractive tip:
Tip 3: setting the leading dimension size for a multi-dimensional DFT can increase the DFT performance from 62.137836 GFLOPS to 103.776588 GFLOPS.
I was surprised by that performance increase, so I wanted to try padding the leading dimension. But when I run the example on my CPU, I do not get the higher performance; the result barely changes.
I would like to know whether I can improve my DFT performance with this tip, because I found that the complex-to-complex transform may not need the input and output strides.
Best Regards!
I have a few additions to the article.
- I would suggest trying an in-place transform: FFT is mostly a bandwidth-limited workload, so reducing the memory footprint may significantly improve performance.
- If you need a real-to-complex transform, use it instead of C2C, because it also reduces the amount of memory needed.
- You most likely know about optimized FFT radices.
- Depending on the task size, it may be beneficial to parallelize at user level or at the DFTI level:
- if tasks are quite big (4K x 4K is probably big enough), let DFTI use threads as it sees fit;
- if there are many small disjoint tasks, parallelize in your own code and use sequential DFTI.
- Try scatter thread pinning -- this usually increases memory throughput.
- Try `numactl -l` or `numactl -i all` on NUMA systems (2+ sockets); it usually stabilizes and improves performance.
By the way, how many threads, what IA hardware, and what MKL version do you use? I may try benchmarking the same problem internally and give you our performance expectation for DFTI.
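For the scatter-pinning and NUMA suggestions above, the knobs would look roughly like this (a sketch; `./your_fft_benchmark` is a placeholder binary name, and `KMP_AFFINITY` assumes the Intel OpenMP runtime):

```shell
# Intel OpenMP scatter pinning: spread threads across cores/sockets
export KMP_AFFINITY=scatter

# NUMA memory policies on a 2+ socket machine (placeholder binary name):
numactl -l ./your_fft_benchmark       # allocate memory on the local node
numactl -i all ./your_fft_benchmark   # interleave allocations across all nodes
```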
Hi,
Could you please post your micro-benchmark here so that we can take a look at your measurement method?
Fixing the leading dimension is essential when the stride from element [0][0] to element [1][0] is a multiple of, say, 2, 4, or 8 cache lines. In that case, when the algorithm performs the column FFT, cache reuse is poor because of cache associativity.
So if your task is a 4096x4096 single-precision complex-to-complex FFT, it should be beneficial to use a column stride of not 4096 but 4096 + (64/sizeof(single complex)) = 4104.
I use the following method to calculate the performance:
#define REP_TIMES 100
gettimeofday(&t1, NULL);
for (int i = 0; i < REP_TIMES; i++)
    status = DftiComputeForward(handle, x, out);
gettimeofday(&t2, NULL);
double timeinsec = t2.tv_sec - t1.tv_sec + (t2.tv_usec - t1.tv_usec) / 1000000.0;
double perf = (REP_TIMES * 2.5 * N[0] * N[1] * log2(1.0 * N[0] * N[1])) / timeinsec / 1e9;
Do you mean I just need to pad the data from A[4096][4096] to A1[4104][4104]? Will the other element locations be padded with 0?
It doesn't matter what data you put in the padded region -- DFTI will not dereference those values.
So the setup code should look as follows:
int rows = 4096, cols = 4096;
int row_stride = 4104, col_stride = 1;
int sizes[2] = {rows, cols};
int strides[3] = {0, row_stride, col_stride};
for (int r = 0; r < rows; ++r) {
    for (int c = 0; c < cols; ++c) {
        data[r*row_stride + c] = f(r, c);
    }
}
DftiCreateDescriptor(&h, DFTI_SINGLE, DFTI_COMPLEX, 2, sizes);
DftiSetValue(h, DFTI_INPUT_STRIDES, strides);
DftiSetValue(h, DFTI_OUTPUT_STRIDES, strides);
DftiCommitDescriptor(h);
timeit(DftiComputeForward(h, data));
Thank you for your help!
Your code was very helpful. I have used it and checked that the output data is correct!
I have learned that my complex-to-complex transform doesn't change my data layout, so the strides can be set more easily than for a real transform.
But I have a new question: compared with the DFT without padding, the padding increases the performance only a little. For example:
No padding, 100-iteration DFT:
FFT tv_sec: 4.544067 s
performance: 22.153014 GFLOPS
Padding, 100-iteration DFT:
FFT tv_sec: 4.297437 s
performance: 23.426413 GFLOPS
Part of the code follows. I do not count the time to copy from indata[4096][4096] to indata[4104][4096] or from outdata1[4104][4096] back to outdata[4096][4096]; if that time were counted, the performance would be even lower.
gettimeofday(&t1, NULL);
for (int i = 0; i < REP_TIMES; i++) {
    status = DftiComputeForward(handle, x, out);
}
gettimeofday(&t2, NULL);
timersub(&t2, &t1, &diff);
double timeinsec = (diff.tv_sec*1000 + diff.tv_usec/1000)/1000.0;
Can you give me some advice to improve the performance? I have tried all the methods mentioned above.
Best Regards!
Very good; this advice is very important to me, and I will consider it carefully.
The following are some parameters that I used in my MIC offload program:
performance: 22.857243 GFLOPS