Community
cancel
Showing results for 
Search instead for 
Did you mean: 
王_子_
Beginner
108 Views

About the set of padding_leading_dim or input(output) strides for (complex to complex )to increase the performance DFT on MIC

Jump to solution

Hello,

I want to find some methods to increase my 4096*4096 complex to complex DFT performance on MIC using the MKL-DFT. In the following website, I find the attract method:

Tip3: the leading dimension size for multiple –dimensional DFT  can increase the DFT performance from 62.137836 GFLOPS to 103.776588 GFLOPS.

( https://software.intel.com/en-us/articles/tuning-the-intel-mkl-dft-functions-performance-on-intel-xe...

I am surprised at the increasing of the performance, so I want to try padding_leading_dim.But when I rum the example on my cpu ,I do not get the high performance, even the result doesn't changed a lot .

And I want to know that can I improve my DFT performance through the tip? Because I find the complex to complex may not need the input strides and output strides.

Best Regards!

0 Kudos
1 Solution
Evarist_F_Intel
Employee
108 Views

I have very few additions to the article.

  • I would suggest you try performing in-place transform, since FFT is mostly bandwidth-limited workload, hence reducing memory footprint may significantly improve the performance.
  • If you need Real-to-Complex transform then use it, instead of C2C, because it also reduce the amount of memory needed.
  • You most likely know about optimized FFT-radices
  • Depending on task size it may be beneficial to user-level parallelization or parallelization on DFTI level:
    • if tasks are quite big (4Kx4K is probably big enough) then let DFTI use threads as it wants to do
    • if there are a lot of small disjoint tasks then use parallelization in your code, using sequential DFTI
  • Try using scatter thread pinning -- usually this increase memory throughput
  • Try using `numactl -l` or `numactl -i all` in NUMA systems (2 sockets+). it usually stabilizes and improves performance.

By the way how many threads, what IA hardware and what MKL version do you use? I just may try benchmark the same problem internally and provide you our performance expectation from DFTI.

View solution in original post

6 Replies
Evarist_F_Intel
Employee
108 Views

Hi,

Could you please post your micro-benchmark here so that we can take a look on your measurements method?

Fixing a leading dimension is essential when stride from element [0][0] to element [1][0] is multiple of say 2, 4, 8 cache-lines. In this case when the algorithm performs column FFT a cache re-usage would be poor b/o cache associativity.
So if your task is 4096x4096 Single Precision Complex-to-Complex FFT, it should be beneficial to have column stride no 4096, but 4096+(64/sizeof(Single Complex)) = 4104.

王_子_
Beginner
108 Views

I use the sample methods to calculate  the performance:

#define REP_TIMES 100

gettimeofday(&t1,NULL);
for(int i=0;i<REP_TIMES;i++)
	status = DftiComputeForward(handle,x,out);
gettimeofday(&t2,NULL);
   
timeuse = t2.tv_sec - t1.tv_sec + (t2.tv_usec - t1.tv_usec)/1000000.0;

double perf= (REP_TIMES*2.5*N[0]*N[1]*log2(1.0*N[0]*N[1]))/timeinsec/1e9;

Do you mean I just need to pad the data from A[4096][4096] to A1[4104][4104]? Other element location will be padded 0 ?

Evarist_F_Intel
Employee
108 Views

It doesn't matter what data you pad the region -- DFTI would not de-reference those values.

So the setup-code should look like as follows:

int rows = 4096, cols = 4096;
int row_stride = 4104, col_stride = 1;
int sizes[2] = {rows, cols};
int strides[3] = {0, row_stride, col_stride};

for (int r = 0; r < rows; ++r) {
    for (int c = 0; c < cols; ++c) {
        data[r*row_stride + c] = f(r, c);
    }
}

DftiCreateDescriptor(&h, DFTI_SINGLE, DFTI_COMPLEX, 2, sizes);
DftiSetValue(h, DFTI_INPUT_STRIDES, strides);
DftiSetValue(h, DFTI_OUTPUT_STRIDES, strides);
DftiCommitDescriptor(h);

timeit(DftiComputeForward(h, data));


 

王_子_
Beginner
108 Views

Thank you for your help!

Your code is very helpful to me. I have used your code and check the output data is correct!

I have learned that my Complex to Complex transform doesn't change my data layout, so the strides can be set easier than Real transform.

But I have a new question, comparing with no padding DFT, the padding increasing the performance a little .For example :

No padding 100 iteration DFT:

FFT tv_sec:  4.544067 s 
performance: 22.153014 GFLOPS

Padding 100  iteration DFT: 

FFT tv_sec:  4.297437 s 
performance: 23.426413 GFLOPS

The part code is following: I don't calculate the copy time consume from indata[4096][4096] to indata[4104][4096]  and the outdata1[4104][4096] to outdata[4096][4096].If the time is been calculated,

		gettimeofday(&t1,NULL);
		for(int i=0;i<REP_TIMES;i++)
		{
			status = DftiComputeForward(handle,x,out);

		}
		gettimeofday(&t2,NULL);
		timersub(&t2, &t1, &diff);
		double timeinsec  = (diff.tv_sec*1000+diff.tv_usec/1000)/1000.0;

Can you give me some advice to improve the performance ?I have been tried all methods mentioned by 

https://software.intel.com/en-us/articles/tuning-the-intel-mkl-dft-functions-performance-on-intel-xe...

Best Regards!

Evarist_F_Intel
Employee
109 Views

I have very few additions to the article.

  • I would suggest you try performing in-place transform, since FFT is mostly bandwidth-limited workload, hence reducing memory footprint may significantly improve the performance.
  • If you need Real-to-Complex transform then use it, instead of C2C, because it also reduce the amount of memory needed.
  • You most likely know about optimized FFT-radices
  • Depending on task size it may be beneficial to user-level parallelization or parallelization on DFTI level:
    • if tasks are quite big (4Kx4K is probably big enough) then let DFTI use threads as it wants to do
    • if there are a lot of small disjoint tasks then use parallelization in your code, using sequential DFTI
  • Try using scatter thread pinning -- usually this increase memory throughput
  • Try using `numactl -l` or `numactl -i all` in NUMA systems (2 sockets+). it usually stabilizes and improves performance.

By the way how many threads, what IA hardware and what MKL version do you use? I just may try benchmark the same problem internally and provide you our performance expectation from DFTI.

View solution in original post

王_子_
Beginner
108 Views

Very good, these advice is very important for me , and I will consider these carefully.

The following is some parameters that I used in my MIC offload program:

MIC_ENV_PREFIX=MIC
MIC_USE_2MB_BUFFERS=64K
MKL_MIC_ENABLE=1
MIC_OMP_NUM_THREADS=200
MIC_KMP_AFFINITY=compact
 
My MKL version is 2017 Beta Release Notes, My mic card is 5110p,My CPU is Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
And my performance is listed :
100 iteration, 4096*4096 C2C DFT :padded to 4104:
FFT tv_sec:  4.404143 s 
performance: 22.857243 GFLOPS
I want to know the performance is good or not ?Because I need the MIC-offload FFT have the best performance.
Reply