Solved: About the set of padding_leading_dim or input(output) strides for (complex to complex )to increase the performance DFT on MIC

王_子_ · ‎09-01-2016

Hello,

I want to find some methods to increase my 4096*4096 complex to complex DFT performance on MIC using the MKL-DFT. In the following website, I find the attract method:

Tip3: the leading dimension size for multiple –dimensional DFT can increase the DFT performance from 62.137836 GFLOPS to 103.776588 GFLOPS.

( https://software.intel.com/en-us/articles/tuning-the-intel-mkl-dft-functions-performance-on-intel-xeon-phi-coprocessors)

I am surprised at the increasing of the performance, so I want to try padding_leading_dim.But when I rum the example on my cpu ,I do not get the high performance, even the result doesn't changed a lot .

And I want to know that can I improve my DFT performance through the tip? Because I find the complex to complex may not need the input strides and output strides.

Best Regards!

Evarist_F_Intel · ‎09-02-2016

I have very few additions to the article.

I would suggest you try performing in-place transform, since FFT is mostly bandwidth-limited workload, hence reducing memory footprint may significantly improve the performance.
If you need Real-to-Complex transform then use it, instead of C2C, because it also reduce the amount of memory needed.
You most likely know about optimized FFT-radices
Depending on task size it may be beneficial to user-level parallelization or parallelization on DFTI level:
- if tasks are quite big (4Kx4K is probably big enough) then let DFTI use threads as it wants to do
- if there are a lot of small disjoint tasks then use parallelization in your code, using sequential DFTI
Try using scatter thread pinning -- usually this increase memory throughput
Try using `numactl -l` or `numactl -i all` in NUMA systems (2 sockets+). it usually stabilizes and improves performance.

By the way how many threads, what IA hardware and what MKL version do you use? I just may try benchmark the same problem internally and provide you our performance expectation from DFTI.

View solution in original post

Evarist_F_Intel · ‎09-01-2016

Hi,

Could you please post your micro-benchmark here so that we can take a look on your measurements method?

Fixing a leading dimension is essential when stride from element [0][0] to element [1][0] is multiple of say 2, 4, 8 cache-lines. In this case when the algorithm performs column FFT a cache re-usage would be poor b/o cache associativity.
So if your task is 4096x4096 Single Precision Complex-to-Complex FFT, it should be beneficial to have column stride no 4096, but 4096+(64/sizeof(Single Complex)) = 4104.

王_子_ · ‎09-01-2016

I use the sample methods to calculate the performance：

#define REP_TIMES 100

gettimeofday(&t1,NULL);
for(int i=0;i<REP_TIMES;i++)
	status = DftiComputeForward(handle,x,out);
gettimeofday(&t2,NULL);
   
timeuse = t2.tv_sec - t1.tv_sec + (t2.tv_usec - t1.tv_usec)/1000000.0;

double perf= (REP_TIMES*2.5*N[0]*N[1]*log2(1.0*N[0]*N[1]))/timeinsec/1e9;

Do you mean I just need to pad the data from A[4096][4096] to A1[4104][4104]? Other element location will be padded 0 ?

Evarist_F_Intel · ‎09-02-2016

It doesn't matter what data you pad the region -- DFTI would not de-reference those values.

So the setup-code should look like as follows:

int rows = 4096, cols = 4096;
int row_stride = 4104, col_stride = 1;
int sizes[2] = {rows, cols};
int strides[3] = {0, row_stride, col_stride};

for (int r = 0; r < rows; ++r) {
    for (int c = 0; c < cols; ++c) {
        data[r*row_stride + c] = f(r, c);
    }
}

DftiCreateDescriptor(&h, DFTI_SINGLE, DFTI_COMPLEX, 2, sizes);
DftiSetValue(h, DFTI_INPUT_STRIDES, strides);
DftiSetValue(h, DFTI_OUTPUT_STRIDES, strides);
DftiCommitDescriptor(h);

timeit(DftiComputeForward(h, data));

王_子_ · ‎09-02-2016

Thank you for your help!

Your code is very helpful to me. I have used your code and check the output data is correct!

I have learned that my Complex to Complex transform doesn't change my data layout, so the strides can be set easier than Real transform.

But I have a new question, comparing with no padding DFT, the padding increasing the performance a little .For example :

No padding 100 iteration DFT:

FFT tv_sec: 4.544067 s
performance: 22.153014 GFLOPS

Padding 100 iteration DFT:

FFT tv_sec: 4.297437 s
performance: 23.426413 GFLOPS

The part code is following: I don't calculate the copy time consume from indata[4096][4096] to indata[4104][4096] and the outdata1[4104][4096] to outdata[4096][4096].If the time is been calculated,

		gettimeofday(&t1,NULL);
		for(int i=0;i<REP_TIMES;i++)
		{
			status = DftiComputeForward(handle,x,out);

		}
		gettimeofday(&t2,NULL);
		timersub(&t2, &t1, &diff);
		double timeinsec  = (diff.tv_sec*1000+diff.tv_usec/1000)/1000.0;

Can you give me some advice to improve the performance ?I have been tried all methods mentioned by

https://software.intel.com/en-us/articles/tuning-the-intel-mkl-dft-functions-performance-on-intel-xeon-phi-coprocessors

Best Regards!

Evarist_F_Intel · ‎09-02-2016

I have very few additions to the article.

I would suggest you try performing in-place transform, since FFT is mostly bandwidth-limited workload, hence reducing memory footprint may significantly improve the performance.
If you need Real-to-Complex transform then use it, instead of C2C, because it also reduce the amount of memory needed.
You most likely know about optimized FFT-radices
Depending on task size it may be beneficial to user-level parallelization or parallelization on DFTI level:
- if tasks are quite big (4Kx4K is probably big enough) then let DFTI use threads as it wants to do
- if there are a lot of small disjoint tasks then use parallelization in your code, using sequential DFTI
Try using scatter thread pinning -- usually this increase memory throughput
Try using `numactl -l` or `numactl -i all` in NUMA systems (2 sockets+). it usually stabilizes and improves performance.

By the way how many threads, what IA hardware and what MKL version do you use? I just may try benchmark the same problem internally and provide you our performance expectation from DFTI.

王_子_ · ‎09-03-2016

Very good, these advice is very important for me , and I will consider these carefully.

The following is some parameters that I used in my MIC offload program:

MIC_ENV_PREFIX=MIC

MIC_USE_2MB_BUFFERS=64K

MKL_MIC_ENABLE=1

MIC_OMP_NUM_THREADS=200

MIC_KMP_AFFINITY=compact

My MKL version is 2017 Beta Release Notes, My mic card is 5110p,My CPU is Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz

And my performance is listed :

100 iteration, 4096*4096 C2C DFT :padded to 4104:

FFT tv_sec:  4.404143 s 

	performance: 22.857243 GFLOPS

I want to know the performance is good or not ?Because I need the MIC-offload FFT have the best performance.