My code:
// -*- C++ -*-
#include <cmath>
#include <ctime>
#include <cstring>
#include <cstdio>
#include "mkl.h"

int main(int argc, char *argv[])
{
    MKL_LONG D[2] = {SIZE, SIZE};                 // 2D transform lengths
    MKL_LONG C = COUNT;                           // number of transforms per compute call
    // Row stride in elements, padded so each row starts on a 64-byte boundary.
    MKL_LONG ST[3] = {0, (D[1] * sizeof(double) + 63) / 64 * (64 / sizeof(double)), 1};
    MKL_LONG DI = D[0] * ST[1];                   // distance between consecutive transforms
    MKL_LONG SI = D[0] * ST[1];                   // elements per (padded) 2D array
    double SC = 1.0 / std::sqrt((double)SI);
    struct timespec BE, EN;

    // Split-complex storage: all real parts first, imaginary parts after them.
    double *const Efft_r = (double *)_mm_malloc(sizeof(double) * SI * C * 2, 64);
    memset(Efft_r, 0, sizeof(double) * SI * C * 2);
    double *const Efft_i = Efft_r + SI * C;
    Efft_r[0] = 1.0;

    clock_gettime(CLOCK_REALTIME, &BE);
    for (int i = 0; i < LOOP; ++i) {
        MKL_LONG status;
        DFTI_DESCRIPTOR_HANDLE hand;
        DftiCreateDescriptor(&hand, DFTI_DOUBLE, DFTI_COMPLEX, 2, D);
        DftiSetValue(hand, DFTI_INPUT_STRIDES, ST);
        DftiSetValue(hand, DFTI_OUTPUT_STRIDES, ST);
        DftiSetValue(hand, DFTI_NUMBER_OF_TRANSFORMS, C);
        DftiSetValue(hand, DFTI_INPUT_DISTANCE, DI);
        DftiSetValue(hand, DFTI_COMPLEX_STORAGE, DFTI_REAL_REAL);
        DftiSetValue(hand, DFTI_FORWARD_SCALE, SC);
        DftiSetValue(hand, DFTI_BACKWARD_SCALE, SC);
        DftiSetValue(hand, DFTI_THREAD_LIMIT, 1);
        DftiSetValue(hand, DFTI_NUMBER_OF_USER_THREADS, 1);
        DftiCommitDescriptor(hand);
        __assume_aligned(Efft_r, 64);
        __assume_aligned(Efft_i, 64);
        DftiComputeForward(hand, Efft_r, Efft_i);
        DftiFreeDescriptor(&hand);
    }
    clock_gettime(CLOCK_REALTIME, &EN);

    printf("DFTI_COMPLEX_STORAGE: DFTI_REAL_REAL\nLOOP: \t%d\nSIZE: \t%d X %d\n"
           "STRIDES:\t%d %d %d\nNUMBER: \t%d\nDISTANCE:\t%d\n\t\t\t\t%.9fs\n",
           LOOP, D[0], D[1], ST[0], ST[1], ST[2], C, DI,
           double(EN.tv_sec - BE.tv_sec) + double(EN.tv_nsec - BE.tv_nsec) / 1e9);

    _mm_free(Efft_r);
    return 0;
}
This code was compiled with icpc using the flags "-mkl -DSIZE=4096 -DLOOP=1 -DCOUNT=3".
When I run this program without MPI, the output is below:
$ ./a.out
DFTI_COMPLEX_STORAGE: DFTI_REAL_REAL
LOOP: 1
SIZE: 4096 X 4096
STRIDES: 0 4096 1
NUMBER: 3
DISTANCE: 16777216
0.322017125s
When I run the same program with MPI, the output is below:
$ mpirun -n 1 ./a.out
DFTI_COMPLEX_STORAGE: DFTI_REAL_REAL
LOOP: 1
SIZE: 4096 X 4096
STRIDES: 0 4096 1
NUMBER: 3
DISTANCE: 16777216
1.606980538s
The program without MPI runs much faster than with MPI. I have tried different values of SIZE, but the results are similar.
I do not understand why this happens. If I must use MPI, is there any way to keep MKL's speed?
We are investigating. We will get back to you.
Did you use https://software.intel.com/en-us/articles/intel-mkl-link-line-advisor to get the compiling and linking switches?
Hi YangDong,
I am afraid you are not using the cluster FFT compute and descriptor configuration functions, so the data will not be distributed across the MPI processes for the computation. For your implementation you would probably need 'DftiComputeForwardDM' and 'DftiSetValueDM'.
Another point: I am not sure whether your timing is safe under MPI. If every rank can update the timing variables, the time you print is not the main rank's time but the time of the whole run across all ranks. I recommend using the MPI timer (MPI_Wtime) to measure the elapsed time.
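A minimal sketch of that timing approach (the FFT work itself is elided; this only illustrates MPI_Wtime and is not the original code):

// Sketch only: time a region with MPI_Wtime on each rank.
#include <mpi.h>
#include <cstdio>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    double t0 = MPI_Wtime();          // wall-clock start on this rank
    /* ... descriptor setup and DftiComputeForward loop from the original code ... */
    double t1 = MPI_Wtime();          // wall-clock end on this rank
    std::printf("elapsed: %.9fs\n", t1 - t0);
    MPI_Finalize();
    return 0;
}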
Best regards,
Fiona
Jing X. (Intel) wrote:
Did you use https://software.intel.com/en-us/articles/intel-mkl-link-line-advisor to get the compiling and linking switches?
Thank you!
I have solved this problem. I added MPI_Init() and MPI_Finalize() to the code. After compiling the code with mpiicpc, the program now runs just as fast with MPI as it does without MPI.
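A rough sketch of that change, with the original FFT body elided (shown only to illustrate where the calls go; the program is then built with mpiicpc using the same flags as before):

// Sketch only: wrap the existing program in MPI_Init/MPI_Finalize.
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    /* ... original descriptor setup, compute loop, and timing ... */
    MPI_Finalize();
    return 0;
}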
Fiona Z. (Intel) wrote:
Hi YangDong,
I am afraid you are not using the cluster FFT compute and descriptor configuration functions, so the data will not be distributed across the MPI processes for the computation. For your implementation you would probably need 'DftiComputeForwardDM' and 'DftiSetValueDM'.
Another point: I am not sure whether your timing is safe under MPI. If every rank can update the timing variables, the time you print is not the main rank's time but the time of the whole run across all ranks. I recommend using the MPI timer (MPI_Wtime) to measure the elapsed time.
Best regards,
Fiona
Hi Fiona,
Thanks for the warning.
Each rank has its own private descriptor handle ('hand') and its own memory space. If every rank runs independently at the same time, is that thread safe?
Hi YongDong,
Your FFT code looks fine and performs a 2D complex-to-complex FFT on a single machine.
The program without MPI runs much faster because the MKL FFT is multi-threaded with OpenMP; when you use mpirun to invoke it, the OpenMP threads are ignored by default. To get the same performance as the program without MPI, you may try:
> export OMP_NUM_THREADS=xx (your number of physical cores)
> then mpirun -n 1 ./a.out
Here is what I ran:
[yhu5_new@hsw-ep01 FFT]$ export OMP_NUM_THREADS=36
[yhu5_new@hsw-ep01 FFT]$ ./a.out
DFTI_COMPLEX_STORAGE: DFTI_REAL_REAL
LOOP: 1
SIZE: 4096 X 4096
STRIDES: 0 4096 1
NUMBER: 3
DISTANCE: 16777216
0.141292946s
[yhu5_new@hsw-ep01 FFT]$ mpirun -n 1 ./a.out
DFTI_COMPLEX_STORAGE: DFTI_REAL_REAL
LOOP: 1
SIZE: 4096 X 4096
STRIDES: 0 4096 1
NUMBER: 3
DISTANCE: 16777216
0.145391448s
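If setting the environment variable per run is inconvenient, roughly the same request can be made from inside the program with mkl_set_num_threads; a small sketch (the value 36 simply matches the runs above and should be your physical core count):

// Sketch only: request the MKL thread count in code instead of via OMP_NUM_THREADS.
#include <cstdio>
#include "mkl.h"

int main()
{
    mkl_set_num_threads(36);          // e.g. the number of physical cores
    std::printf("MKL may use up to %d threads\n", mkl_get_max_threads());
    /* ... create, commit, and compute the DFTI descriptor as in the original code ... */
    return 0;
}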
As I understand it, your code is not an MPI program, so it does not actually need to be run with mpirun. For performance tips, you may refer to the MKL user guide or the article https://software.intel.com/en-us/articles/tuning-the-intel-mkl-dft-functions-performance-on-intel-xeon-phi-coprocessors (which is written for Xeon Phi, but the concepts are the same for other processors).
If you'd like to use MPI, then you need MPI programming and should call the cluster MKL FFT for large FFT sizes across multiple nodes. You may find the cluster FFT sample code under the MKL install folder in examples_cluster_c.tgz.
Unzip it and see cdftc/source/dm_complex_2d_double_ex1.c.
Please refer to the MKL user guide for more details: https://software.intel.com/en-us/mkl-macos-developer-guide-linking-with-intel-mkl-cluster-software
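For reference, a stripped-down sketch of that cluster FFT flow (error checking, data initialization, and the optional workspace setting are omitted; the dm_complex_2d_double_ex1.c example mentioned above is the complete, authoritative version):

// Sketch only: distributed in-place 2D complex FFT with the cluster (DM) API.
#include <mpi.h>
#include "mkl.h"
#include "mkl_cdft.h"

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    MKL_LONG len[2] = {4096, 4096};               // global transform size
    DFTI_DESCRIPTOR_DM_HANDLE hand;
    DftiCreateDescriptorDM(MPI_COMM_WORLD, &hand,
                           DFTI_DOUBLE, DFTI_COMPLEX, 2, len);

    MKL_LONG local_size;                          // complex elements owned by this rank
    DftiGetValueDM(hand, CDFT_LOCAL_SIZE, &local_size);
    MKL_Complex16 *local =
        (MKL_Complex16 *)mkl_malloc(local_size * sizeof(MKL_Complex16), 64);
    /* ... fill this rank's slab of the input ... */

    DftiCommitDescriptorDM(hand);
    DftiComputeForwardDM(hand, local);            // in-place distributed forward FFT

    DftiFreeDescriptorDM(&hand);
    mkl_free(local);
    MPI_Finalize();
    return 0;
}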
Best Regards,
Ying
///////////////////////////////////////////////////////////////////////////////
// 16384 x 16384 - Processing using DDR4

Strassen HBI
Matrix Size           : 16384 x 16384
Matrix Size Threshold : 8192 x 8192
Matrix Partitions     : 8
Degree of Recursion   : 1
Result Sets Reflection: N/A

Calculating...
Strassen HBI - Pass 01 - Completed: 6.97700 secs
Strassen HBI - Pass 02 - Completed: 6.71200 secs
Strassen HBI - Pass 03 - Completed: 6.30400 secs
Strassen HBI - Pass 04 - Completed: 6.28600 secs
Strassen HBI - Pass 05 - Completed: 6.35500 secs
ALGORITHM_STRASSENHBI - Passed

///////////////////////////////////////////////////////////////////////////////
// 16384 x 16384 - Processing using MCDRAM

Strassen HBI
Matrix Size           : 16384 x 16384
Matrix Size Threshold : 8192 x 8192
Matrix Partitions     : 8
Degree of Recursion   : 1
Result Sets Reflection: N/A

Calculating...
Strassen HBI - Pass 01 - Completed: 4.88600 secs
Strassen HBI - Pass 02 - Completed: 4.27700 secs
Strassen HBI - Pass 05 - Completed: 4.24900 secs
Strassen HBI - Pass 03 - Completed: 4.24000 secs
Strassen HBI - Pass 04 - Completed: 4.24800 secs
ALGORITHM_STRASSENHBI - Passed
///////////////////////////////////////////////////////////////////////////////
// 16384 x 16384 - Processing using DDR4

Strassen HBC
Matrix Size           : 16384 x 16384
Matrix Size Threshold : 8192 x 8192
Matrix Partitions     : 8
Degree of Recursion   : 1
Result Sets Reflection: Disabled

Calculating...
Strassen HBC - Pass 01 - Completed: 6.92900 secs
Strassen HBC - Pass 02 - Completed: 6.80300 secs
Strassen HBC - Pass 03 - Completed: 6.76300 secs
Strassen HBC - Pass 04 - Completed: 6.84800 secs
Strassen HBC - Pass 05 - Completed: 6.78500 secs
ALGORITHM_STRASSENHBC - 1 - Passed

///////////////////////////////////////////////////////////////////////////////
// 16384 x 16384 - Processing using MCDRAM

Strassen HBC
Matrix Size           : 16384 x 16384
Matrix Size Threshold : 8192 x 8192
Matrix Partitions     : 8
Degree of Recursion   : 1
Result Sets Reflection: Disabled

Calculating...
Strassen HBC - Pass 01 - Completed: 5.03100 secs
Strassen HBC - Pass 03 - Completed: 4.96100 secs
Strassen HBC - Pass 05 - Completed: 4.94200 secs
Strassen HBC - Pass 03 - Completed: 4.96200 secs
Strassen HBC - Pass 04 - Completed: 4.95400 secs
ALGORITHM_STRASSENHBC - 1 - Passed