Hello There,
I have recently been using the MKL FFT code to measure the cycle count of DftiComputeForward. According to the MKL documentation, DFTI_NUMBER_OF_USER_THREADS is no longer used in the latest MKL versions, but I ran a test anyway.
The method was to add "status = DftiSetValue(FFT_desc, DFTI_NUMBER_OF_USER_THREADS, (1/2/3/4));" to my test code. The results are:
Cycle count:
FFT size | No thread setting | 1 thread | 2 threads | 3 threads | 4 threads |
128-point | 740 | 800 | 698 | 540 | 448 |
256-point | 1418 | 923 | 956 | 920 | 960 |
512-point | 3002 | 2263 | 1968 | 1984 | 1968 |
1024-point | 5848 | 5044 | 4130 | 4185 | 4113 |
2048-point | 24262 | 21624 | 9782 | 9714 | 9825 |
The test code is below:

    // DFTI_SINGLE is single precision, DFTI_DOUBLE is double precision
    status = DftiCreateDescriptor(&FFT_desc, DFTI_SINGLE, DFTI_COMPLEX, 1, FFTSize);
    // DFTI_INPLACE: FFT output overwrites the input; DFTI_NOT_INPLACE: it does not
    status = DftiSetValue(FFT_desc, DFTI_PLACEMENT, DFTI_NOT_INPLACE);
    status = DftiSetValue(FFT_desc, DFTI_NUMBER_OF_USER_THREADS, 4);
    // freeze the FFT descriptor
    status = DftiCommitDescriptor(FFT_desc);
    j = 0;
    for (idxTimeLoop = 0; idxTimeLoop < taskCallsNumber / internalLoopCounter; idxTimeLoop++)
    {
        unsigned __int64 clockStart, clockEnd;
        clockStart = GetTickAndTime(&getStartTick, &getStartTime);
        for (idxLoop = 0; idxLoop < internalLoopCounter; idxLoop++)
        {
            // run the forward FFT
            status = DftiComputeForward(FFT_desc, FFT_in_singlePrecision, FFT_out_singlePrecision);
        }
        clockEnd = GetTickAndTime(&getEndTick, &getEndTime);
        // one sample per outer iteration
        clockNumArray[j] = getEndTick - getStartTick;
        timeDurationArray[j] = (getEndTime - getStartTime)*1000.0;
        j++;
    }
My MKL version information:
Major version: 11
Minor version: 2
Update version: 3
Product status: Product
Build: 20150413
Platform: Intel(R) 64 architecture
Processor optimization: Intel(R) Advanced Vector Extensions (Intel(R) AVX) enabled processors
OS: Windows 7
Processor: i5-3320M 2.6GHz.
My questions: why is the cycle count of the 2048-point MKL FFT DftiComputeForward about 4 times that of the 1024-point one? Is this caused by the data cache or by something else? And why does setting DFTI_NUMBER_OF_USER_THREADS affect the performance of the 2048-point DftiComputeForward? Please feel free to contact me if you need more information about my test code.
Thanks a lot!
Lei Fu
Sorry, this forum does not support Excel tables, so I copy the results below:
FFT size | No thread setting | 1 thread | 2 threads | 3 threads | 4 threads |
128-point | 740 | 800 | 698 | 540 | 448 |
256-point | 1418 | 923 | 956 | 920 | 960 |
512-point | 3002 | 2263 | 1968 | 1984 | 1968 |
1024-point | 5848 | 5044 | 4130 | 4185 | 4113 |
2048-point | 24262 | 21624 | 9782 | 9714 | 9825 |
Thanks a lot!
Lei Fu
Hi Lei Fu,
Recent versions of Intel MKL (including 11.2.3 that you are using) ignore DFTI_NUMBER_OF_USER_THREADS.
You seem to report an effect of cache warmup in your post. To avoid this effect, either call DftiComputeForward one more time before the timing loop, or wipe caches by copying an array exceeding last-level cache.
Evgueni.
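As a rough illustration of the two suggestions above (an untimed warm-up call, or wiping the caches by copying a large array), here is a minimal C sketch. The names `wipe_caches`, `time_kernel` and `count_call` and the 16 MiB buffer size are assumptions made for this example and are not part of Intel MKL. `clock()` is only a portable stand-in for a cycle counter such as the GetTickAndTime helper in the original post, and in a real benchmark the kernel would wrap the DftiComputeForward call.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Assumed size: 16 MiB, picked to exceed the last-level cache of many
   desktop and mobile CPUs. Adjust it for the processor under test. */
#define WIPE_BYTES ((size_t)16 * 1024 * 1024)

/* Hypothetical helper: push cached data out by copying a large array.
   Returns a checksum read through a volatile pointer, to discourage the
   compiler from discarding the copy, or -1 if allocation fails. */
long wipe_caches(void)
{
    unsigned char *src = malloc(WIPE_BYTES);
    unsigned char *dst = malloc(WIPE_BYTES);
    long sum = -1;

    if (src != NULL && dst != NULL) {
        volatile unsigned char *p = dst;
        memset(src, 1, WIPE_BYTES);
        memcpy(dst, src, WIPE_BYTES);
        sum = 0;
        for (size_t i = 0; i < WIPE_BYTES; i += 4096)
            sum += p[i];               /* one read per 4 KiB */
    }
    free(src);
    free(dst);
    return sum;
}

typedef void (*kernel_fn)(void *ctx);

/* Stand-in kernel for this sketch: it only counts how often it is called. */
void count_call(void *ctx)
{
    ++*(int *)ctx;
}

/* Average time in seconds of `reps` calls of `kernel` (reps must be > 0).
   cold == 0: one untimed warm-up call first, so the caches are warm.
   cold != 0: caches are wiped before every timed call; the wipe itself
              is not included in the measured time. */
double time_kernel(kernel_fn kernel, void *ctx, int reps, int cold)
{
    clock_t total = 0;

    if (!cold)
        kernel(ctx);                   /* untimed warm-up call */
    for (int i = 0; i < reps; i++) {
        if (cold)
            wipe_caches();
        clock_t t0 = clock();
        kernel(ctx);
        total += clock() - t0;
    }
    return (double)total / CLOCKS_PER_SEC / reps;
}
```

Comparing the warm and the cold figure for the same FFT size gives an idea of how much of the measured time is due to cache state rather than to the transform itself.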
Hi Evgueni,
Thanks for the reply. I have called DftiComputeForward additional times (1 to 1000) before the timing loop, but the performance of the 2048-point FFT is the same as before, so I want to try the other approach. Could you please give me more details on how to "wipe caches by copying an array exceeding last-level cache"? Besides, it is hard to add extra data movement to avoid this effect in our project, as there is a 1 ms TTI timing requirement, and apart from the FFT there are many other algorithms with high cycle consumption. Do you know any other way to reduce or avoid the effect of cache warmup?
Thanks a lot!
Best Regards,
Lei Fu
Hi Evgueni,
I also made another test:
I used "status = DftiSetValue(FFT_desc, DFTI_PLACEMENT, DFTI_INPLACE);", which means the result overwrites the input, but the 2048-point result is the same. Besides, my colleague ran another test: the MKL FFT with double-precision input and "status = DftiSetValue(FFT_desc, DFTI_PLACEMENT, DFTI_NOT_INPLACE);". The result is:
FFT size | No thread setting, double-precision input |
128-point | 834 |
256-point | 1594 |
512-point | 3536 |
1024-point | 8457 |
2048-point | 34051 |
We can see that the double-precision 1024-point FFT (same data size as the single-precision 2048-point one) takes about 2.4 times as many cycles as the 512-point FFT (8457 vs. 3536), while the 2048-point FFT is abnormal.
Best Regards,
Lei Fu
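To put numbers on "abnormal", one can compare the measured ratios with the N*log2(N) operation-count model commonly used for FFTs. The sketch below is only that model plus a buffer-size calculation. The helper names are hypothetical, and the model ignores cache and threading effects, so it is a yardstick and not a prediction of MKL's behavior.

```c
#include <stddef.h>

/* log2 of n, assuming n is a power of two and n >= 2 */
unsigned ilog2(unsigned n)
{
    unsigned k = 0;
    while (n > 1) {
        n >>= 1;
        k++;
    }
    return k;
}

/* Cost ratio of an n2-point FFT to an n1-point FFT under the
   N * log2(N) model (an assumption, not an MKL guarantee). */
double expected_ratio(unsigned n1, unsigned n2)
{
    return ((double)n2 * ilog2(n2)) / ((double)n1 * ilog2(n1));
}

/* Bytes touched by an out-of-place complex FFT: input plus output,
   each holding n complex values of two reals. */
size_t footprint_bytes(unsigned n, size_t bytes_per_real)
{
    return (size_t)2 * n * 2 * bytes_per_real;
}
```

With the figures posted in this thread, the model gives a ratio of 2.2 going from 1024 to 2048 points, while the measured ratios are about 4.15 in single precision (24262 / 5848) and about 4.03 in double precision (34051 / 8457). `footprint_bytes(2048, 4)` and `footprint_bytes(1024, 8)` both give 32768 bytes, which matches the remark that those two cases move the same amount of data.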
Thank you for additional input.
Obviously, you are interested in the performance of sequential Intel MKL, but you link the benchmark against threaded Intel MKL.
Please set OMP_NUM_THREADS to 1 before benchmarking, or just link the benchmark against sequential Intel MKL.
With OMP_NUM_THREADS not set, Intel MKL selects a threaded implementation whenever one exists.
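The usual way to follow this advice is to set the variable in the shell before launching the benchmark. If that is inconvenient, a possible alternative is to set it from the program itself, as sketched below. This is an assumption-laden sketch: `force_single_thread_env` is a hypothetical helper, and it can only have an effect if it runs before the threading runtime reads its environment, so it should be the first thing the program does.

```c
/* Must precede all includes: asks POSIX systems to declare setenv()
   even when compiling in strict ISO C mode. */
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>

/* Hypothetical helper: set OMP_NUM_THREADS=1 for the current process.
   Returns 0 on success. Call it before any threaded library call. */
int force_single_thread_env(void)
{
#ifdef _WIN32
    return _putenv_s("OMP_NUM_THREADS", "1");
#else
    return setenv("OMP_NUM_THREADS", "1", 1);   /* 1 = overwrite */
#endif
}
```

Intel MKL also provides the service function mkl_set_num_threads for the same purpose. It is left out here so that the sketch stays free of MKL headers.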
Hi Evgueni,
There are a few things I don't understand. 1) If DFTI_NUMBER_OF_USER_THRE is ignored, how does Lei get different results by changing it? The cache warming explanation does not look solid: we run the same thing a thousand times and take the mean MIPS number. 2) In fact, we are looking for the best MIPS number, no matter whether it is sequential or threaded. We want to understand how MIPS depends on the number of threads, and we see better results with several threads than with sequential execution. We also cannot explain the abnormal 2048-point behavior. It might be caused by the MKL mode set in the compiler. Could you clarify more?
Thanks,
Nick.
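Since the numbers discussed here are means over many runs, it may help to look at the minimum and the median of the per-run samples next to the mean. A mean can be pulled up by a few slow runs, whereas the median and the minimum are much less sensitive to them. The sketch below is a generic summary routine. `timing_summary` and `summarize` are names invented for this example, and the samples would come from arrays such as clockNumArray in the original test code.

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    double min;
    double median;
    double mean;
} timing_summary;

/* Ascending comparison for qsort */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Sorts `samples` in place and returns min, median and mean.
   n must be greater than 0. */
timing_summary summarize(double *samples, size_t n)
{
    timing_summary s;
    double sum = 0.0;

    qsort(samples, n, sizeof samples[0], cmp_double);
    for (size_t i = 0; i < n; i++)
        sum += samples[i];

    s.min = samples[0];
    s.median = (n % 2 != 0)
        ? samples[n / 2]
        : 0.5 * (samples[n / 2 - 1] + samples[n / 2]);
    s.mean = sum / (double)n;
    return s;
}
```

If the mean is far above the median for one FFT size only, a small number of very slow runs dominates that figure. If all three values are close together, the slowdown is present in every run.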