Question: cycle count of 2048-point MKL FFT DftiComputeForward

Lei_F_Intel1 · ‎06-08-2015

Hello There,

Recently I have been using the MKL FFT code to measure the cycle count of DftiComputeForward. According to the MKL documentation, DFTI_NUMBER_OF_USER_THREADS is no longer used in the latest MKL versions, but I ran a test anyway.

I added "status = DftiSetValue(FFT_desc, DFTI_NUMBER_OF_USER_THREADS, n);" with n = 1, 2, 3, or 4 to my test code. The results are:

Cycle count:

FFT size      No thread setting   1 thread   2 threads   3 threads   4 threads
128-point     740                 800        698         540         448
256-point     1418                923        956         920         960
512-point     3002                2263       1968        1984        1968
1024-point    5848                5044       4130        4185        4113
2048-point    24262               21624      9782        9714        9825

The test code is below:

        // DFTI_SINGLE selects single precision; DFTI_DOUBLE selects double precision
        status = DftiCreateDescriptor(&FFT_desc, DFTI_SINGLE, DFTI_COMPLEX, 1, FFTSize);
        // DFTI_INPLACE: the FFT output overwrites the input; DFTI_NOT_INPLACE: it does not
        status = DftiSetValue(FFT_desc, DFTI_PLACEMENT, DFTI_NOT_INPLACE);
        status = DftiSetValue(FFT_desc, DFTI_NUMBER_OF_USER_THREADS, 4);
        // commit (freeze) the FFT descriptor
        status = DftiCommitDescriptor(FFT_desc);

        j = 0;
        for (idxTimeLoop = 0; idxTimeLoop < taskCallsNumber / internalLoopCounter; idxTimeLoop++)
        {
            unsigned __int64 clockStart, clockEnd;
            clockStart = GetTickAndTime(&getStartTick, &getStartTime);

            for (idxLoop = 0; idxLoop < internalLoopCounter; idxLoop++)
            {
                // run the forward FFT
                status = DftiComputeForward(FFT_desc, FFT_in_singlePrecision, FFT_out_singlePrecision);
            }
            clockEnd = GetTickAndTime(&getEndTick, &getEndTime);
            // store the tick count and the duration in ms for outer iteration j
            clockNumArray[j] = getEndTick - getStartTick;
            timeDurationArray[j] = (getEndTime - getStartTime)*1000.0;
            j++;
        }

My MKL version information:
Major version:           11
Minor version:           2
Update version:          3
Product status:          Product
Build:                   20150413
Platform:                Intel(R) 64 architecture
Processor optimization: Intel(R) Advanced Vector Extensions (Intel(R) AVX) enabled processors

OS: Windows 7

Processor: i5-3320M 2.6GHz.

My questions: Why is the cycle count of the 2048-point MKL FFT DftiComputeForward about 4 times that of the 1024-point FFT? Is this caused by the data cache or by something else? And why does setting DFTI_NUMBER_OF_USER_THREADS affect the performance of the 2048-point DftiComputeForward? Please feel free to contact me if you need more information about my test code.
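
For reference, the size of this jump can be compared with what the usual N*log2(N) cost model for an FFT would predict. The sketch below is not part of the original test code: the function names are hypothetical, the model ignores constant factors and memory effects, and the cycle counts used in the note after it are the ones measured without any thread setting.

```c
#include <assert.h>

/* Integer log2 of a power-of-two FFT size. */
unsigned ilog2(unsigned n)
{
    unsigned k = 0;
    while (n > 1) {
        n >>= 1;
        ++k;
    }
    return k;
}

/* Expected cost ratio between two FFT sizes under an N*log2(N) model.
   This is only a model (an assumption), not how MKL is implemented. */
double fft_expected_ratio(unsigned n_small, unsigned n_large)
{
    assert(n_small > 1 && n_large > 1);
    double cost_small = (double)n_small * ilog2(n_small);
    double cost_large = (double)n_large * ilog2(n_large);
    return cost_large / cost_small;
}

/* Ratio between two measured cycle counts. */
double measured_ratio(double cycles_small, double cycles_large)
{
    assert(cycles_small > 0.0);
    return cycles_large / cycles_small;
}
```

With these helpers, fft_expected_ratio(1024, 2048) gives 2.2, while measured_ratio(5848, 24262) gives about 4.15. For the step from 512 to 1024 points the model predicts about 2.22 and the measurement gives about 1.95, so only the 2048-point case departs clearly from the model.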

Thanks a lot!

Lei Fu

Lei_F_Intel1 · ‎06-08-2015

Sorry, this forum does not support pasting from Excel, so I have copied the results (cycle counts) below:

FFT size      No thread setting   1 thread   2 threads   3 threads   4 threads
128-point     740                 800        698         540         448
256-point     1418                923        956         920         960
512-point     3002                2263       1968        1984        1968
1024-point    5848                5044       4130        4185        4113
2048-point    24262               21624      9782        9714        9825

Thanks a lot!

Lei Fu

Evgueni_P_Intel · ‎06-08-2015

Hi Lei Fu,

Recent versions of Intel MKL (including 11.2.3 that you are using) ignore DFTI_NUMBER_OF_USER_THREADS.

You seem to report an effect of cache warmup in your post. To avoid this effect, either call DftiComputeForward one more time before the timing loop, or wipe caches by copying an array exceeding last-level cache.
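
A minimal sketch of the second option, using only the C standard library. The function name wipe_caches and the buffer size are hypothetical; the only requirement is that the size passed in exceeds the last-level cache of the machine under test, which has to be looked up for the actual processor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Push previously cached data out of the caches by copying a buffer that
   is larger than the last-level cache.
   Returns 0 on success and -1 if the buffers cannot be allocated. */
int wipe_caches(size_t bytes)
{
    assert(bytes > 0);
    unsigned char *src = malloc(bytes);
    unsigned char *dst = malloc(bytes);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1;
    }

    memset(src, 1, bytes);   /* touch every source byte so the pages are really mapped */
    memcpy(dst, src, bytes); /* the copy streams both buffers through the caches */

    /* Read one byte per 64 bytes (a common cache-line size, assumed here)
       through a volatile pointer so the compiler cannot drop the copy. */
    const volatile unsigned char *check = dst;
    unsigned long sum = 0;
    for (size_t i = 0; i < bytes; i += 64)
        sum += check[i];
    (void)sum;

    free(src);
    free(dst);
    return 0;
}
```

A call such as wipe_caches(16u * 1024u * 1024u) would go before the timed region, never inside it, so that the copy itself is not counted. This gives a cold-cache measurement; calling DftiComputeForward once before the timing loop gives a warm-cache measurement instead.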

Evgueni.

Lei_F_Intel1 · ‎06-09-2015

Hi Evgueni,

Thanks for the reply. I called DftiComputeForward additional times (1 to 1000) before the timing loop, but the 2048-point performance is the same as before, so I would like to try the other approach. Could you please give me more details on how to "wipe caches by copying an array exceeding last-level cache"? Also, it is hard to add extra data movement to avoid this effect in my project, because we have a 1 ms TTI timing requirement and, besides the FFT, there are many other cycle-hungry algorithms. Do you know any other way to reduce or avoid the effect of cache warmup?

Thanks a lot!

Best Regards,

Lei Fu

Lei_F_Intel1 · ‎06-09-2015

Hi Evgueni,

I also made another test:

I used "status = DftiSetValue(FFT_desc, DFTI_PLACEMENT, DFTI_INPLACE);", which means the result overwrites the input, but the 2048-point result is the same. In addition, my colleague ran another test: MKL FFT with double-precision input and "status = DftiSetValue(FFT_desc, DFTI_PLACEMENT, DFTI_NOT_INPLACE);". The result is:

FFT size      Cycle count (no thread setting, double-precision input)
128-point     834
256-point     1594
512-point     3536
1024-point    8457
2048-point    34051

We can see that the double-precision 1024-point FFT (same data size as the single-precision 2048-point FFT) takes about 2 times as many cycles as the 512-point FFT, while the 2048-point FFT is abnormal.
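
The size comparison can be checked with a little arithmetic. The helpers below are hypothetical and count only the input and output buffers; any internal tables MKL keeps in the descriptor are not included, and the cache sizes to compare against depend on the processor.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes in one buffer of n complex points.
   real_bytes is 4 for single precision and 8 for double precision. */
size_t fft_buffer_bytes(size_t n, size_t real_bytes)
{
    assert(real_bytes == 4 || real_bytes == 8);
    return n * 2 * real_bytes; /* real part + imaginary part per point */
}

/* Bytes touched by one transform: a single shared buffer for
   DFTI_INPLACE, separate input and output for DFTI_NOT_INPLACE. */
size_t fft_working_set_bytes(size_t n, size_t real_bytes, int in_place)
{
    size_t one_buffer = fft_buffer_bytes(n, real_bytes);
    return in_place ? one_buffer : 2 * one_buffer;
}
```

Both fft_buffer_bytes(2048, 4) and fft_buffer_bytes(1024, 8) come to 16384 bytes, and with DFTI_NOT_INPLACE the input plus output take 32768 bytes in either case. Since the double-precision 1024-point case does not show the jump, buffer size alone does not appear to explain the 2048-point result.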

Best Regards,

Lei Fu

Evgueni_P_Intel · ‎06-09-2015

Thank you for additional input.

Apparently you are interested in the performance of sequential Intel MKL but are linking the benchmark against threaded Intel MKL.

Please set OMP_NUM_THREADS to 1 before benchmarking, or just link the benchmark against sequential Intel MKL.

With OMP_NUM_THREADS not set, Intel MKL selects the threaded implementation whenever one exists.
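
Because the result depends on whether these variables are set, it can help to record them at the start of each benchmark run. The helper below is a hypothetical sketch that only reads the environment; it does not change how MKL threads. MKL_NUM_THREADS is the MKL-specific counterpart of OMP_NUM_THREADS.

```c
#include <assert.h>
#include <stdlib.h>

/* Return the value of an environment variable that controls threading,
   or the text "(not set)" when it is absent or empty. */
const char *thread_env_or_default(const char *name)
{
    assert(name != NULL);
    const char *value = getenv(name);
    return (value != NULL && *value != '\0') ? value : "(not set)";
}
```

Printing thread_env_or_default("OMP_NUM_THREADS") and thread_env_or_default("MKL_NUM_THREADS") next to each set of cycle counts makes it clear which configuration produced them.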

Nikolai_L_Intel · ‎06-09-2015

Hi Evgueni,

I didn't understand a few things: 1) If DFTI_NUMBER_OF_USER_THRE is ignored, how does Lei get different results by changing it? The cache warmup explanation does not look solid: we run the same call a thousand times and take the mean MIPS number. 2) In fact, we are looking for the best MIPS number, regardless of whether it is sequential or threaded. We want to understand how MIPS depends on the number of threads, and we see better results with several threads than with sequential execution. Also, we can't explain this abnormal 2048-point behavior. It seems it might be caused by the MKL mode setting in the compiler. Could you clarify further?

Thanks,

Nick.