Intel® oneAPI Math Kernel Library

Question: cycle count of the 2048-point MKL FFT DftiComputeForward code

Lei_F_Intel1
Employee

Hello There,

Recently I have been using the MKL FFT code to measure the cycle count of DftiComputeForward. According to the MKL documentation, DFTI_NUMBER_OF_USER_THREADS is no longer used in the latest MKL versions, but I ran a test anyway.

The method is to add "status = DftiSetValue(FFT_desc, DFTI_NUMBER_OF_USER_THREADS, N);" to my test code, with N = 1, 2, 3, or 4. The results are:

Cycle count of DftiComputeForward for each DFTI_NUMBER_OF_USER_THREADS setting:

FFT size     No setting   1 thread   2 threads   3 threads   4 threads
128-point           740        800         698         540         448
256-point          1418        923         956         920         960
512-point          3002       2263        1968        1984        1968
1024-point         5848       5044        4130        4185        4113
2048-point        24262      21624        9782        9714        9825

The test code is below:

        // DFTI_SINGLE selects single precision; DFTI_DOUBLE selects double precision.
        status = DftiCreateDescriptor(&FFT_desc, DFTI_SINGLE, DFTI_COMPLEX, 1, FFTSize);
        // DFTI_INPLACE: the FFT output overwrites the input; DFTI_NOT_INPLACE: it does not.
        status = DftiSetValue(FFT_desc, DFTI_PLACEMENT, DFTI_NOT_INPLACE);
        status = DftiSetValue(FFT_desc, DFTI_NUMBER_OF_USER_THREADS, 4);
        // Commit (freeze) the FFT descriptor.
        status = DftiCommitDescriptor(FFT_desc);

        j = 0;
        for (idxTimeLoop = 0; idxTimeLoop < taskCallsNumber / internalLoopCounter; idxTimeLoop++)
        {
            unsigned __int64 clockStart, clockEnd;
            clockStart = GetTickAndTime(&getStartTick, &getStartTime);

            for (idxLoop = 0; idxLoop < internalLoopCounter; idxLoop++)
            {
                // Run the forward FFT.
                status = DftiComputeForward(FFT_desc, FFT_in_singlePrecision, FFT_out_singlePrecision);
            }
            clockEnd = GetTickAndTime(&getEndTick, &getEndTime);
            // Store the tick count and time of this batch of internalLoopCounter transforms.
            clockNumArray[j] = getEndTick - getStartTick;
            timeDurationArray[j] = (getEndTime - getStartTime) * 1000.0;
            j++;
        }

My MKL version information:
Major version:           11
Minor version:           2
Update version:          3
Product status:          Product
Build:                   20150413
Platform:                Intel(R) 64 architecture
Processor optimization:  Intel(R) Advanced Vector Extensions (Intel(R) AVX) enabled processors

OS: Windows 7

Processor: i5-3320M, 2.6 GHz.

My questions: Why is the cycle count of the 2048-point MKL FFT DftiComputeForward about 4 times that of the 1024-point one? Is this caused by the data cache or by something else? And why does setting DFTI_NUMBER_OF_USER_THREADS affect the performance of the 2048-point DftiComputeForward? Please feel free to contact me if you need more information about my test code.
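
To put "about 4 times" in context, here is a minimal sketch that compares a measured cycle ratio with the ratio one would expect if the cost grew like N*log2(N), as for a textbook radix-2 FFT. That scaling is an assumption for illustration only; MKL's internal kernels may scale differently. The function names are hypothetical and are not part of MKL.

```c
#include <assert.h>

/* Integer log2 of a power-of-two FFT size (for example, 2048 -> 11). */
static unsigned ilog2_pow2(unsigned n)
{
    unsigned bits = 0;
    while (n > 1u) {
        n >>= 1;
        bits++;
    }
    return bits;
}

/* Ratio of N*log2(N) work between two power-of-two sizes.
   Assumption: textbook radix-2 scaling, not MKL's actual operation count. */
double expected_cost_ratio(unsigned n_small, unsigned n_large)
{
    assert(n_small > 1u && n_large > 1u);
    return (double)(n_large * ilog2_pow2(n_large)) /
           (double)(n_small * ilog2_pow2(n_small));
}

/* Ratio of two measured cycle counts. */
double measured_ratio(double cycles_small, double cycles_large)
{
    return cycles_large / cycles_small;
}
```

With the "No setting" column above, expected_cost_ratio(1024, 2048) gives 2.2, while measured_ratio(5848, 24262) gives about 4.15, so the 2048-point result is roughly twice as slow as this simple model would suggest.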

Thanks a lot!

Lei Fu

 

Lei_F_Intel1
Employee

Sorry, this forum does not support pasting from Excel, so I am copying the results below:

FFT size     No setting   1 thread   2 threads   3 threads   4 threads
128-point           740        800         698         540         448
256-point          1418        923         956         920         960
512-point          3002       2263        1968        1984        1968
1024-point         5848       5044        4130        4185        4113
2048-point        24262      21624        9782        9714        9825

 

Thanks a lot!

Lei Fu

Evgueni_P_Intel
Employee

Hi Lei Fu,

Recent versions of Intel MKL (including 11.2.3 that you are using) ignore DFTI_NUMBER_OF_USER_THREADS.

You seem to report an effect of cache warmup in your post. To avoid this effect, either call DftiComputeForward one more time before the timing loop, or wipe caches by copying an array exceeding last-level cache.
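
A minimal sketch of the second option, wiping the caches by copying a buffer larger than the last-level cache, could look like the following. The 3 MB cache size is an assumption for the i5-3320M and should be checked against the processor specification; the function name wipe_caches and the 4x safety factor are hypothetical choices and are not part of MKL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Assumed last-level cache size in bytes (3 MB); adjust for your processor. */
#define LLC_BYTES   (3u * 1024u * 1024u)
/* Copy several times the cache size so that old data is likely to be evicted. */
#define FLUSH_BYTES (4u * LLC_BYTES)

/* Fills one large buffer and copies it into another, which pushes
   previously cached data (such as the FFT input) out of the caches.
   Returns the last copied byte so the compiler cannot drop the copy,
   or -1 if the buffers cannot be allocated. */
int wipe_caches(void)
{
    unsigned char *src = malloc(FLUSH_BYTES);
    unsigned char *dst = malloc(FLUSH_BYTES);
    volatile unsigned char sink;
    int result = -1;

    if (src != NULL && dst != NULL) {
        memset(src, 0x5A, FLUSH_BYTES);
        memcpy(dst, src, FLUSH_BYTES);
        sink = dst[FLUSH_BYTES - 1];   /* observable read of the copy */
        result = sink;
    }
    free(src);
    free(dst);
    return result;
}
```

Calling wipe_caches() outside the timed region, before each measurement, should give consistently cold-cache numbers, whereas an extra untimed DftiComputeForward call gives warm-cache numbers. Which one is representative depends on how the FFT is used in the real application.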

Evgueni.

Lei_F_Intel1
Employee

Hi Evgueni,

Thanks for the reply. I have called DftiComputeForward additional times (1 to 1000) before the timing loop, but the 2048-point performance is still the same as before, so I want to try another way. Could you please give me more details on how to "wipe caches by copying an array exceeding last-level cache"? Besides, it is hard to add extra data movement to avoid this effect in my project, because we have a 1 ms TTI timing requirement and, apart from the FFT, there are many other algorithms with high cycle consumption. Do you know any other way to reduce or avoid the effect of cache warmup?

Thanks a lot!

Best Regards,

Lei Fu

Lei_F_Intel1
Employee

Hi Evgueni,

I also made another test:

I used "status = DftiSetValue(FFT_desc, DFTI_PLACEMENT, DFTI_INPLACE);", which means the result overwrites the input, but the 2048-point result is the same. Besides, my colleague has made another test: running the MKL FFT with double-precision input and setting "status = DftiSetValue(FFT_desc, DFTI_PLACEMENT, DFTI_NOT_INPLACE);". The result is:

FFT size     Cycle count (no thread setting, double-precision input)
128-point            834
256-point           1594
512-point           3536
1024-point          8457
2048-point         34051

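We can see that the double-precision 1024-point FFT (the same data size as the single-precision 2048-point FFT) takes about 2.4 times as many cycles as the 512-point FFT (8457 vs. 3536), while the 2048-point FFT is abnormal: it takes about 4 times as many cycles as the 1024-point FFT (34051 vs. 8457).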

Best Regards,

Lei Fu 

Evgueni_P_Intel
Employee

Thank you for additional input.

Apparently you are interested in the performance of sequential Intel MKL but are linking the benchmark against threaded Intel MKL.

Please set OMP_NUM_THREADS to 1 before benchmarking, or just link the benchmark against sequential Intel MKL.

When OMP_NUM_THREADS is not set, Intel MKL selects a threaded implementation whenever one exists.

Nikolai_L_Intel
Employee

Hi Evgueni,

I didn't understand a few things: 1) If DFTI_NUMBER_OF_USER_THRE is ignored, how does Lei get different results by changing it? The cache-warming explanation does not look solid: we run the same transform a thousand times and take the mean MIPS number. 2) In fact, we are looking for the best MIPS number, regardless of whether it is sequential or threaded. We want to understand how MIPS depends on the number of threads, and we see better results with several threads than with sequential execution. We also cannot explain the abnormal 2048-point behavior. It might be caused by the MKL mode set in the compiler. Could you clarify?
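
One way to check whether a few slow batches are pulling the mean up is to look at the minimum and median of the per-batch cycle counts next to the mean. The sketch below is only an illustration: the type cycle_stats and the function summarize_cycles are hypothetical names, and the samples are assumed to be per-batch counts such as those stored in clockNumArray, converted to double.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Summary of a set of per-batch cycle counts. */
typedef struct {
    double min;
    double mean;
    double median;
} cycle_stats;

/* Ascending comparison of two doubles for qsort. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Sorts the samples in place and fills *out.
   Returns 0 on success, or -1 if there are no samples. */
int summarize_cycles(double *samples, size_t count, cycle_stats *out)
{
    double sum = 0.0;
    size_t i;

    assert(samples != NULL && out != NULL);
    if (count == 0) {
        return -1;
    }
    qsort(samples, count, sizeof samples[0], cmp_double);
    for (i = 0; i < count; i++) {
        sum += samples[i];
    }
    out->min = samples[0];
    out->mean = sum / (double)count;
    out->median = (count % 2 == 1)
        ? samples[count / 2]
        : 0.5 * (samples[count / 2 - 1] + samples[count / 2]);
    return 0;
}
```

If the mean is far above the median, a small number of slow batches dominates the average; if the minimum, median, and mean are all close, the slowdown is systematic rather than a warmup effect.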

Thanks,

Nick.

 

 

 
