Solved: MKL Parallelism seems not working

Tosh · 04-22-2022

I have two identical computers and ran the same code, which uses the FFT functions of Intel MKL, on both. However, the calculation speeds differed by about a factor of five. To find the reason, I ran some tests, with the following results.

1) I ran some hardware tests, and they indicated that hardware performance was comparable.

2) I ran another code that does not use MKL, and the calculation speeds were comparable.

3) I monitored CPU usage, and the time spent using all cores seemed different (it was shorter on the slower PC).

Therefore, I suspect that MKL is not parallelized on the slower PC.

I checked the maximum number of MKL threads (using mkl_get_max_threads) and confirmed that it was 16 on both PCs, each of which has 16 cores and 32 threads. I also reinstalled the oneAPI package, after which the difference in calculation time shrank to about a factor of two. However, after some more tests, the difference went back to about a factor of five.
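Besides calling mkl_get_max_threads, it may help to print the threading-related environment on each PC and compare the two outputs. The following is a minimal sketch, assuming a POSIX shell on Linux. The function name report_threading_env is made up for illustration; OMP_NUM_THREADS, MKL_NUM_THREADS, and MKL_DYNAMIC are the usual OpenMP and oneMKL environment variables.

```shell
#!/bin/sh
# Print the settings that commonly influence how many threads MKL uses.
# report_threading_env is a hypothetical helper name, not an MKL tool.
report_threading_env() {
    # Number of logical CPUs currently online.
    echo "logical CPUs: $(getconf _NPROCESSORS_ONLN)"
    for var in OMP_NUM_THREADS MKL_NUM_THREADS MKL_DYNAMIC; do
        # Indirect lookup of the variable named in $var, with a fallback.
        eval "val=\${$var:-not-set}"
        echo "$var=$val"
    done
}

report_threading_env
```

Running this on both PCs and comparing the output with diff should show whether one machine has a thread limit set in its environment that the other does not.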

I do not know how to confirm whether this is the case or how to fix it. Could anyone give me some advice?

If it would help, I can reinstall the OS (Ubuntu).

VidyalathaB_Intel · 04-24-2022

Hi,

Thanks for reaching out to us.

To investigate this further, please provide us with the following details:

MKL Version being used
Timings that you are getting on both the computers
Respective CPU models of the machines
The command you are using to run the code so that we can understand which threading layer interface you are using (OpenMP/tbb)
Sample reproducer to reproduce the issue from our end

Additionally, you can check this with oneMKL verbose mode, whose output reports the execution time of each FFT call in your code.

Before running your code on both systems, please set the following environment variable:

export MKL_VERBOSE=1

For more details about the verbose environment variable, please refer to the following link:

https://www.intel.com/content/www/us/en/develop/documentation/onemkl-linux-developer-guide/top/managing-output/using-onemkl-verbose-mode.html
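As a minimal sketch of how the verbose output could be collected on each PC, the helper below runs a program with MKL_VERBOSE=1 set only for that run, saves everything it prints to a log file, and reports how many verbose lines were captured. The function name run_with_mkl_verbose and the log file name are made up for illustration. The sketch assumes that verbose lines start with the text MKL_VERBOSE, and, as far as I know, each such line also carries an NThr field with the thread count used for the call, which is the value to compare between machines.

```shell
#!/bin/sh
# Run a program with MKL verbose mode enabled for this run only and
# save its combined stdout/stderr to a log file.
# run_with_mkl_verbose is a hypothetical helper name.
run_with_mkl_verbose() {
    prog=$1
    log=$2
    MKL_VERBOSE=1 "$prog" > "$log" 2>&1
    # Print the number of lines that begin with MKL_VERBOSE
    # (assumed prefix of verbose-mode output).
    grep -c '^MKL_VERBOSE' "$log"
}

# Only runs if an executable named ./a.out exists in this directory.
if [ -x ./a.out ]; then
    echo "verbose lines: $(run_with_mkl_verbose ./a.out verbose_log.txt)"
fi
```

Copying the log files from both PCs to one machine and comparing the reported timings and thread counts side by side should show whether the MKL calls themselves behave differently.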

Please get back to us with the above information so that we can better understand the issue.

Regards,

Vidya.

Tosh · 04-25-2022

Hi Vidya,

Thank you for your response. I appreciate it.
Actually, I found the reason for the speed difference.
When I compiled the code with the option '-check all', the calculation slowed down sharply.
The calculation time was 6 sec without the option, but 1 min 9 sec with it.
According to the verbose-mode output, the FFT timings are similar on both PCs.
The additional information is as follows.

MKL Version being used:

Major version: 2022
Minor version: 0
Update version: 0
Product status: Product
Build: 20211112
Platform: Intel(R) 64 architecture
Processor optimization: Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Intel(R) Deep Learning Boost (Intel(R) DL Boost)

================================================================

Timings that you are getting on both the computers:

I ran the attached test code and the timings are as follows:

(PC1)

Running Time: 1min 7sec, CPU Time: 237.7 sec

(PC2)

Running time: 1min 9sec, CPU Time: 238.2 sec

================================================================

Respective CPU models of the machines:

Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz

================================================================

The command you are using to run the code so that we can understand which threading layer interface you are using (OpenMP/tbb)

I used

'ifort -qopenmp -qmkl -traceback -check all test.f90'

to compile the code and

'./a.out &> out.txt &'

to run the code.

================================================================

Sample reproducer to reproduce the issue from our end:

I have attached sample code, 'sample.f90'.

VidyalathaB_Intel · 04-26-2022

Hi,

>>I found the reason of the speed difference.

Glad to know that and thanks for sharing the information with us.

>>If I compiled the code with an option '-check all,' the calculation speed suddenly decreased.

Yes, you are right. I tested it on my side and could see the difference with and without the -check all option.

Including the -check all option disables optimization and overrides any optimization level set by the -O option.

Reference Link: https://www.intel.com/content/www/us/en/develop/documentation/fortran-compiler-oneapi-dev-guide-and-reference/top/compiler-reference/compiler-options/compiler-option-details/language-options/check.html

So this is likely the reason for the decreased speed.
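One way to confirm this on a given machine is to build the same source twice, once with and once without -check all, and time both binaries. The following is a minimal sketch under a few assumptions: the compile flags are the ones quoted earlier in this thread, the source file is named test.f90 as in that command, and the helper name elapsed_seconds and the output names fft_checked and fft_unchecked are made up for illustration. The timer has whole-second resolution only, which is enough for a difference of 6 sec versus about 1 min.

```shell
#!/bin/sh
# Wall-clock seconds taken by a command (whole-second resolution).
# elapsed_seconds is a hypothetical helper name.
elapsed_seconds() {
    start=$(date +%s)
    "$@" > /dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# Only runs when ifort is on PATH and test.f90 is in this directory.
if command -v ifort > /dev/null 2>&1 && [ -f test.f90 ]; then
    # Build with run-time checks, as in the command quoted in the thread.
    ifort -qopenmp -qmkl -traceback -check all test.f90 -o fft_checked
    # Build the same source without the run-time checks.
    ifort -qopenmp -qmkl -traceback test.f90 -o fft_unchecked
    echo "with -check all:    $(elapsed_seconds ./fft_checked) s"
    echo "without -check all: $(elapsed_seconds ./fft_unchecked) s"
fi
```

If the two PCs were using binaries built with different options, this comparison should reproduce the gap on a single machine.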

As the issue is resolved, could you please confirm if we can close this thread from our end?

Regards,

Vidya.

VidyalathaB_Intel · 04-26-2022

Hi,

Thanks for the confirmation.

As the issue is resolved we are closing this thread. Please post a new question if you need any additional assistance from Intel as this thread will no longer be monitored.

Have a Great Day!

Regards,

Vidya.