Hello all,
I've been doing some testing with Intel MKL's mkl_simatcopy function and noticed that its multi-threaded version is in most cases slower than its sequential counterpart (even for large matrices).
The following results were obtained on an Intel i9-10980XE CPU, with environment variables OMP_NUM_THREADS=N and OMP_DYNAMIC=false. I've also tested it with OMP_DYNAMIC=true but the results don't seem to change. The file was compiled using the transposition example Makefile and GCC.
Single-threaded:
Number of threads:1
Major version: 2020
...
Platform: Intel(R) 64 architecture
Processor optimization: Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Intel(R) Deep Learning Boost (Intel(R) DL Boost)
================================================================
Transpose took 0.046586 seconds
Multi-threaded:
Number of threads 2: Transpose took 0.067779 seconds
Number of threads 4: Transpose took 0.033118 seconds
Number of threads 8: Transpose took 0.046896 seconds
Number of threads 10: Transpose took 0.015994 seconds
Number of threads 18: Transpose took 0.045859 seconds
I find these results very strange and can't find a way to explain or improve them.
Any insights regarding how to optimize the parallel version will be deeply appreciated!
Forgot to add that the input matrix is 8000x8000; I also tested with variable sizes.
Hi,
Thanks for reporting this issue. I have forwarded your query to the MKL experts. They will get back to you.
Regards,
Rahul