I've been testing Intel MKL's mkl_simatcopy function and noticed that its multi-threaded version is in most cases slower than its sequential counterpart, even for large matrices.
The following results were obtained on an Intel i9-10980XE CPU, with the environment variables OMP_NUM_THREADS=N and OMP_DYNAMIC=false. I've also tested with OMP_DYNAMIC=true, but the results don't seem to change. The file was compiled with GCC using the Makefile from the transposition example.
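A minimal sketch of how such a measurement could be set up, in case the timing method matters. This is not the actual example file: transpose_inplace is a hypothetical stand-in for the MKL call, and wall_seconds and time_transpose are illustrative names. It assumes wall-clock time is the quantity of interest. A CPU-time clock such as clock() typically adds up the time spent on all threads, which by itself can make a multi-threaded run look slower.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Wall-clock time in seconds, via C11 timespec_get (calendar time, not CPU time). */
double wall_seconds(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical stand-in for the library call under test:
 * in-place transpose of an n x n row-major float matrix. */
void transpose_inplace(float *a, size_t n) {
    assert(a != NULL);
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = i + 1; j < n; ++j) {
            float tmp = a[i * n + j];
            a[i * n + j] = a[j * n + i];
            a[j * n + i] = tmp;
        }
    }
}

/* Run the transpose `reps` times and return the fastest wall-clock time.
 * Taking the minimum reduces the influence of one-off costs such as
 * thread start-up or page faults during the first call. */
double time_transpose(float *a, size_t n, int reps) {
    assert(reps > 0);
    double best = -1.0;
    for (int r = 0; r < reps; ++r) {
        double t0 = wall_seconds();
        transpose_inplace(a, n);
        double dt = wall_seconds() - t0;
        if (best < 0.0 || dt < best) {
            best = dt;
        }
    }
    return best;
}
```

Calling time_transpose(a, n, reps) with an even reps leaves the matrix in its original orientation. Reporting the minimum over several repetitions, not a single run, would also show whether the scatter between thread counts is reproducible.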
Major version: 2020
Platform: Intel(R) 64 architecture
Processor optimization: Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Intel(R) Deep Learning Boost (Intel(R) DL Boost)

Number of threads 1: Transpose took 0.046586 seconds
Number of threads 2: Transpose took 0.067779 seconds
Number of threads 4: Transpose took 0.033118 seconds
Number of threads 8: Transpose took 0.046896 seconds
Number of threads 10: Transpose took 0.015994 seconds
Number of threads 18: Transpose took 0.045859 seconds
I find these results very strange and can't find a way to explain or improve them.
Any insights regarding how to optimize the parallel version will be deeply appreciated!