Re: MKL's simat_copy poor parallel performance

JoaoAlves95 · 01-26-2021

Hello all,

I've been testing Intel MKL's simat_copy function and noticed that its multi-threaded version is in most cases slower than its sequential counterpart, even for large matrices.

The following results were obtained on an Intel i9-10980XE CPU, with the environment variables OMP_NUM_THREADS=N and OMP_DYNAMIC=false. I've also tested with OMP_DYNAMIC=true, but the results don't seem to change. The file was compiled with GCC using the Makefile from the transposition example.

Single-threaded:

Number of threads:1
Major version: 2020
...
Platform: Intel(R) 64 architecture
Processor optimization: Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Intel(R) Deep Learning Boost (Intel(R) DL Boost)
================================================================

Transpose took 0.046586 seconds

Multi-threaded:

Number of threads 2: Transpose took 0.067779 seconds

Number of threads 4: Transpose took 0.033118 seconds

Number of threads 8: Transpose took 0.046896 seconds

Number of threads 10: Transpose took 0.015994 seconds

Number of threads 18: Transpose took 0.045859 seconds

I find these results very strange and can't find a way to explain or improve them.

Any insights regarding how to optimize the parallel version will be deeply appreciated!
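Timings that jump around non-monotonically with the thread count, as above, are often at least partly measurement noise from a single cold run. The post does not show how the time was taken, so the following is only a sketch of one common approach: one untimed warm-up call, then the fastest of several timed calls. The names transpose_inplace, now_seconds and best_of are hypothetical, and the plain-C transpose is a stand-in for the library call, not MKL's algorithm.

```c
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime */
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Plain-C in-place transpose of an n x n row-major matrix.
   Stand-in for the routine under test; not MKL's implementation. */
static void transpose_inplace(float *a, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = i + 1; j < n; ++j) {
            float tmp = a[i * n + j];
            a[i * n + j] = a[j * n + i];
            a[j * n + i] = tmp;
        }
    }
}

/* Monotonic wall-clock time in seconds. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Run one untimed warm-up call, then return the fastest of `reps` timed calls. */
static double best_of(void (*fn)(float *, size_t), float *a, size_t n, int reps)
{
    assert(reps > 0);
    fn(a, n);  /* warm-up: pages touched, threads started, caches primed */
    double best = -1.0;
    for (int r = 0; r < reps; ++r) {
        double t0 = now_seconds();
        fn(a, n);
        double dt = now_seconds() - t0;
        if (best < 0.0 || dt < best) {
            best = dt;
        }
    }
    return best;
}
```

To time the MKL routine the same way, the call would be wrapped in a function with the same shape as transpose_inplace and passed to best_of once per OMP_NUM_THREADS setting. If the best-of-N times still fail to scale, noise is less likely to be the explanation.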

JoaoAlves95 · 01-26-2021

I forgot to add that the input matrix is 8000x8000, and that I also tested with variable sizes.
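With the matrix size known, the timings above can be turned into an effective memory bandwidth, which helps judge whether the transpose is limited by memory rather than by core count. The sketch below assumes single-precision elements (the "s" prefix in the routine name) and that each element is read once and written once, which is a lower bound on the real traffic. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes moved by an in-place transpose of an n x n matrix, assuming every
   element is read once and written once (a lower bound on real traffic). */
static double transpose_bytes_moved(size_t n, size_t elem_size)
{
    assert(n > 0 && elem_size > 0);
    return 2.0 * (double)n * (double)n * (double)elem_size;
}

/* Effective bandwidth in GB/s (1 GB = 1e9 bytes) for a measured time. */
static double effective_gb_per_s(size_t n, size_t elem_size, double seconds)
{
    assert(seconds > 0.0);
    return transpose_bytes_moved(n, elem_size) / seconds / 1e9;
}
```

Under these assumptions, effective_gb_per_s(8000, sizeof(float), 0.046586) is roughly 11 GB/s for the single-threaded run, and the 10-thread time of 0.015994 seconds corresponds to roughly 32 GB/s. Comparing such figures against the machine's measured memory bandwidth indicates how much headroom extra threads could realistically provide.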

RahulV_intel · 01-29-2021

Hi,

Thanks for reporting this issue. I have forwarded your query to the MKL experts. They will get back to you.

Regards,

Rahul

Khang_N_Intel · 05-24-2021

Hi Joao,

I have been waiting for your reply about which OS (Windows or Linux) you were using.

You also didn't give me instructions on how to build the app.

I went ahead and built this code on both Windows and Linux.

Windows:

icl /Qopenmp parallel_test.c /Qmkl=parallel

Linux:

gcc -DMKL_ILP64 -m64 -I"${MKLROOT}/include" parallel_test.c -L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_intel_ilp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl

I was able to build and link on both Windows and Linux. However, when I tried to run the code, it gave me a segmentation fault error.

I tested the code on the latest version of oneMKL, 2021.2.0.
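The test source is not shown in this thread, so the cause of the segmentation fault is unknown. One thing that is cheap to rule out is the allocation: an 8000x8000 single-precision matrix is 256,000,000 bytes, far more than a typical default stack, so it has to live on the heap and the result should be checked. A minimal sketch with hypothetical helper names (plain C; MKL's own aligned allocator mkl_malloc is not used here):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Bytes needed for a rows x cols matrix, or 0 if either dimension is zero
   or the product would overflow size_t. */
static size_t matrix_bytes(size_t rows, size_t cols, size_t elem_size)
{
    assert(elem_size > 0);
    if (rows == 0 || cols == 0) {
        return 0;
    }
    if (rows > SIZE_MAX / cols) {
        return 0;
    }
    size_t count = rows * cols;
    if (count > SIZE_MAX / elem_size) {
        return 0;
    }
    return count * elem_size;
}

/* Heap-allocate a zeroed float matrix; returns NULL on overflow or failure.
   The caller releases it with free(). */
static float *alloc_matrix(size_t rows, size_t cols)
{
    size_t bytes = matrix_bytes(rows, cols, sizeof(float));
    if (bytes == 0) {
        return NULL;
    }
    return calloc(1, bytes);
}
```

A caller would test the returned pointer for NULL before passing it to the transpose routine. If the crash persists with a checked heap allocation, the allocation can be ruled out as the cause.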

Since it has been a long time, I assume that you have already resolved this issue. I will go ahead and close it.