Solved: Why is the parallel version of zheevd slower than the sequential one?

Matthias_R_ · ‎04-29-2017

I wrote a small program (below) to test the speed of the zheevd function and compiled it with the link lines suggested by the Intel Link Line Advisor. These are the times I measured:

sequential: 20.15s

TBB: 30.64s

OpenMP: 30.5s

Why is the sequential code faster than either of the parallel versions? Is it possible to speed up zheevd through parallelization?

program STB
    !use mkl_service
    implicit none
    integer(4)      :: num, i
    call test_herm()
contains
    Subroutine  test_herm()
        Implicit None
        integer(4), parameter         :: N =  4000, LDA =  N, LWMAX =  100000
        integer(4)                    :: info, LWORK, LIWORK, LRWORK, i,j
        real(8)                       :: r,c

        integer(4), dimension(LWMAX)  :: IWORK 
        real(8), dimension(N)         :: W
        real(8), dimension(LWMAX)     :: RWORK
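        ! ZHEEVD expects double complex (COMPLEX*16), which is kind 8 in ifort;
        ! complex(16) would declare quad-precision complex instead.
        complex(8), dimension(LDA, N) :: A
        complex(8), dimension(LWMAX)  :: WORK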

        call mkl_set_num_threads(4)
        call random_seed()
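        ! Fill the strict lower triangle with random entries and mirror the
        ! conjugates into the upper triangle, so that A is Hermitian.
        do i =  1,N
            do j =  1,i-1
                call random_number(r)
                call random_number(c)
                A(i,j) = cmplx(r, c, kind=8)
                A(j,i) = conjg(A(i,j))
            enddo
        enddo

        ! The diagonal of a Hermitian matrix is real.
        do i =  1,N
            call random_number(r)
            A(i,i) = cmplx(r, 0.0d0, kind=8)
        enddo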

        LWORK  = LWMAX 
        LIWORK = LWMAX 
        LRWORK = LWMAX

        !call zheevd('N', 'L', N, A, LDA, W, WORK, LWORK, RWORK, &
                     !LRWORK, IWORK, LIWORK, info)
        CALL ZHEEVD( 'N', 'Lower', N, A, LDA, W, WORK, LWORK, RWORK,&
                  LRWORK, IWORK, LIWORK, INFO )

        write (*,*) "Info: ", info
        write (*,*) "Lwork: ", LWORk
        write (*,*) "Liwork: ", LIWORK
        write (*,*) "LRWORK: ", LRWORK
        !write (*,*) W
    End Subroutine test_herm
end program STB

Ying_H_Intel · ‎05-01-2017

Hi Matthias,

Could you please tell us which CPU and MKL version you are using?

I checked on a Xeon machine:

Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz, 2 processors, 18x2 = 36 physical cores, AVX2 support

MKL 2017 Update 3.

Please note the call to mkl_set_num_threads(4) in the code.

Test 1: OpenMP with 4 threads

ifort zheevd.f90 -mkl

time ./a.out

real    0m6.386s
user    0m21.271s
sys     0m0.445s

Test 2: Sequential

[yhu5_new@hsw-ep01 ~]$ ifort zheevd.f90 -mkl:sequential -o z_s.out
[yhu5_new@hsw-ep01 ~]$ time ./z_s.out
Info:            0
Lwork:       100000
Liwork:       100000
LRWORK:       100000

real    0m23.494s
user    0m23.174s
sys     0m0.273s

Test 3: TBB

[yhu5_new@hsw-ep01 ~]$ ifort zheevd.f90 -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_lp64.a ${MKLROOT}/lib/intel64/libmkl_tbb_thread.a ${MKLROOT}/lib/intel64/libmkl_core.a -Wl,--end-group -ltbb -lstdc++ -lpthread -lm -ldl -o z_b.out
[yhu5_new@hsw-ep01 ~]$ time ./z_b.out
Info:            0
Lwork:       100000
Liwork:       100000
LRWORK:       100000

real    0m4.992s
user    1m47.180s
sys     0m9.120s

Test 4: remove the mkl_set_num_threads(4) for OpenMP threads.

[yhu5_new@hsw-ep01 ~]$ ifort zheevd.f90 -mkl -o z_o.out

[yhu5_new@hsw-ep01 ~]$ time ./z_o.out
Info:            0
Lwork:       100000
Liwork:       100000
LRWORK:       100000

real    0m4.305s
user    1m45.560s
sys     0m1.946s

In these tests, the parallel runs are faster than the sequential one in wall-clock ("real") time.

Best Regards,

Ying

I used ldd to verify which threading library (OpenMP, TBB, or sequential) is linked:

[yhu5_new@hsw-ep01 ~]$ ldd z_o.out
        linux-vdso.so.1 => (0x00007ffd7970b000)
        libmkl_intel_lp64.so => /opt/intel/compilers_and_libraries_2017.3.191/linux/mkl/lib/intel64_lin/libmkl_intel_lp64.so (0x00007fcf318fa000)
        libmkl_intel_thread.so => /opt/intel/compilers_and_libraries_2017.3.191/linux/mkl/lib/intel64_lin/libmkl_intel_thread.so (0x00007fcf2fe68000)
        libmkl_core.so => /opt/intel/compilers_and_libraries_2017.3.191/linux/mkl/lib/intel64_lin/libmkl_core.so (0x00007fcf2e375000)
        libiomp5.so => /opt/intel/compilers_and_libraries_2017.3.191/linux/compiler/lib/intel64/libiomp5.so (0x00007fcf2dfd2000)
        libm.so.6 => /lib64/libm.so.6 (0x00007fcf2dcaa000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fcf2da8e000)
        libc.so.6 => /lib64/libc.so.6 (0x00007fcf2d6cc000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fcf2d4b6000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007fcf2d2b2000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fcf3233c000)
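
The same check can be scripted. This is only a sketch, assuming a dynamically linked binary (a static build such as the TBB link line in Test 3 will not show any libmkl_* entries in ldd). The function name mkl_threading_layer is made up for illustration; the patterns are the usual MKL threading-layer library names.

```shell
# Classify the MKL threading layer from `ldd` output read on stdin.
# mkl_threading_layer is a hypothetical helper, not part of MKL.
mkl_threading_layer() {
    awk '
        /libmkl_intel_thread/ { layer = "openmp" }      # Intel OpenMP layer
        /libmkl_gnu_thread/   { layer = "openmp" }      # GNU OpenMP layer
        /libmkl_tbb_thread/   { layer = "tbb" }
        /libmkl_sequential/   { layer = "sequential" }
        END { print (layer == "" ? "unknown" : layer) }
    '
}

# Usage:
#   ldd ./z_o.out | mkl_threading_layer
```

For the z_o.out listing above this would report openmp, since libmkl_intel_thread.so is present.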

Matthias_R_ · ‎05-01-2017

This is the version I have installed:

$ export | grep MKL
MKLROOT=/opt/intel/compilers_and_libraries_2017.3.191/linux/mkl

and this is my cpu

$ cat /proc/cpuinfo                                                                                                                                                                       
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 61
model name      : Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz
stepping        : 4
microcode       : 0x21
cpu MHz         : 1588.098
cache size      : 3072 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 2
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 20
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap xsaveopt dtherm ida arat pln pts
bugs            :
bogomips        : 4391.13
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 61
model name      : Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz
stepping        : 4
microcode       : 0x21
cpu MHz         : 1306.787
cache size      : 3072 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 2
apicid          : 1
initial apicid  : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 20
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap xsaveopt dtherm ida arat pln pts
bugs            :
bogomips        : 4393.10
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

processor       : 2
vendor_id       : GenuineIntel
cpu family      : 6
model           : 61
model name      : Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz
stepping        : 4
microcode       : 0x21
cpu MHz         : 1130.078
cache size      : 3072 KB
physical id     : 0
siblings        : 4
core id         : 1
cpu cores       : 2
apicid          : 2
initial apicid  : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 20
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap xsaveopt dtherm ida arat pln pts
bugs            :
bogomips        : 4393.54
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

processor       : 3
vendor_id       : GenuineIntel
cpu family      : 6
model           : 61
model name      : Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz
stepping        : 4
microcode       : 0x21
cpu MHz         : 1299.938
cache size      : 3072 KB
physical id     : 0
siblings        : 4
core id         : 1
cpu cores       : 2
apicid          : 3
initial apicid  : 3
fpu             : yes
fpu_exception   : yes
cpuid level     : 20
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap xsaveopt dtherm ida arat pln pts
bugs            :
bogomips        : 4393.44
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:
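
The dump above lists 4 logical processors but only 2 distinct core ids on one physical package. A rough way to count this automatically is sketched below, assuming the Linux /proc/cpuinfo layout shown here, where "physical id" precedes "core id" in each block. The function name count_cores is made up.

```shell
# Count logical processors and distinct (physical id, core id) pairs
# from /proc/cpuinfo-style text on stdin. Hypothetical helper.
count_cores() {
    awk -F: '
        /^processor/   { logical++ }
        /^physical id/ { pkg = $2 }
        /^core id/     { seen[pkg "/" $2] = 1 }
        END {
            n = 0
            for (k in seen) n++
            printf "logical=%d physical=%d\n", logical, n
        }
    '
}

# Usage on Linux:
#   count_cores < /proc/cpuinfo
```

On the i5-5200U output above this should give logical=4 physical=2, which is why 4 threads do not help.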

Ying_H_Intel · ‎05-07-2017

Hi Matthias ,

Thanks for letting us know the CPU type. The Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz has 2 physical cores, so MKL will generally use 2 threads automatically.

How do you build the application?

Could you please remove the call to mkl_set_num_threads(4) from the code, then run the commands below

>export MKL_VERBOSE=1

>time ./your.out

and copy the result here.

Thanks

Ying

[yhu5_new@hsw-ep01 ~]$ export MKL_VERBOSE=1
[yhu5_new@hsw-ep01 ~]$ time ./z_o1.out
MKL_VERBOSE Intel(R) MKL 2017.0 Update 2 Product build 20170126 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.30GHz lp64 intel_thread NMICDev:0
MKL_VERBOSE ZHEEVD(N,L,4000,0x6e1760,4000,0x1ef29960,0x1ef31660,100000,0x1f23ea60,100000,0x1f301f60,100000,0) 22.04s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 WDiv:HOST:+0.000
Info:            0
Lwork:       100000
Liwork:       100000
LRWORK:       100000

real    0m23.803s    -> sequential
user    0m23.396s
sys     0m0.392s

[yhu5_new@hsw-ep01 ~]$ ifort zheevd.f90 -mkl -o z_o1.out

[yhu5_new@hsw-ep01 ~]$ export OMP_NUM_THREADS=2
[yhu5_new@hsw-ep01 ~]$ time ./z_o1.out
MKL_VERBOSE Intel(R) MKL 2017.0 Update 2 Product build 20170126 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.30GHz lp64 intel_thread NMICDev:0
MKL_VERBOSE ZHEEVD(N,L,4000,0x6e1760,4000,0x1ef29960,0x1ef31660,100000,0x1f23ea60,100000,0x1f301f60,100000,0) 9.17s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:2 WDiv:HOST:+0.000
Info:            0
Lwork:       100000
Liwork:       100000
LRWORK:       100000

real    0m10.755s            -----> 2 threads
user    0m19.369s
sys     0m0.426s
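
Instead of exporting the variable between runs, the thread count can also be set for a single invocation. A small sketch, assuming the code no longer calls mkl_set_num_threads (a call in the program takes precedence over the environment). MKL_NUM_THREADS and OMP_NUM_THREADS are the standard variables; the wrapper name run_with_threads is made up.

```shell
# Run a command with the MKL/OpenMP thread count set only for that command.
# run_with_threads is a hypothetical helper name.
run_with_threads() {
    n=$1
    shift
    MKL_NUM_THREADS="$n" OMP_NUM_THREADS="$n" "$@"
}

# Usage:
#   time run_with_threads 1 ./z_o1.out
#   time run_with_threads 2 ./z_o1.out
```

This keeps the surrounding shell environment unchanged, so later runs are not affected by a leftover export.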

Matthias_R_ · ‎05-07-2017

I did what you suggested and this is the result:

matthias@laptop ~                                                                                                                                          [8:19:14] 
> $ ifort test.f90 -mkl:sequential -o seq.x                                                                                                                         
                                                                                                                                                                     
matthias@laptop ~                                                                                                                                          [8:19:20] 
> $ ifort test.f90 -mkl -o para.x                                                                                                                                   
                                                                                                                                                                     
matthias@laptop ~                                                                                                                                          [8:19:29] 
> $ time ./para.x                                                                                                                                                   
MKL_VERBOSE Intel(R) MKL 2017.0 Update 2 Product build 20170126 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.20GHz lp64 intel_thread NMICDev:0
MKL_VERBOSE ZHEEVD(N,L,4000,0x6e1520,4000,0xfb05720,0xfb0d420,100000,0xfc93e20,100000,0xfd57320,100000,0) 14.54s CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:2 WDiv:HOST:+0.000
 Info:            0
 Lwork:       100000
 Liwork:       100000
 LRWORK:       100000
./para.x  29,97s user 0,13s system 191% cpu 15,686 total
                                                                                                                                                                     
matthias@laptop ~                                                                                                                                          [8:19:50] 
> $ time ./seq.x                                                                                                                                                    
MKL_VERBOSE Intel(R) MKL 2017.0 Update 2 Product build 20170126 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.20GHz lp64 sequential
MKL_VERBOSE ZHEEVD(N,L,4000,0x6e0500,4000,0xfb04700,0xfb0c400,100000,0xfc92e00,100000,0xfd56300,100000,0) 19.64s CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
 Info:            0
 Lwork:       100000
 Liwork:       100000
 LRWORK:       100000
./seq.x  20,88s user 0,08s system 99% cpu 20,969 total

Do I need to run this on a system with more than 2 physical cores to see a speedup?
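
To compare runs like the two above, the routine name, the time MKL itself reports, and the thread count can be pulled out of the MKL_VERBOSE lines. This is a sketch based only on the line format visible in this thread; the function name verbose_summary is made up, and the list of time units is an assumption.

```shell
# Print "ROUTINE TIME NTHREADS" for each MKL_VERBOSE call line on stdin.
# The header line ("MKL_VERBOSE Intel(R) MKL ...") is skipped.
verbose_summary() {
    LC_ALL=C awk '
        /^MKL_VERBOSE [A-Z0-9_]+\(/ {
            name = $2
            sub(/\(.*/, "", name)          # strip the argument list
            secs = ""; nthr = ""
            for (i = 3; i <= NF; i++) {
                if ($i ~ /^[0-9.]+(s|ms|us|ns)$/) secs = $i
                if ($i ~ /^NThr:/) { nthr = $i; sub(/^NThr:/, "", nthr) }
            }
            print name, secs, nthr
        }
    '
}

# Usage (assuming the verbose output goes to stdout):
#   MKL_VERBOSE=1 ./para.x | verbose_summary
```

For the two runs above this would give ZHEEVD 14.54s 2 and ZHEEVD 19.64s 1.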

TimP · ‎05-08-2017

If you force MKL to use more than 1 thread per physical core, you should not expect full performance. So you will need more than 2 physical cores to exceed a 2x parallel speedup.

Matthias_R_ · ‎05-08-2017

I removed the

call mkl_set_num_threads(4)

line. In my results above you can see that the parallel version uses 2 cores (191%), which I confirmed in htop. I would expect the parallel version to be roughly twice as fast (or at least somewhat faster) on 2 cores. This is not the case. The sequential version is faster, as you can see above.

Ying_H_Intel · ‎05-09-2017

Hi,

From the MKL_VERBOSE output, the parallel version (NThr = 2) is actually faster than the sequential version: 14.54s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:2 vs. 19.64s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1

Which OS are you using? Why does your time output look like ./seq.x 20,88s user 0,08s system 99% cpu 20,969 total

It does not look like the usual format:

real    0m10.755s            -----> 2 threads
user    0m19.369s
sys     0m0.426s

It seems that the 20.88s (sequential) and 29.97s (2 threads) are the user times in your case.

So 29.97s is actually the time on core 1 plus the time on core 2; the more cores are used, the larger this value gets. It is not the real (wall-clock) time.

Best Regards,

Ying
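
To make the arithmetic concrete, here is a small sketch that computes the wall-clock speedup and the average number of busy cores from the numbers in the time output quoted above (decimal commas written as dots). The function names speedup and busy_cores are made up for illustration.

```shell
# speedup SEQ_WALL PAR_WALL: ratio of wall-clock ("total"/"real") times.
speedup() {
    LC_ALL=C awk -v s="$1" -v p="$2" 'BEGIN { printf "%.2f\n", s / p }'
}

# busy_cores WALL USER SYS: CPU time summed over all cores, divided by
# wall-clock time, i.e. the average number of cores kept busy.
busy_cores() {
    LC_ALL=C awk -v w="$1" -v u="$2" -v y="$3" 'BEGIN { printf "%.2f\n", (u + y) / w }'
}

speedup 20.969 15.686          # seq total / para total  -> 1.34
busy_cores 15.686 29.97 0.13   # para.x                  -> 1.92
busy_cores 20.969 20.88 0.08   # seq.x                   -> 1.00
```

So the 2-thread run is about 1.34x faster in wall-clock time while keeping roughly 1.9 cores busy; the larger user time only reflects that two cores were working at once.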

Matthias_R_ · ‎05-09-2017

You are right. It seems I got confused by Arch's weird time command. If I use the real one I get:

matthias@laptop ~                                                                 [18:50:45] 
> $ /usr/bin/time -p ./seq.x                                                                
 Walltime:    20.5699980000000     
 Info:            0
 Lwork:       100000
 Liwork:       100000
 LRWORK:       100000
real 21.11
user 21.02
sys 0.08
                                                                                             
matthias@laptop ~                                                                 [18:53:22] 
> $ /usr/bin/time -p ./para.x                                                               
 Walltime:    30.7099960000000     
 Info:            0
 Lwork:       100000
 Liwork:       100000
 LRWORK:       100000
real 16.07
user 31.19
sys 0.12