I wrote a piece of code (below) to test the speed of the zheevd function, then used the Intel MKL Link Line Advisor to build it against the different threading layers. These are the times I measured:
sequential: 20.15s
TBB: 30.64s
OpenMP: 30.5s
Why is the sequential code faster than either of the parallel versions? Is it possible to speed up zheevd through parallelization?
program STB
  ! use mkl_service
  implicit none
  call test_herm()
contains
  subroutine test_herm()
    implicit none
    integer(4), parameter :: N = 4000, LDA = N, LWMAX = 100000
    integer(4) :: info, LWORK, LIWORK, LRWORK, i, j
    real(8) :: r, c
    integer(4), dimension(LWMAX) :: IWORK
    real(8), dimension(N) :: W
    real(8), dimension(LWMAX) :: RWORK
    complex(8), dimension(LDA, N) :: A    ! double complex, the type ZHEEVD expects
    complex(8), dimension(LWMAX) :: WORK

    call mkl_set_num_threads(4)
    call random_seed()
    do i = 1, N
      do j = 1, i-1
        call random_number(r)
        call random_number(c)
        A(i,j) = cmplx(r, c, kind=8)
        A(j,i) = conjg(A(i,j))
      enddo
    enddo
    do i = 1, N
      call random_number(r)
      A(i,i) = cmplx(r, 0.0d0, kind=8)
    enddo
    LWORK  = LWMAX
    LIWORK = LWMAX
    LRWORK = LWMAX
    call ZHEEVD('N', 'Lower', N, A, LDA, W, WORK, LWORK, RWORK, &
                LRWORK, IWORK, LIWORK, info)
    write (*,*) "Info: ", info
    write (*,*) "Lwork: ", LWORK
    write (*,*) "Liwork: ", LIWORK
    write (*,*) "LRWORK: ", LRWORK
    ! write (*,*) W
  end subroutine test_herm
end program STB
Hi Matthias,
Could you please tell us which CPU and MKL version you are using?
I checked on one Xeon machine: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz, 2 processors, 18x2 = 36 physical cores, with AVX2 support, running MKL 2017 Update 3.
Please note the call mkl_set_num_threads(4) in the code.
Test 1: OpenMP with 4 threads
ifort zheevd.f90 -mkl
time ./a.out
real 0m6.386s
user 0m21.271s
sys 0m0.445s
Test 2: Sequential
[yhu5_new@hsw-ep01 ~]$ ifort zheevd.f90 -mkl:sequential -o z_s.out
[yhu5_new@hsw-ep01 ~]$ time ./z_s.out
Info: 0
Lwork: 100000
Liwork: 100000
LRWORK: 100000
real 0m23.494s
user 0m23.174s
sys 0m0.273s
Test 3: TBB
[yhu5_new@hsw-ep01 ~]$ ifort zheevd.f90 -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_lp64.a ${MKLROOT}/lib/intel64/libmkl_tbb_thread.a ${MKLROOT}/lib/intel64/libmkl_core.a -Wl,--end-group -ltbb -lstdc++ -lpthread -lm -ldl -o z_b.out
[yhu5_new@hsw-ep01 ~]$ time ./z_b.out
Info: 0
Lwork: 100000
Liwork: 100000
LRWORK: 100000
real 0m4.992s
user 1m47.180s
sys 0m9.120s
Test 4: OpenMP with the call mkl_set_num_threads(4) removed, so MKL chooses the thread count itself.
[yhu5_new@hsw-ep01 ~]$ ifort zheevd.f90 -mkl -o z_o.out
[yhu5_new@hsw-ep01 ~]$ time ./z_o.out
Info: 0
Lwork: 100000
Liwork: 100000
LRWORK: 100000
real 0m4.305s
user 1m45.560s
sys 0m1.946s
From these tests, the parallel results are clearly better than the sequential one.
Best Regards,
Ying
I used ldd to verify which libraries are linked (shown here for the OpenMP build; the same check works for the TBB and sequential builds):
[yhu5_new@hsw-ep01 ~]$ ldd z_o.out
linux-vdso.so.1 => (0x00007ffd7970b000)
libmkl_intel_lp64.so => /opt/intel/compilers_and_libraries_2017.3.191/linux/mkl/lib/intel64_lin/libmkl_intel_lp64.so (0x00007fcf318fa000)
libmkl_intel_thread.so => /opt/intel/compilers_and_libraries_2017.3.191/linux/mkl/lib/intel64_lin/libmkl_intel_thread.so (0x00007fcf2fe68000)
libmkl_core.so => /opt/intel/compilers_and_libraries_2017.3.191/linux/mkl/lib/intel64_lin/libmkl_core.so (0x00007fcf2e375000)
libiomp5.so => /opt/intel/compilers_and_libraries_2017.3.191/linux/compiler/lib/intel64/libiomp5.so (0x00007fcf2dfd2000)
libm.so.6 => /lib64/libm.so.6 (0x00007fcf2dcaa000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fcf2da8e000)
libc.so.6 => /lib64/libc.so.6 (0x00007fcf2d6cc000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fcf2d4b6000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007fcf2d2b2000)
/lib64/ld-linux-x86-64.so.2 (0x00007fcf3233c000)
This is the version I have installed:
$ export | grep MKL
MKLROOT=/opt/intel/compilers_and_libraries_2017.3.191/linux/mkl
and this is my CPU:
$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 61
model name      : Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz
stepping        : 4
cpu MHz         : 1588.098
cache size      : 3072 KB
siblings        : 4
cpu cores       : 2
flags           : ... fma avx avx2 sse4_1 sse4_2 ht ...
(processors 1-3 repeat the same data: one 2-core, 4-thread i5-5200U)
Hi Matthias,
Thanks for letting us know the CPU type. The Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz is a 2-physical-core machine, so MKL will generally use 2 threads automatically.
How do you build the application?
Could you please remove the call mkl_set_num_threads(4) from the code, then run the commands below
>export MKL_VERBOSE=1
>time ./your.out
and copy the result here.
Thanks
Ying
[yhu5_new@hsw-ep01 ~]$ export MKL_VERBOSE=1
[yhu5_new@hsw-ep01 ~]$ time ./z_o1.out
MKL_VERBOSE Intel(R) MKL 2017.0 Update 2 Product build 20170126 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.30GHz lp64 intel_thread NMICDev:0
MKL_VERBOSE ZHEEVD(N,L,4000,0x6e1760,4000,0x1ef29960,0x1ef31660,100000,0x1f23ea60,100000,0x1f301f60,100000,0) 22.04s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 WDiv:HOST:+0.000
Info: 0
Lwork: 100000
Liwork: 100000
LRWORK: 100000
real 0m23.803s -> sequential
user 0m23.396s
sys 0m0.392s
[yhu5_new@hsw-ep01 ~]$ ifort zheevd.f90 -mkl -o z_o1.out
[yhu5_new@hsw-ep01 ~]$ export OMP_NUM_THREADS=2
[yhu5_new@hsw-ep01 ~]$ time ./z_o1.out
MKL_VERBOSE Intel(R) MKL 2017.0 Update 2 Product build 20170126 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.30GHz lp64 intel_thread NMICDev:0
MKL_VERBOSE ZHEEVD(N,L,4000,0x6e1760,4000,0x1ef29960,0x1ef31660,100000,0x1f23ea60,100000,0x1f301f60,100000,0) 9.17s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:2 WDiv:HOST:+0.000
Info: 0
Lwork: 100000
Liwork: 100000
LRWORK: 100000
real 0m10.755s -----> 2 threads
user 0m19.369s
sys 0m0.426s
I did what you suggested and this is the result:
matthias@laptop ~ [8:19:14]
> $ ifort test.f90 -mkl:sequential -o seq.x
matthias@laptop ~ [8:19:20]
> $ ifort test.f90 -mkl -o para.x
matthias@laptop ~ [8:19:29]
> $ time ./para.x
MKL_VERBOSE Intel(R) MKL 2017.0 Update 2 Product build 20170126 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.20GHz lp64 intel_thread NMICDev:0
MKL_VERBOSE ZHEEVD(N,L,4000,0x6e1520,4000,0xfb05720,0xfb0d420,100000,0xfc93e20,100000,0xfd57320,100000,0) 14.54s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:2 WDiv:HOST:+0.000
Info: 0
Lwork: 100000
Liwork: 100000
LRWORK: 100000
./para.x 29,97s user 0,13s system 191% cpu 15,686 total
matthias@laptop ~ [8:19:50]
> $ time ./seq.x
MKL_VERBOSE Intel(R) MKL 2017.0 Update 2 Product build 20170126 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.20GHz lp64 sequential
MKL_VERBOSE ZHEEVD(N,L,4000,0x6e0500,4000,0xfb04700,0xfb0c400,100000,0xfc92e00,100000,0xfd56300,100000,0) 19.64s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
Info: 0
Lwork: 100000
Liwork: 100000
LRWORK: 100000
./seq.x 20,88s user 0,08s system 99% cpu 20,969 total
Do I need to run this on a system with more than 2 physical cores to see a speedup?
If you force MKL to use more than one thread per physical core, you should not expect full performance. So you will need more than 2 physical cores to exceed a 2x parallel speedup.
I removed the
call mkl_set_num_threads(4)
line. In my results above you can see that the parallel version uses 2 cores (191% CPU), which I confirmed in htop. I would expect the parallel version to be roughly twice as fast (or at least faster at all) on 2 cores. This is not the case: the sequential version is faster, as you can see above.
Hi,
From the MKL_VERBOSE output, the parallel version (NThr = 2) is actually faster than the sequential one: 14.54s at CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:2, vs. 19.64s at CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1.
Which OS are you using? Your timing output looks like
./seq.x 20,88s user 0,08s system 99% cpu 20,969 total
which does not report the real (wall-clock) time the way the standard time command does:
real 0m10.755s -----> 2 threads
user 0m19.369s
sys 0m0.426s
It seems the 20.88s (sequential) and 29.97s (2 threads) figures are user time in your case. For the parallel run, the 29.97s is actually the CPU time of core 1 + core 2 added together; the more cores you use, the bigger this value gets. It is not the real time (wall time).
Best Regards,
Ying
You are right. It seems I got confused by my shell's time built-in on Arch. If I use the real /usr/bin/time I get:
matthias@laptop ~ [18:50:45]
> $ /usr/bin/time -p ./seq.x
Walltime: 20.5699980000000
Info: 0
Lwork: 100000
Liwork: 100000
LRWORK: 100000
real 21.11
user 21.02
sys 0.08
matthias@laptop ~ [18:53:22]
> $ /usr/bin/time -p ./para.x
Walltime: 30.7099960000000
Info: 0
Lwork: 100000
Liwork: 100000
LRWORK: 100000
real 16.07
user 31.19
sys 0.12
