I was attempting to optimize some code for Nehalem/Westmere/SandyBridge Xeons and was surprised to find that the vector code was slower than the scalar code. So I came up with a small serial test code to compare the performance of scalar versus vector code, and on all of the above Xeons the vector code generally performed worse unless math functions are involved. I'm guessing this is the memory wall, since the vector math-function loops (which have many more floating-point operations per memory reference) run around twice as fast as the scalar versions, as we might expect.
So I'm wondering: is this caused by the vector memory references NOT going through the cache, while the scalar memory references do? If that is the case, is there a compiler option or directive that lets me specify that the vector loads and stores should go through the cache? If that is not the case, could I get an explanation for the vector slowdown?
Thanks.
Details follow:
The code is listed below (not meant for redistribution, it's just a quick test code), and the options used to compile are:
ifort -g test.f90 -openmp -o omp_alloc \
-mcmodel=large -O2 -vec-report3 -opt-report 2 -opt-report-file=opt.rpt.2
Here are the results of running the vector/scalar code on a 2.6 GHz SandyBridge node:
Number of processors is 16
Number of threads requested = 16
tick= 1.000000000000000E-006 time= 2.017974853515625E-003
TEST02
Time vectorized and scalar operations:
Data vectors will be of minimum size 256
Data vectors will be of maximum size 262144
Number of repetitions of the operation: 3
y(1:n) = x(1:n) + wtime1*z(1:n) + wtime1*p(1:n)
y(4)= 1398173102.88449
Timing results:
Vector Size TVec#1 TVec#2 TVec#3 TSca#1 TSca#2 TSca#3 AVGRatio
256 0.000001 0.000000 0.000000 0.000001 0.000000 0.000000 1.2500
512 0.000001 0.000000 0.000001 0.000001 0.000001 0.000000 1.1250
1024 0.000003 0.000002 0.000001 0.000003 0.000001 0.000001 0.8077
2048 0.000008 0.000005 0.000005 0.000006 0.000003 0.000003 0.6711
4096 0.000017 0.000009 0.000010 0.000011 0.000006 0.000006 0.6400
8192 0.000037 0.000018 0.000018 0.000042 0.000011 0.000011 0.8762
16384 0.000078 0.000043 0.000044 0.000065 0.000035 0.000034 0.8133
32768 0.000154 0.000084 0.000085 0.000123 0.000061 0.000062 0.7622
65536 0.000304 0.000164 0.000164 0.000235 0.000114 0.000113 0.7313
131072 0.000601 0.000322 0.000325 0.000217 0.000211 0.000218 0.5178
262144 0.001197 0.000638 0.000642 0.000424 0.000424 0.000426 0.5142
y(1:n) = PI * x(1:n)
y(4)= 2.51267519336088
Timing results:
Vector Size TVec#1 TVec#2 TVec#3 TSca#1 TSca#2 TSca#3 AVGRatio
256 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 NaN
512 0.000000 0.000001 0.000001 0.000001 0.000000 0.000001 1.0000
1024 0.000001 0.000002 0.000002 0.000002 0.000001 0.000001 0.8500
2048 0.000004 0.000004 0.000004 0.000001 0.000002 0.000002 0.3922
4096 0.000008 0.000008 0.000008 0.000003 0.000004 0.000004 0.4554
8192 0.000016 0.000016 0.000016 0.000008 0.000007 0.000008 0.4776
16384 0.000047 0.000041 0.000047 0.000025 0.000025 0.000026 0.5618
32768 0.000078 0.000074 0.000079 0.000042 0.000042 0.000042 0.5443
65536 0.000144 0.000146 0.000145 0.000073 0.000075 0.000075 0.5121
131072 0.000280 0.000270 0.000276 0.000139 0.000140 0.000140 0.5071
262144 0.000538 0.000539 0.000540 0.000269 0.000274 0.000270 0.5029
y(1:n) = sqrt ( x(1:n) )
y(4)= 0.644873294001390
Timing results:
Vector Size TVec#1 TVec#2 TVec#3 TSca#1 TSca#2 TSca#3 AVGRatio
256 0.000002 0.000001 0.000001 0.000002 0.000002 0.000002 1.5294
512 0.000002 0.000002 0.000002 0.000004 0.000004 0.000004 1.9615
1024 0.000004 0.000004 0.000004 0.000009 0.000009 0.000008 2.1800
2048 0.000008 0.000009 0.000009 0.000016 0.000016 0.000016 1.8611
4096 0.000017 0.000017 0.000017 0.000033 0.000033 0.000033 1.9256
8192 0.000033 0.000033 0.000035 0.000066 0.000069 0.000066 1.9976
16384 0.000069 0.000069 0.000069 0.000138 0.000138 0.000138 1.9988
32768 0.000137 0.000135 0.000135 0.000273 0.000273 0.000271 2.0064
65536 0.000267 0.000267 0.000267 0.000535 0.000536 0.000538 2.0089
131072 0.000534 0.000534 0.000534 0.001066 0.001067 0.001074 2.0022
262144 0.001063 0.001063 0.001063 0.002127 0.002127 0.002127 2.0011
y(1:n) = exp ( x(1:n) )
y(4)= 1.74962731479888
Timing results:
Vector Size TVec#1 TVec#2 TVec#3 TSca#1 TSca#2 TSca#3 AVGRatio
256 0.000001 0.000001 0.000001 0.000002 0.000003 0.000002 2.5000
512 0.000002 0.000002 0.000002 0.000005 0.000004 0.000004 2.2500
1024 0.000004 0.000004 0.000004 0.000009 0.000010 0.000009 2.3600
2048 0.000007 0.000007 0.000007 0.000018 0.000018 0.000019 2.6136
4096 0.000015 0.000015 0.000014 0.000036 0.000036 0.000036 2.4620
8192 0.000030 0.000030 0.000030 0.000073 0.000073 0.000074 2.4548
16384 0.000059 0.000059 0.000059 0.000146 0.000146 0.000147 2.4778
32768 0.000118 0.000117 0.000117 0.000293 0.000295 0.000293 2.5051
65536 0.000240 0.000238 0.000235 0.000593 0.000586 0.000597 2.4913
131072 0.000475 0.000473 0.000470 0.001183 0.001177 0.001173 2.4913
262144 0.000945 0.000942 0.000941 0.002356 0.002348 0.002353 2.4953
Here is the test.f90 code:
Program test
use omp_lib
integer proc_num
integer thread_num
real*8, dimension(:,:), allocatable, target :: x, db
real*8, dimension(:,:,:), allocatable, target :: p
integer*4, dimension(:,:), allocatable, target :: netlist
!double precision function omp_get_wtick(), omp_get_wtime()
double precision t1, t2, tick
t1 = omp_get_wtime()
tick = omp_get_wtick()
proc_num = omp_get_num_procs ( )
thread_num = proc_num
call omp_set_num_threads ( thread_num )
write ( *, '(a)' ) ' '
write ( *, '(a,i8)' ) ' Number of processors is ', proc_num
write ( *, '(a,i8)' ) ' Number of threads requested = ', thread_num
t2 = omp_get_wtime()
print*, "tick=",tick," time=",t2-t1
call test02 ( )
! call test03 ( )
end
subroutine test02 ( )
!*****************************************************************************80
!
!! TEST02 times the vectorized EXP routine.
!
! Licensing:
!
! This code is distributed under the GNU LGPL license.
!
! Modified:
!
! 10 July 2008
!
! Author:
!
! John Burkardt
!
use omp_lib
integer ( kind = 4 ), parameter :: n_log_min = 8
integer ( kind = 4 ), parameter :: n_log_max = 18
integer ( kind = 4 ), parameter :: n_min = 2**n_log_min
integer ( kind = 4 ), parameter :: n_max = 2**n_log_max
integer ( kind = 4 ), parameter :: n_rep = 3
! The repetition loop below runs i_rep = 0, n_rep (repetition 0 is a warm-up whose
! timings are never printed), so the last dimension must start at 0 to stay in bounds.
real ( kind = 8 ) delta(3,n_log_max,0:n_rep)
integer ( kind = 4 ) func
integer ( kind = 4 ) i_rep
integer ( kind = 4 ) n
integer ( kind = 4 ) n_log
real ( kind = 8 ), parameter :: pi = 3.141592653589793D+00
real ( kind = 8 ) wtime1
real ( kind = 8 ) wtime2
real ( kind = 8 ) x(n_max), z(n_max), p(n_max)
real ( kind = 8 ) y(n_max)
write ( *, '(a)' ) ' '
write ( *, '(a)' ) 'TEST02'
write ( *, '(a)' ) ' Time vectorized and scalar operations:'
write ( *, '(a)' ) ' '
! write ( *, '(a)' ) ' y(1:n) = x(1:n) '
! write ( *, '(a)' ) ' y(1:n) = PI * x(1:n) '
! write ( *, '(a)' ) ' y(1:n) = sqrt ( x(1:n) )'
! write ( *, '(a)' ) ' y(1:n) = exp ( x(1:n) )'
! write ( *, '(a)' ) ' '
write ( *, '(a,i12)' ) ' Data vectors will be of minimum size ', n_min
write ( *, '(a,i12)' ) ' Data vectors will be of maximum size ', n_max
write ( *, '(a,i12)' ) ' Number of repetitions of the operation: ', n_rep
do func = 1, 4
write ( *, '(a)' ) ' '
if ( func == 1 ) then
write ( *, '(a)' ) ' y(1:n) = x(1:n) + wtime1*z(1:n) + wtime1*p(1:n)'
else if ( func == 2 ) then
write ( *, '(a)' ) ' y(1:n) = PI * x(1:n) '
else if ( func == 3 ) then
write ( *, '(a)' ) ' y(1:n) = sqrt ( x(1:n) )'
else if ( func == 4 ) then
write ( *, '(a)' ) ' y(1:n) = exp ( x(1:n) )'
end if
do i_rep = 0, n_rep
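! First pass over all sizes: time the loops as auto-vectorized by the compiler.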
do n_log = n_log_min, n_log_max
n = 2**( n_log )
call random_number ( harvest = x(1:n) )
call random_number ( harvest = z(1:n) )
call random_number ( harvest = p(1:n) )
wtime1 = omp_get_wtime ( )
if ( func == 1 ) then
! y(1:n) = x(1:n) + wtime1*z(1:n) + wtime1*p(1:n)
do i = 1, n
y(i) = x(i) + wtime1*z(i) + wtime1*p(i)
end do
else if ( func == 2 ) then
! y(1:n) = pi * x(1:n)
do i = 1, n
y(i) = pi * x(i)
end do
else if ( func == 3 ) then
! y(1:n) = sqrt ( x(1:n) )
do i = 1, n
y(i) = sqrt ( x(i) )
end do
else if ( func == 4 ) then
! y(1:n) = exp ( x(1:n) )
do i = 1, n
y(i) = exp ( x(i) )
end do
end if
wtime2 = omp_get_wtime ( )
delta(1,n_log,i_rep) = wtime2 - wtime1
end do
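! Second pass over all sizes: time the same loops with vectorization disabled via !DIR$ NOVECTOR.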
do n_log = n_log_min, n_log_max
n = 2**( n_log )
call random_number ( harvest = x(1:n) )
call random_number ( harvest = z(1:n) )
call random_number ( harvest = p(1:n) )
wtime1 = omp_get_wtime ( )
if ( func == 1 ) then
!DIR$ NOVECTOR
do i = 1, n
y(i) = x(i) + wtime1*z(i) + wtime1*p(i)
end do
else if ( func == 2 ) then
!DIR$ NOVECTOR
do i = 1, n
y(i) = pi * x(i)
end do
else if ( func == 3 ) then
!DIR$ NOVECTOR
do i = 1, n
y(i) = sqrt ( x(i) )
end do
else if ( func == 4 ) then
!DIR$ NOVECTOR
do i = 1, n
y(i) = exp ( x(i) )
end do
end if
wtime2 = omp_get_wtime ( )
delta(2,n_log,i_rep) = wtime2 - wtime1
end do
end do
! The following statement prevents the compiler from optimizing away the scalar operations:
print*, "y(4)=",y(4)
write ( *, '(a)' ) ' '
write ( *, '(a)' ) ' Timing results:'
write ( *, '(a)' ) ' '
write ( *, '(a)' ) ' Vector Size TVec#1 TVec#2 ' &
// 'TVec#3 TSca#1 TSca#2 TSca#3 AVGRatio'
write ( *, '(a)' ) ' '
do n_log = n_log_min, n_log_max
n = 2**( n_log )
write ( *, '(i10,3f9.6,3f9.6,1f9.4)' ) n, &
delta(1,n_log,1:n_rep), delta(2,n_log,1:n_rep), &
SUM(delta(2,n_log,1:n_rep))/SUM(delta(1,n_log,1:n_rep))
end do
end do
return
end
Here is the start of the /proc/cpuinfo output:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 45
model name : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
stepping : 7
cpu MHz : 2600.084
cache size : 20480 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 8
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 x2apic popcnt aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
bogomips : 5200.16
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
And OS info:
> uname -a
Linux node1141 2.6.32-220.23.1.el6.x86_64 #1 SMP Mon Jun 18 18:58:52 BST 2012 x86_64 x86_64 x86_64 GNU/Linux
Here the compiler generated streaming stores for the vectorized loops. Using the option "-opt-streaming-stores never" will disable them and make the vectorized version faster than the scalar version.
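For example, simply appending the flag to the compile line from the original post should be enough to test this:
ifort -g test.f90 -openmp -o omp_alloc \
-mcmodel=large -O2 -vec-report3 -opt-report 2 -opt-report-file=opt.rpt.2 \
-opt-streaming-stores never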
Set up your timing loop repetition count such that the run time between the omp_get_wtime() calls is large enough that any system overhead becomes negligible. A run time of about 1 second between the omp_get_wtime() calls will provide reasonable statistics. You can then divide the elapsed time for each test by the number of repetitions to get the per-iteration time for that test.
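A minimal sketch of that change, applied to the func == 1 vector loop in test02 (rep_count and the loop index k are new names introduced here; pick rep_count so the timed block runs on the order of a second):
wtime1 = omp_get_wtime ( )
do k = 1, rep_count
! note: extra care (e.g. touching y between passes) may be needed so the
! compiler does not optimize away the repeated, identical passes
do i = 1, n
y(i) = x(i) + wtime1*z(i) + wtime1*p(i)
end do
end do
wtime2 = omp_get_wtime ( )
! report the average time of one pass over the data
delta(1,n_log,i_rep) = ( wtime2 - wtime1 ) / rep_count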
Jim Dempsey
In order for streaming stores to pay off, you must either make the arrays large enough to consume several times the last-level cache, or make the individual sections of the benchmark write into distinct arrays.
Some of these issues have been discussed at length on these forums in connection with the McCalpin STREAM benchmark, which has buried in its rules that you must make the tests bigger if you are observing cache re-use effects.
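As a rough sizing check against the cpuinfo above (assuming the four real*8 arrays x, y, z and p are the only large data): at the largest size of 262144 elements, each array occupies 262144 * 8 bytes = 2 MB, so all four together are about 8 MB, well under the 20480 KB last-level cache. To exceed that cache several times over, n_log_max would need to be raised to around 23 (2**23 * 8 bytes = 64 MB per array, roughly 256 MB in total), or the benchmark would have to cycle through separate arrays.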
