I was attempting to optimize some code for Nehalem/Westmere/SandyBridge Xeons and was surprised to find that the vector code was slower than the scalar code. So I came up with a small serial test code to compare the performance of scalar versus vector code, and on all of the above Xeons the vector code generally performed worse unless math functions are involved. I'm guessing this is the memory wall, since the vector math-function loops (which have many more floating-point operations per memory reference) run around twice as fast as the scalar versions, as we might expect.
So I'm wondering: is this caused by the vector memory references NOT going through the cache, while the scalar memory references do? If that is the case, is there a compiler option or directive that lets me specify that the vector loads and stores should go through the cache? If that is not the case, could I get an explanation for the vector slowdown?
Thanks.
Details follow:
The code is listed below (not meant for redistribution, it's just a quick test code), and the options used to compile are:
ifort -g test.f90 -openmp -o omp_alloc \
-mcmodel=large -O2 -vec-report3 -opt-report 2 -opt-report-file=opt.rpt.2
Here are the results of running the vector/scalar code on a 2.6 GHz SandyBridge node:
Number of processors is 16
Number of threads requested = 16
tick= 1.000000000000000E-006 time= 2.017974853515625E-003
TEST02
Time vectorized and scalar operations:
Data vectors will be of minimum size 256
Data vectors will be of maximum size 262144
Number of repetitions of the operation: 3
y(1:n) = x(1:n) + wtime1*z(1:n) + wtime1*p(1:n)
y(4)= 1398173102.88449
Timing results:
Vector Size TVec#1 TVec#2 TVec#3 TSca#1 TSca#2 TSca#3 AVGRatio
256 0.000001 0.000000 0.000000 0.000001 0.000000 0.000000 1.2500
512 0.000001 0.000000 0.000001 0.000001 0.000001 0.000000 1.1250
1024 0.000003 0.000002 0.000001 0.000003 0.000001 0.000001 0.8077
2048 0.000008 0.000005 0.000005 0.000006 0.000003 0.000003 0.6711
4096 0.000017 0.000009 0.000010 0.000011 0.000006 0.000006 0.6400
8192 0.000037 0.000018 0.000018 0.000042 0.000011 0.000011 0.8762
16384 0.000078 0.000043 0.000044 0.000065 0.000035 0.000034 0.8133
32768 0.000154 0.000084 0.000085 0.000123 0.000061 0.000062 0.7622
65536 0.000304 0.000164 0.000164 0.000235 0.000114 0.000113 0.7313
131072 0.000601 0.000322 0.000325 0.000217 0.000211 0.000218 0.5178
262144 0.001197 0.000638 0.000642 0.000424 0.000424 0.000426 0.5142
y(1:n) = PI * x(1:n)
y(4)= 2.51267519336088
Timing results:
Vector Size TVec#1 TVec#2 TVec#3 TSca#1 TSca#2 TSca#3 AVGRatio
256 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 NaN
512 0.000000 0.000001 0.000001 0.000001 0.000000 0.000001 1.0000
1024 0.000001 0.000002 0.000002 0.000002 0.000001 0.000001 0.8500
2048 0.000004 0.000004 0.000004 0.000001 0.000002 0.000002 0.3922
4096 0.000008 0.000008 0.000008 0.000003 0.000004 0.000004 0.4554
8192 0.000016 0.000016 0.000016 0.000008 0.000007 0.000008 0.4776
16384 0.000047 0.000041 0.000047 0.000025 0.000025 0.000026 0.5618
32768 0.000078 0.000074 0.000079 0.000042 0.000042 0.000042 0.5443
65536 0.000144 0.000146 0.000145 0.000073 0.000075 0.000075 0.5121
131072 0.000280 0.000270 0.000276 0.000139 0.000140 0.000140 0.5071
262144 0.000538 0.000539 0.000540 0.000269 0.000274 0.000270 0.5029
y(1:n) = sqrt ( x(1:n) )
y(4)= 0.644873294001390
Timing results:
Vector Size TVec#1 TVec#2 TVec#3 TSca#1 TSca#2 TSca#3 AVGRatio
256 0.000002 0.000001 0.000001 0.000002 0.000002 0.000002 1.5294
512 0.000002 0.000002 0.000002 0.000004 0.000004 0.000004 1.9615
1024 0.000004 0.000004 0.000004 0.000009 0.000009 0.000008 2.1800
2048 0.000008 0.000009 0.000009 0.000016 0.000016 0.000016 1.8611
4096 0.000017 0.000017 0.000017 0.000033 0.000033 0.000033 1.9256
8192 0.000033 0.000033 0.000035 0.000066 0.000069 0.000066 1.9976
16384 0.000069 0.000069 0.000069 0.000138 0.000138 0.000138 1.9988
32768 0.000137 0.000135 0.000135 0.000273 0.000273 0.000271 2.0064
65536 0.000267 0.000267 0.000267 0.000535 0.000536 0.000538 2.0089
131072 0.000534 0.000534 0.000534 0.001066 0.001067 0.001074 2.0022
262144 0.001063 0.001063 0.001063 0.002127 0.002127 0.002127 2.0011
y(1:n) = exp ( x(1:n) )
y(4)= 1.74962731479888
Timing results:
Vector Size TVec#1 TVec#2 TVec#3 TSca#1 TSca#2 TSca#3 AVGRatio
256 0.000001 0.000001 0.000001 0.000002 0.000003 0.000002 2.5000
512 0.000002 0.000002 0.000002 0.000005 0.000004 0.000004 2.2500
1024 0.000004 0.000004 0.000004 0.000009 0.000010 0.000009 2.3600
2048 0.000007 0.000007 0.000007 0.000018 0.000018 0.000019 2.6136
4096 0.000015 0.000015 0.000014 0.000036 0.000036 0.000036 2.4620
8192 0.000030 0.000030 0.000030 0.000073 0.000073 0.000074 2.4548
16384 0.000059 0.000059 0.000059 0.000146 0.000146 0.000147 2.4778
32768 0.000118 0.000117 0.000117 0.000293 0.000295 0.000293 2.5051
65536 0.000240 0.000238 0.000235 0.000593 0.000586 0.000597 2.4913
131072 0.000475 0.000473 0.000470 0.001183 0.001177 0.001173 2.4913
262144 0.000945 0.000942 0.000941 0.002356 0.002348 0.002353 2.4953
Here is the test.f90 code:
Program test
use omp_lib
integer proc_num
integer thread_num
real*8, dimension(:,:), allocatable, target :: x, db
real*8, dimension(:,:,:), allocatable, target :: p
integer*4, dimension(:,:), allocatable, target :: netlist
!double precision function omp_get_wtick(), omp_get_wtime()
double precision t1, t2, tick
t1 = omp_get_wtime()
tick = omp_get_wtick()
proc_num = omp_get_num_procs ( )
thread_num = proc_num
call omp_set_num_threads ( thread_num )
write ( *, '(a)' ) ' '
write ( *, '(a,i8)' ) ' Number of processors is ', proc_num
write ( *, '(a,i8)' ) ' Number of threads requested = ', thread_num
t2 = omp_get_wtime()
print*, "tick=",tick," time=",t2-t1
call test02 ( )
! call test03 ( )
end
subroutine test02 ( )
!*****************************************************************************80
!
!! TEST02 times the vectorized EXP routine.
!
! Licensing:
!
! This code is distributed under the GNU LGPL license.
!
! Modified:
!
! 10 July 2008
!
! Author:
!
! John Burkardt
!
use omp_lib
integer ( kind = 4 ), parameter :: n_log_min = 8
integer ( kind = 4 ), parameter :: n_log_max = 18
integer ( kind = 4 ), parameter :: n_min = 2**n_log_min
integer ( kind = 4 ), parameter :: n_max = 2**n_log_max
integer ( kind = 4 ), parameter :: n_rep = 3
! The repetition loop below runs i_rep = 0, n_rep (repetition 0 is a warm-up whose
! timings are never printed), so the last dimension must start at 0 to stay in bounds.
real ( kind = 8 ) delta(3,n_log_max,0:n_rep)
integer ( kind = 4 ) func
integer ( kind = 4 ) i_rep
integer ( kind = 4 ) n
integer ( kind = 4 ) n_log
real ( kind = 8 ), parameter :: pi = 3.141592653589793D+00
real ( kind = 8 ) wtime1
real ( kind = 8 ) wtime2
real ( kind = 8 ) x(n_max), z(n_max), p(n_max)
real ( kind = 8 ) y(n_max)
write ( *, '(a)' ) ' '
write ( *, '(a)' ) 'TEST02'
write ( *, '(a)' ) ' Time vectorized and scalar operations:'
write ( *, '(a)' ) ' '
! write ( *, '(a)' ) ' y(1:n) = x(1:n) '
! write ( *, '(a)' ) ' y(1:n) = PI * x(1:n) '
! write ( *, '(a)' ) ' y(1:n) = sqrt ( x(1:n) )'
! write ( *, '(a)' ) ' y(1:n) = exp ( x(1:n) )'
! write ( *, '(a)' ) ' '
write ( *, '(a,i12)' ) ' Data vectors will be of minimum size ', n_min
write ( *, '(a,i12)' ) ' Data vectors will be of maximum size ', n_max
write ( *, '(a,i12)' ) ' Number of repetitions of the operation: ', n_rep
do func = 1, 4
write ( *, '(a)' ) ' '
if ( func == 1 ) then
write ( *, '(a)' ) ' y(1:n) = x(1:n) + wtime1*z(1:n) + wtime1*p(1:n)'
else if ( func == 2 ) then
write ( *, '(a)' ) ' y(1:n) = PI * x(1:n) '
else if ( func == 3 ) then
write ( *, '(a)' ) ' y(1:n) = sqrt ( x(1:n) )'
else if ( func == 4 ) then
write ( *, '(a)' ) ' y(1:n) = exp ( x(1:n) )'
end if
do i_rep = 0, n_rep
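! First pass over all sizes: time the loops as auto-vectorized by the compiler.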
do n_log = n_log_min, n_log_max
n = 2**( n_log )
call random_number ( harvest = x(1:n) )
call random_number ( harvest = z(1:n) )
call random_number ( harvest = p(1:n) )
wtime1 = omp_get_wtime ( )
if ( func == 1 ) then
! y(1:n) = x(1:n) + wtime1*z(1:n) + wtime1*p(1:n)
do i = 1, n
y(i) = x(i) + wtime1*z(i) + wtime1*p(i)
end do
else if ( func == 2 ) then
! y(1:n) = pi * x(1:n)
do i = 1, n
y(i) = pi * x(i)
end do
else if ( func == 3 ) then
! y(1:n) = sqrt ( x(1:n) )
do i = 1, n
y(i) = sqrt ( x(i) )
end do
else if ( func == 4 ) then
! y(1:n) = exp ( x(1:n) )
do i = 1, n
y(i) = exp ( x(i) )
end do
end if
wtime2 = omp_get_wtime ( )
delta(1,n_log,i_rep) = wtime2 - wtime1
end do
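! Second pass over all sizes: time the same loops with vectorization disabled via !DIR$ NOVECTOR.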
do n_log = n_log_min, n_log_max
n = 2**( n_log )
call random_number ( harvest = x(1:n) )
call random_number ( harvest = z(1:n) )
call random_number ( harvest = p(1:n) )
wtime1 = omp_get_wtime ( )
if ( func == 1 ) then
!DIR$ NOVECTOR
do i = 1, n
y(i) = x(i) + wtime1*z(i) + wtime1*p(i)
end do
else if ( func == 2 ) then
!DIR$ NOVECTOR
do i = 1, n
y(i) = pi * x(i)
end do
else if ( func == 3 ) then
!DIR$ NOVECTOR
do i = 1, n
y(i) = sqrt ( x(i) )
end do
else if ( func == 4 ) then
!DIR$ NOVECTOR
do i = 1, n
y(i) = exp ( x(i) )
end do
end if
wtime2 = omp_get_wtime ( )
delta(2,n_log,i_rep) = wtime2 - wtime1
end do
end do
! The following statement prevents the compiler from optimizing away the scalar operations:
print*, "y(4)=",y(4)
write ( *, '(a)' ) ' '
write ( *, '(a)' ) ' Timing results:'
write ( *, '(a)' ) ' '
write ( *, '(a)' ) ' Vector Size TVec#1 TVec#2 ' &
// 'TVec#3 TSca#1 TSca#2 TSca#3 AVGRatio'
write ( *, '(a)' ) ' '
do n_log = n_log_min, n_log_max
n = 2**( n_log )
write ( *, '(i10,3f9.6,3f9.6,1f9.4)' ) n, &
delta(1,n_log,1:n_rep), delta(2,n_log,1:n_rep), &
SUM(delta(2,n_log,1:n_rep))/SUM(delta(1,n_log,1:n_rep))
end do
end do
return
end
Here is the start of the /proc/cpuinfo output:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 45
model name : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
stepping : 7
cpu MHz : 2600.084
cache size : 20480 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 8
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 x2apic popcnt aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
bogomips : 5200.16
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
And OS info:
> uname -a
Linux node1141 2.6.32-220.23.1.el6.x86_64 #1 SMP Mon Jun 18 18:58:52 BST 2012 x86_64 x86_64 x86_64 GNU/Linux
Here the compiler generated streaming stores for the vectorized loops. Using the option "-opt-streaming-stores never" will disable them and make the vectorized version faster than the scalar version.
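For example, simply appending the flag to the compile line from the original post should be enough to test this:
ifort -g test.f90 -openmp -o omp_alloc \
-mcmodel=large -O2 -vec-report3 -opt-report 2 -opt-report-file=opt.rpt.2 \
-opt-streaming-stores never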
Set up your timing loop repetition count such that the run time between the omp_get_wtime() calls is large enough that any system overhead becomes negligible. A run time of about 1 second between the omp_get_wtime() calls will provide reasonable statistics. You can then divide the elapsed time for each test by the number of repetitions to get the per-iteration time for that test.
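A minimal sketch of that change, applied to the func == 1 vector loop in test02 (rep_count and the loop index k are new names introduced here; pick rep_count so the timed block runs on the order of a second):
wtime1 = omp_get_wtime ( )
do k = 1, rep_count
! note: extra care (e.g. touching y between passes) may be needed so the
! compiler does not optimize away the repeated, identical passes
do i = 1, n
y(i) = x(i) + wtime1*z(i) + wtime1*p(i)
end do
end do
wtime2 = omp_get_wtime ( )
! report the average time of one pass over the data
delta(1,n_log,i_rep) = ( wtime2 - wtime1 ) / rep_count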
Jim Dempsey
In order for streaming stores to pay off, you must either make the arrays large enough to consume several times the last-level cache, or make the individual sections of the benchmark write into distinct arrays.
Some of these issues have been discussed at length on these forums in connection with the McCalpin STREAM benchmark, which has buried in its rules that you must make the tests bigger if you are observing cache re-use effects.
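As a rough sizing check against the cpuinfo above (assuming the four real*8 arrays x, y, z and p are the only large data): at the largest size of 262144 elements, each array occupies 262144 * 8 bytes = 2 MB, so all four together are about 8 MB, well under the 20480 KB last-level cache. To exceed that cache several times over, n_log_max would need to be raised to around 23 (2**23 * 8 bytes = 64 MB per array, roughly 256 MB in total), or the benchmark would have to cycle through separate arrays.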
