Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Ideal vectorization speed-up with SSE2 and MIC512 - not AVX?

PKM
Beginner

Hi

In the process of optimizing a large Fortran research code, I have written a simple program that closely reproduces the performance characteristics of the more complicated case. The code essentially ends up spending all its time evaluating exponential functions and square roots in a vectorizable manner, so it is a compute-bound problem that should be extremely well suited to the Xeon Phi and wide vector units in general.

By running the program below I obtained vectorized and unvectorized performance results for SSE3, AVX, and Xeon Phi compilations. The "funny" thing is that I get virtually ideal vectorization speed-up for SSE3 and on the Xeon Phi, but not for AVX. I am using the latest version of Parallel Studio on Windows, and I run the program on a Xeon E5-2650 v2 with a Xeon Phi 3120. Performance numbers from running the attached code follow below.
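
(The three builds differ only in the target ISA switch; with Windows ifort the compile lines would be along the lines of

  ifort /Qopenmp /QxSSE3 kernelopt.f90
  ifort /Qopenmp /QxAVX  kernelopt.f90

with the Xeon Phi offload code generated automatically whenever the offload directives are compiled in.)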

Any idea why the AVX speed-up is so far from ideal?

Any idea why I am not seeing better performance from the Xeon Phi? My code is clearly compute bound and embarrassingly parallel, uses aligned vector instructions, and does no allocations, yet the Xeon Phi is only 3x faster than the host CPU running off SSE3 instructions. For SSE3 instructions the peak performance of the host CPU should be 166 GFLOPS versus 1000 GFLOPS for the Phi, so I would expect something more in line with a 6x difference?
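
(Presumably the 166 GFLOPS figure is 8 cores x 2.6 GHz x 8 FLOP/cycle = 166.4 GFLOPS, which is strictly the AVX peak at the base clock; with 128-bit SSE the host peak halves to 83.2 GFLOPS. The Phi 3120 figure is 57 cores x 1.1 GHz x 16 FLOP/cycle = 1003.2 GFLOPS, hence the ~6x.)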

Thank you very much in advance for your advice!


CPU SSE3 (NoVec/Vec): 437/871 models/s -> 2.0x vectorization speed-up
CPU AVX (NoVec/Vec): 437/1194 models/s -> 2.7x vectorization speed-up
Xeon Phi (NoVec/Vec): 343/2591 models/s -> 7.6x vectorization speed-up

module mComputations

  real*8,dimension(-57:50)          :: RAll
  real*8,dimension(-55:16)        :: LambdaAll,LambdaAll2
  real*8,dimension(-55:16)        :: Ri,Ei,Fi,Hi,Un,Ui
  real*8,dimension(-55:16,1:20)   :: RefAll,UAll,E2
  real*8,dimension(1:20)            :: FiltResp    
!dir$ attributes offload : mic :: LambdaAll,LambdaAll2,Ri,Ei,Fi,Hi,Un,Ui,RefAll,Uall,E2,Filtresp,RAll
!DIR$ ATTRIBUTES ALIGN : 64 ::  LambdaAll,LambdaAll2,Ri,Ei,Fi,Hi,Un,Ui,RefAll,Uall,E2,Filtresp,RAll
!$OMP THREADPRIVATE(LambdaAll,LambdaAll2,Ri,Ei,Fi,Hi,Un,Ui,RefAll,Uall,E2,Filtresp,RAll)   
  contains
!dir$ attributes offload : mic :: DoComputations     
    subroutine DoComputations(iNoModels,iUseVectorization)
!   Input:
!    iNoModels         -> Number of models to calculate. Set high for good statistics in the benchmark.
!    iUseVectorization -> If .true., run the benchmark with aligned vector instructions; if .false., disable vectorization.
    use omp_lib
    implicit none
    integer, intent(in) :: iNoModels
    logical, intent(in) :: iUseVectorization
    integer :: I2,I,J,IJMinCalc,IJMaxCalc,ijmaxloop,ijminloop,NoModels,t,k,Models,NLayM
    real*8  :: SMy,Rs,ki2,time,E
    real*8  :: Sigma(30),Thick(30), Timebegin,TimeEnd,Val,Kn2
    real*8  :: Nom,Denom
    real*8  :: Exparg
!DIR$ ATTRIBUTES ALIGN : 64 ::  Thick
    if (iUseVectorization) then 
      print *,'Vectorized:'
    else
      print *,'Unvectorized:'  
    end if    
    NLayM=30
    Sigma(:)=0.1 
    Thick(:)=2.5
    ijMinCalc=-55
    ijMaxCalc=16
    ijMinLoop=1
    ijMaxLoop=ijMaxCalc-ijmincalc+1
    ! Variables
    TimeBegin=omp_get_wtime()
    !Loop over models
    NoModels=iNoModels
!$OMP PARALLEL DEFAULT(PRIVATE) SHARED(iUseVectorization,NoModels,NLayM,Thick,Sigma,ijmincalc,ijmaxcalc,ijMinLoop,ijMaxLoop)
!$OMP DO
    do Models=1,NoModels
      do t=1,31
        time=log(2d0)/(1d-6*10**((t-1d0)/10d0))
        E=10d0**0.1d0
        if (.not.iUseVectorization) then
!DIR$ NO VECTOR   
          do I = ijmincalc,ijmaxcalc
            Val = E**(I)*0.1d0
            lambdaAll2(I) = Val*Val
          enddo
      !Loop over frequencies - 16
          do k=1,16
            SMy=4*3.14e-7*time  
            ! start from the lowest layer
            kn2 = SMy*sigma(NLayM)   ! sigma is already real*8; real() would truncate to single precision
!DIR$ NO VECTOR 
            do J=ijmincalc,ijmaxcalc
              Un(J) = sqrt(LambdaAll2(J)+kn2)
              Fi(J) = 0
            enddo         
            do I2=NLayM-1,1,-1 ! this loop calculates from N-1 to 1 going upward
              rs = SMy*(sigma(I2)-sigma(I2+1))
              ki2 = SMy*sigma(I2)
!DIR$ NO VECTOR  
              do J=ijmincalc,ijmaxcalc 
                !The critical loop is here!
                Ui(J) = sqrt(LambdaAll2(J)+ki2)
                Hi(J) = Ui(J)+Un(J)
                Ri(J) = rs/(Hi(J)*Hi(J)) 
                exparg = -2.d0*ui(j)*Thick(I2)
                Ei(J) = exp(exparg)
                nom = (Ei(J)*(Ri(J)+Fi(J)))
                denom = (1.d0+Ri(J)*Fi(J))
                Fi(J) = nom/denom
                Un(J) = Ui(J)       
              end do  
            end do               
          enddo
        else
!DIR$ VECTOR ALIGNED     
          do I = ijmincalc,ijmaxcalc
            Val = E**(I)*0.1d0
            lambdaAll2(I) = Val*Val
          enddo
      !Loop over frequencies - 16
          do k=1,16
            SMy=4*3.14e-7*time  
            ! start from the lowest layer
            kn2 = SMy*sigma(NLayM)
!DIR$ VECTOR ALIGNED  
            do J=ijmincalc,ijmaxcalc
              Un(J) = sqrt(LambdaAll2(J)+kn2)
              Fi(J) = 0
            enddo         
            do I2=NLayM-1,1,-1 ! this loop calculates from N-1 to 1 going upward
              rs = SMy*(sigma(I2)-sigma(I2+1))
              ki2 = SMy*sigma(I2)
!DIR$ VECTOR ALIGNED  
              do J=ijmincalc,ijmaxcalc 
                !The critical loop is here!
                Ui(J) = sqrt(LambdaAll2(J)+ki2)
                Hi(J) = Ui(J)+Un(J)
                Ri(J) = rs/(Hi(J)*Hi(J)) 
                exparg = -2.d0*ui(j)*Thick(I2)
                Ei(J) = exp(exparg)
                nom = (Ei(J)*(Ri(J)+Fi(J)))
                denom = (1.d0+Ri(J)*Fi(J))
                Fi(J) = nom/denom
                Un(J) = Ui(J)       
              end do  
            end do               
          enddo
        end if
      end do
    end do
!$OMP END DO
!$OMP END PARALLEL        
    TimeEnd=omp_get_wtime()
    print *,'Models/s=',NoModels*1d0/(TimeEnd-TimeBegin)
    end subroutine DoComputations
    end module mComputations  
      
    program kernelopt
    use mComputations
    implicit none
    !!set environment variable KMP_AFFINITY=scatter before running
    !Perform the same calculation with and without vectorization
    print *,'CPU benchmark'
    call omp_set_num_threads(8)
    call DoComputations(8*4000,.true.)
    call DoComputations(8*1000,.false.)
    print *,'Xeon phi benchmark' 
!DIR$ OFFLOAD BEGIN TARGET(mic:0)
    call omp_set_num_threads(224)
    call DoComputations(224*800,.true.)
    call DoComputations(224*100,.False.)
    
!DIR$ END OFFLOAD
    end program kernelopt

TimP
Honored Contributor III

Among the possible reasons (assuming that by "ideal" you mean AVX could have twice the parallelism of SSE3 on an "ideal" CPU):

  SIMD sqrt and divide have to be split by hardware into 128-bit chunks on the Ivy Bridge architecture, so AVX may not be faster than SSE3.

  If your critical loops aren't blocked for the L1 cache, you have a 128-bit width limitation on L2 access on Ivy Bridge.

But you have put NO VECTOR on what you call the critical loop, so you can't expect a speedup over SSE3 there.

The iterative-throughput /Qprec-div- /Qprec-sqrt- replacement for the SIMD divide and sqrt instructions attempts to overcome some of the limitations of the 128-bit chunking of the IEEE instructions, but it will not help much, if at all, with latency. You can experiment with the /Qimf-accuracy options to reduce the number of iterations (and the accuracy) in order to improve latency.
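
For example, one possible combination on Windows (a sketch only; the /Qimf-* family includes /Qimf-precision and /Qimf-accuracy-bits, so check the documentation for your compiler version):

  ifort /Qopenmp /QxAVX /Qprec-div- /Qprec-sqrt- /Qimf-precision:medium kernelopt.f90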

PKM
Beginner

Thanks for your reply - please see my comments inline below :-)

Among the possible reasons (assuming that by "ideal" you mean AVX could have twice the parallelism of SSE3 on an "ideal" CPU):

  SIMD sqrt and divide have to be split by hardware into 128-bit chunks on the Ivy Bridge architecture, so AVX may not be faster than SSE3.

I have also tried compiling it for AVX2 and running it on a Haswell machine I have at home - same result.

If your critical loops aren't blocked for the L1 cache, you have a 128-bit width limitation on L2 access on Ivy Bridge.

The code constantly operates on the same 10 kilobytes of data or so per thread, so wouldn't that work right out of the box?
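
(Rough estimate: the seven work arrays touched in the inner loops each span (-55:16), i.e. 72 elements x 8 bytes = 576 bytes, so the hot working set is about 4 KB per thread - comfortably inside a 32 KB L1.)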

But you have put NO VECTOR on what you call the critical loop, so you can't expect a speedup over SSE3 there.

There is an IF statement separating the kernel of the code into two identical paths, depending on the value of the input iUseVectorization: one path uses NO VECTOR on all loops, the other uses VECTOR ALIGNED.

Any comments on the performance I am seeing on the Xeon Phi? Only 3x faster than SSE3 on an 8-core Ivy Bridge seems very low to me ...?

McCalpinJohn
Honored Contributor III

Square roots, exponentials, and divides are all relatively expensive and have fairly complex performance characteristics, especially if full precision is required.

As Tim Prince noted, the 256-bit AVX divide instruction provides the same throughput as the 128-bit SSE2 divide instruction on Sandy Bridge, Ivy Bridge, and Haswell.  This is unchanged in AVX2.  Ivy Bridge included an improvement in FP divide throughput over Sandy Bridge, and that improvement is carried forward in Haswell, but on each of those platforms the 256-bit divide instruction provides no improvement in throughput relative to using 128-bit divide instructions.

Another feature that may hurt AVX/AVX2 performance on Haswell is the lack of support for FP Addition on Port 0.   This means that Sandy Bridge, Ivy Bridge, and Haswell all have the same limit of one 256-bit (4x64) AVX FP Add instruction every cycle.   Haswell can issue 2 FP multiplies per cycle or 2 FMAs, but for Adds the performance is unchanged.   Haswell can produce the effect of an FP Add on Port 0 using an FMA instruction (with one of the arguments to the multiplication set to 1.0), but this comes at the cost of both an extra register and increased latency -- both of which are often critical in the evaluation of complex functions.
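
(Concretely: c = a + b can be issued on Port 0 as the FMA c = a*1.0 + b, which on Haswell trades the 3-cycle add latency for the 5-cycle FMA latency and ties up a register to hold the constant 1.0.)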

The 3x speedup of the Xeon Phi over the Xeon E5-2650 v2 seems excellent.  The Xeon E5-2650 v2 has a peak AVX performance of 198.4 GFLOPS at the maximum all-core Turbo frequency of 3.1 GHz, while the Xeon Phi 3120 has a peak performance of 1003.2 GFLOPS.   But the Xeon Phi gets 1/2 of its peak performance from FMA instructions, and if the algorithms cannot use these, the 5:1 performance ratio for peak FLOPS drops to 2.5:1 if you are only computing Adds, Multiplies, or combinations of Adds and Multiplies that can't fit into the FMA instruction format(s).   Since you are doing better than 2.5:1, this seems like a good result, not a disappointing one.
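
(For reference, the arithmetic behind those numbers: 198.4 GFLOPS = 8 cores x 3.1 GHz x 8 FLOP/cycle (4-lane DP AVX, separate add and multiply ports); 1003.2 GFLOPS = 57 cores x 1.1 GHz x 16 FLOP/cycle (8-lane DP FMA). Without FMA the Phi peak halves to 501.6 GFLOPS, and 501.6/198.4 is almost exactly 2.5.)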

If you want a more detailed understanding, you should simplify the benchmark to look at the performance of square roots, exponentials, and divides independently.  You would definitely want to include the Haswell platform in the comparison, since it has FMA support.   Since the Xeon Phi has no hardware support for any of these operations, you would definitely need to follow Tim Prince's advice and experiment with the precision controls on the Xeon Phi.   If I recall correctly, the Xeon Phi provides slightly lower precision by default, but there are many options that may change the details of the comparison.
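
Something along the following lines would time each operation in isolation - a quick sketch only, with an arbitrary array size and repeat count, checksums printed so the compiler cannot dead-strip the loops, and /Qopenmp needed for omp_get_wtime:

    program mathops
    use omp_lib
    implicit none
    integer, parameter :: N = 72, REPS = 1000000
    real*8  :: a(N), b(N), t0, t1
    integer :: i, r
!DIR$ ATTRIBUTES ALIGN : 64 :: a, b
    call random_number(a)
    a = a + 1.0d0                             ! keep arguments in a safe range
    ! --- sqrt ---
    t0 = omp_get_wtime()
    do r = 1, REPS
!DIR$ VECTOR ALIGNED
      do i = 1, N
        b(i) = sqrt(a(i))
      end do
      a(mod(r,N)+1) = a(mod(r,N)+1) + 1d-12   ! perturb input so the inner loop is not hoisted
    end do
    t1 = omp_get_wtime()
    print *, 'sqrt:', N*REPS/(t1-t0)/1d9, 'Gop/s, checksum', sum(b)
    ! --- exp ---
    t0 = omp_get_wtime()
    do r = 1, REPS
!DIR$ VECTOR ALIGNED
      do i = 1, N
        b(i) = exp(-a(i))
      end do
      a(mod(r,N)+1) = a(mod(r,N)+1) + 1d-12
    end do
    t1 = omp_get_wtime()
    print *, 'exp: ', N*REPS/(t1-t0)/1d9, 'Gop/s, checksum', sum(b)
    ! --- divide ---
    t0 = omp_get_wtime()
    do r = 1, REPS
!DIR$ VECTOR ALIGNED
      do i = 1, N
        b(i) = 1.0d0/a(i)
      end do
      a(mod(r,N)+1) = a(mod(r,N)+1) + 1d-12
    end do
    t1 = omp_get_wtime()
    print *, 'div: ', N*REPS/(t1-t0)/1d9, 'Gop/s, checksum', sum(b)
    end program mathops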

PKM
Beginner

Thank you very much for your feedback ...

I tried playing around with /Qimf-accuracy, but the effects on performance are very limited. Given that Knights Landing is not so far in the future, I think I will spend time on additional optimization once that platform arrives. Based on the publicly available information on Knights Landing, I would expect a compute-bound code like this to scale with the increase in peak performance over the current MIC generation, i.e. a ballpark performance increase of 3x. Do you see anything in the information announced so far that should make me lower my expectations?

Best regards,

