Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Ideal vectorization speed-up with SSE2 and MIC512 - not AVX?

PKM
Beginner
583 Views

Hi

In the process of optimizing a large Fortran research code I have written a simple program that very closely resembles the performance characteristics of the more complicated case. The code essentially ends up spending all its time evaluating exponential functions and square roots in a vectorizable manner, so it is a compute bound problem that should be extremely well suited for Xeon phi and wide vector units in general.

By running the program below I obtained vectorized and unvectorized performance results for SSE3/AVX/Xeon phi compilations. The "funny" thing is that I get virtually ideal vectorization speed-up for SSE3 and on the Xeon Phi, but not for AVX. I am using the latest version of parallel studio on Windows and I run the program on a Xeon E5-2650 v2 with a Xeon Phi 3120. Performance numbers from running the attached code follows below ...

Any idea why the AVX speed-up is so far from ideal?

Any idea why I am not seeing better performance from the Xeon Phi? My code is clearly compute bound, embarrasingly parallel, uses aligned vector instructions, no allocations, yet the Xeon Phi is only 3x faster than the host cpu running off SSE3 instructions. For SSE3 instructions the peak performance of the host cpu should be 166 GFLOPS versus 1000 GFLOPS for the phi. So I would expect something more in line with a 6x difference?

Thank you very much in advance for you advice!

C

CPU SSE3 (NoVec/Vec): 437/871 -> 2.0x vectorization speed-up

CPU AVX (NoVec/Vec): 437/1194 -> 2.7x vectorization speed-up

Xeon Phi (NoVec/Vec): 343/2591 -> 7.6x vectorization speed-up

module mComputations

  real*8,dimension(-57:50)          :: RAll
  real*8,dimension(-55:16)        :: LambdaAll,LambdaAll2
  real*8,dimension(-55:16)        :: Ri,Ei,Fi,Hi,Un,Ui
  real*8,dimension(-55:16,1:20)   :: RefAll,UAll,E2
  real*8,dimension(1:20)            :: FiltResp    
!dir$ attributes offload : mic :: LambdaAll,LambdaAll2,Ri,Ei,Fi,Hi,Un,Ui,RefAll,Uall,E2,Filtresp,RAll
!DIR$ ATTRIBUTES ALIGN : 64 ::  LambdaAll,LambdaAll2,Ri,Ei,Fi,Hi,Un,Ui,RefAll,Uall,E2,Filtresp,RAll
!$OMP THREADPRIVATE(LambdaAll,LambdaAll2,Ri,Ei,Fi,Hi,Un,Ui,RefAll,Uall,E2,Filtresp,RAll)   
  contains
!dir$ attributes offload : mic :: DoComputations     
    subroutine DoComputations(iNoModels,iUseVectorization)
!   Input:
!    iNoModels         -> Number of models to calculate. Set high for good statistics on the benchmark.
!    iUseVectorization -> If true, the benchmark is run with aligned vector instructions. False, no vectorization.
    use omp_lib
    implicit none
    integer, intent(in) :: iNoModels
    logical, intent(in) :: iUseVectorization
    integer :: I2,I,J,IJMinCalc,IJMaxCalc,ijmaxloop,ijminloop,NoModels,t,k,Models
    real*8  :: SMy,Rs,ki2,NLayM,time,E
    real*8  :: Sigma(30),Thick(30), Timebegin,TimeEnd,Val,Kn2
    real*8  :: Nom,Denom
    real*8  :: Exparg
!DIR$ ATTRIBUTES ALIGN : 64 ::  Thick
    if (iUseVectorization) then 
      print *,'Vectorized:'
    else
      print *,'Unvectorized:'  
    end if    
    NLayM=30
    Sigma(:)=0.1 
    Thick(:)=2.5
    ijMinCalc=-55
    ijMaxCalc=16
    ijMinLoop=1
    ijMaxLoop=ijMaxCalc-ijmincalc+1
    ! Variables
    TimeBegin=omp_get_wtime()
    !Loop over models
    NoModels=iNoModels
!$OMP PARALLEL DEFAULT(PRIVATE) SHARED(iUseVectorization,NoModels,NLayM,Thick,Sigma,ijmincalc,ijmaxcalc,ijMinLoop,ijMaxLoop)
!$OMP DO
    do Models=1,NoModels
      do t=1,31
        time=log(2d0)/(1e-6*10**((t-1d0)/10d0))  
        E=10**0.1
        if (iUseVectorization==.false.) then    
!DIR$ NO VECTOR   
          do I = ijmincalc,ijmaxcalc
            Val = E**(I)*0.1d0
            lambdaAll2(I) = Val*Val
          enddo
      !Loop over frequencies - 16
          do k=1,16
            SMy=4*3.14e-7*time  
            ! start from the lowest layer
            kn2 = Smy*real(sigma(NLayM))
!DIR$ NO VECTOR 
            do J=ijmincalc,ijmaxcalc
              Un(J) = sqrt(LambdaAll2(J)+kn2)
              Fi(J) = 0
            enddo         
            do I2=NLayM-1,1,-1 ! this loop calculates from N-1 to 1 going upwar
              rs = SMy*(sigma(I2)-sigma(I2+1))
              ki2 = Smy*real(sigma(I2))
            

!DIR$ NO VECTOR  
              do J=ijmincalc,ijmaxcalc 
                !The critical loop is here!
                Ui(J) = sqrt(LambdaAll2(J)+ki2)
                Hi(J) = Ui(J)+Un(J)
                Ri(J) = rs/(Hi(J)*Hi(J)) 
                exparg = -2.d0*ui(j)*Thick(I2)
                Ei(J) = exp(exparg)
                nom = (Ei(J)*(Ri(J)+Fi(J)))
                denom = (1.d0+Ri(J)*Fi(J))
                Fi(J) = nom/denom
                Un(J) = Ui(J)       
              end do  
            end do               
          enddo
        else
!DIR$ VECTOR ALIGNED     
          do I = ijmincalc,ijmaxcalc
            Val = E**(I)*0.1d0
            lambdaAll2(I) = Val*Val
          enddo
      !Loop over frequencies - 16
          do k=1,16
            SMy=4*3.14e-7*time  
            ! start from the lowest layer
            kn2 = Smy*real(sigma(NLayM))
!DIR$ VECTOR ALIGNED  
            do J=ijmincalc,ijmaxcalc
              Un(J) = sqrt(LambdaAll2(J)+kn2)
              Fi(J) = 0
            enddo         
            do I2=NLayM-1,1,-1 ! this loop calculates from N-1 to 1 going upwar
              rs = SMy*(sigma(I2)-sigma(I2+1))
              ki2 = Smy*real(sigma(I2))

!DIR$ VECTOR ALIGNED  
              do J=ijmincalc,ijmaxcalc 
                !The critical loop is here!
                Ui(J) = sqrt(LambdaAll2(J)+ki2)
                Hi(J) = Ui(J)+Un(J)
                Ri(J) = rs/(Hi(J)*Hi(J)) 
                exparg = -2.d0*ui(j)*Thick(I2)
                Ei(J) = exp(exparg)
                nom = (Ei(J)*(Ri(J)+Fi(J)))
                denom = (1.d0+Ri(J)*Fi(J))
                Fi(J) = nom/denom
                Un(J) = Ui(J)       
              end do  
            end do               
          enddo
        end if
      end do
    end do
!$OMP END DO
!$OMP END PARALLEL        
    TimeEnd=omp_get_wtime()
    print *,'Models/s=',NoModels*1d0/(TimeEnd-TimeBegin)
    end subroutine DoComputations
    end module mComputations  
      
    program kernelopt
    use mComputations
    implicit none
    integer :: i,j
    real :: Depth(50),Values(50)
    logical :: UseVectorization
    !!set enviroment variable KMP_AFFINITY=scatter before running 
    !Perform the same calculation with and without vectorization
    print *,'CPU benchmark'
    call omp_set_num_threads(8)
    call DoComputations(8*4000,.true.)
    call DoComputations(8*1000,.false.)
    !stop
    print *,'Xeon phi benchmark' 
!DIR$ OFFLOAD BEGIN TARGET(mic:0)
    call omp_set_num_threads(224)
    call DoComputations(224*800,.true.)
    call DoComputations(224*100,.False.)
    
!DIR$ END OFFLOAD
    end program kernelopt

 

 

 

 

 

0 Kudos
4 Replies
TimP
Honored Contributor III
583 Views

Among possible reasons, if by ideal you mean that AVX might have twice the parallelism of SSE3 on an "ideal" CPU:

  simd sqrt and divide have to be split by hardware into 128-bit chunks for Ivy Bridge architecture, so AVX may not be faster than SSE3. 

  if your critical loops aren't blocked for L1 cache you have a 128-bit width limitation on L2 access on Ivy Bridge.

But, you have put NO VECTOR on what you call the critical loop, so you can't expect speedup over SSE3 there.

The iterative throughput Qprec-div- /Qprec-sqrt- replacement for simd divide and sqrt instructions attempts to overcome some of the limitation of 128-bit chunking of the IEEE instructions, but will not help much if any with latency.  You can experiment with /Qimf-accuracy options to reduce number of iterations and accuracy, to improve latency.

0 Kudos
PKM
Beginner
583 Views

Thanks for your reply - please see comments in bold below :-)

Among possible reasons, if by ideal you mean that AVX might have twice the parallelism of SSE3 on an "ideal" CPU:

  sqrt and divide have to be split by hardware into 128-bit chunks for Ivy Bridge architecture, so AVX may not be faster than SSE3. 

I have also tried compiling it for AVX2 and run it on a Haswell I have at home - same result? 

if your critical loops aren't blocked for L1 cache you have a 128-bit width limitation on L2 access on Ivy Bridge.

The code is constantly operating on the same 10 kilobyte of data or so per thread, so wouldn't that work right out of the box?

But, you have put NO VECTOR on what you call the critical loop, so you can't expect speedup over SSE3 there.

There is an IF statement seperating the kernel of the code in two identical paths depending on the value of input iUseVectorization - one uses NO VECTOR for all loops, the other uses VECTOR ALIGNED.

 

Any comments on the performance I am seeing on the Xeon Phi? 3x faster than SSE3 on an 8 core Ivy Bridge seems very low to me ...?

0 Kudos
McCalpinJohn
Honored Contributor III
583 Views

Square roots, exponentials, and divides are all relatively expensive and have fairly complex performance characteristics --- especially if full precision is required.

As Tim Prince noted, the 256-bit AVX divide instruction provides the same throughput as the 128-bit SSE2 divide instruction on Sandy Bridge, Ivy Bridge, and Haswell.  This is unchanged in AVX2.  Ivy Bridge included an improvement in FP divide throughput over Sandy Bridge, and that improvement is carried forward in Haswell, but on each of those platforms the 256-bit divide instruction provides no improvement in throughput relative to using 128-bit divide instructions.

Another feature that may hurt AVX/AVX2 performance on Haswell is the lack of support for FP Addition on Port 0.   This means that Sandy Bridge, Ivy Bridge, and Haswell all have the same limit of one 256-bit (4x64) AVX FP Add instruction every cycle.   Haswell can issue 2 FP multiplies per cycle or 2 FMAs, but for Adds the performance is unchanged.   Haswell can produce the effect of an FP Add on Port 0 using an FMA instruction (with one of the arguments to the multiplication set to 1.0), but this comes at the cost of both an extra register and increased latency -- both of which are often critical in the evaluation of complex functions.

The 3x speedup of the Xeon Phi over the Xeon E5-2650 v2 seems excellent.  The Xeon E5-2650 v2 has a peak AVX performance of 198.4 GFLOPS at the maximum all-core Turbo frequency of 3.1 GHz, while the Xeon Phi 3120 has a peak performance of 1003.2 GFLOPS.   But the Xeon Phi gets 1/2 of its peak performance from FMA instructions, and if the algorithms cannot use these, the 5:1 performance ratio for peak FLOPS drops to 2.5:1 if you are only computing Adds, Multiplies, or combinations of Adds and Multiplies that can't fit into the FMA instruction format(s).   Since you are doing better than 2.5:1, this seems like a good result, not a disappointing one.

If you want a more detailed understanding, you should simplify the benchmark to look at the performance of square roots, exponentials, and divides independently.  You would definitely want to include the Haswell platform in the comparison, since it has FMA support.   Since the Xeon Phi has no hardware support for any of these operations, you would definitely need to follow Time Prince's advice and experiment with the precision controls on the Xeon Phi.   If I recall correctly, Xeon Phi provides slightly lower precision by default, but there are many options that may change the details of the comparison.

0 Kudos
PKM
Beginner
583 Views

Thank you very much for your feedback ...

I tried playing around with  /Qimf-accuracy , but the effects on performance are very limited. Given that Knights landing is not so far into the future I think I will spend time doing additional optimization once that platform arrives. Based on the publicly available information on Knights Landing I would expect a compute bound code like this to scale with the increase in peak performance over the current MIC generation, ie. a ball park performance increase of 3x. Do you see anything in the already announced information that should make me lower my expectations?

Best regards,

 

0 Kudos
Reply