
Chris_S_7 · ‎05-09-2016

Hi everyone,

This is my first post. Hurray!!

Today I was testing a conditional average of a 2-dimensional field based on quadrant location. While looking for a way to speed it up, I found the WHERE construct, which I thought would be faster than an IF inside a nested DO loop (through vectorization or something else). Instead, I found that WHERE is slightly slower than the IF version. I was wondering whether I am doing something wrong, or whether WHERE would only be faster with OpenMP. Below is the code I used for this test.

Thanks in advance.

program main
      use ifport
      integer, parameter :: n1=10000, n2=5000
      real, dimension(n1,n2) :: x1,x2
      real, dimension(n1,n2,4) :: x3, x4
      real :: stime, ftime
      integer :: i,j,k


      do j = 1, n2
        do i = 1, n1
          x1(i,j) = (2.*rand()-1.)
          x2(i,j) = (2.*rand()-1.)
        enddo
      enddo
      x3 =0.
      x4 =0.

      call cpu_time(stime)
      where((x1.gt.0.).and.(x2.gt.0))
       x3(:,:,1) = x3(:,:,1) + x1*x2
      elsewhere((x1.lt.0.).and.(x2.gt.0))
       x3(:,:,2) = x3(:,:,2) + x1*x2
      elsewhere((x1.lt.0.).and.(x2.lt.0))
       x3(:,:,3) = x3(:,:,3) + x1*x2
      elsewhere((x1.gt.0.).and.(x2.lt.0))
       x3(:,:,4) = x3(:,:,4) + x1*x2
      end where
      call cpu_time(ftime)
      write(*,*) 'time ', ftime-stime

      call cpu_time(stime)
      do j=1,n2
        do i=1,n1
          if ((x1(i,j).gt.0.).and.(x2(i,j).gt.0)) then
            x4(i,j,1) = x4(i,j,1) + x1(i,j)*x2(i,j)
          elseif ((x1(i,j).lt.0.).and.(x2(i,j).gt.0)) then
            x4(i,j,2) = x4(i,j,2) + x1(i,j)*x2(i,j)
          elseif ((x1(i,j).lt.0.).and.(x2(i,j).lt.0)) then
            x4(i,j,3) = x4(i,j,3) + x1(i,j)*x2(i,j)
          elseif ((x1(i,j).gt.0.).and.(x2(i,j).lt.0)) then
            x4(i,j,4) = x4(i,j,4) + x1(i,j)*x2(i,j)
          endif
        enddo
      enddo
      call cpu_time(ftime)
      write(*,*) 'time2 ', ftime-stime

end program

time 0.1079841
time2 9.8984957E-02

TimP · ‎05-09-2016

This test doesn't prevent the compiler from taking shortcuts, as it can see that you never use any results.
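One way to guard against that, as a minimal sketch rather than a definitive fix: print something derived from the result after the timed region, so the work has an observable effect. The program name, the smaller array sizes, and the use of the standard random_number intrinsic (instead of rand() from ifport) are assumptions made for illustration.

```fortran
program keep_results
  implicit none
  ! Smaller than the arrays in the thread; sizes are an assumption
  integer, parameter :: n1 = 1000, n2 = 500
  real, allocatable :: x1(:,:), x2(:,:), x3(:,:,:)
  real :: stime, ftime
  integer :: q

  allocate(x1(n1,n2), x2(n1,n2), x3(n1,n2,4))
  call random_number(x1)      ! uniform in [0,1)
  call random_number(x2)
  x1 = 2.*x1 - 1.             ! rescale to [-1,1)
  x2 = 2.*x2 - 1.
  x3 = 0.

  call cpu_time(stime)
  where ((x1 > 0.) .and. (x2 > 0.))
    x3(:,:,1) = x3(:,:,1) + x1*x2
  elsewhere ((x1 < 0.) .and. (x2 > 0.))
    x3(:,:,2) = x3(:,:,2) + x1*x2
  elsewhere ((x1 < 0.) .and. (x2 < 0.))
    x3(:,:,3) = x3(:,:,3) + x1*x2
  elsewhere ((x1 > 0.) .and. (x2 < 0.))
    x3(:,:,4) = x3(:,:,4) + x1*x2
  end where
  call cpu_time(ftime)
  write(*,*) 'time ', ftime - stime

  ! Printing a value computed from x3 makes the result observable,
  ! so the timed work cannot be discarded as dead code.
  do q = 1, 4
    write(*,*) 'quadrant ', q, ' sum ', sum(x3(:,:,q))
  end do
end program keep_results
```

The same idea applies to the DO/IF version: print a sum of x4 after its timer stops.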

One might expect the where..elsewhere... to be slower in this case, as it implies generation of a bunch of logical arrays. If the results are so close, the compiler must have done a fair job of eliminating these.

Your conditions seem sufficiently complicated to kill hope of vectorization, but in that context, elsewhere tends to be inefficient.

If you are trying to minimize cpu time, OpenMP is not the answer, as it generally trades use of more cores and total cpu time for the expectation of reduced elapsed time.
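To see the difference between the two kinds of time, a sketch like the following reports both. It assumes cpu_time for processor time and system_clock for elapsed time; the program name, array size, and loop body are placeholders. With many compilers, cpu_time in a threaded run reports time summed over all threads, so it can grow while elapsed time shrinks.

```fortran
program cpu_vs_elapsed
  use iso_fortran_env, only: int64
  implicit none
  integer, parameter :: n = 2000000   ! placeholder problem size
  real, allocatable :: a(:)
  real :: c0, c1, total
  integer(int64) :: t0, t1, rate
  integer :: i

  allocate(a(n))
  call random_number(a)

  call cpu_time(c0)                 ! processor time
  call system_clock(t0, rate)       ! wall-clock ticks and ticks per second

  total = 0.
  do i = 1, n
    total = total + a(i)*a(i)
  end do

  call system_clock(t1)
  call cpu_time(c1)

  write(*,*) 'cpu seconds     ', c1 - c0
  write(*,*) 'elapsed seconds ', real(t1 - t0) / real(rate)
  write(*,*) 'checksum        ', total   ! keeps the loop from being optimized away
end program cpu_vs_elapsed
```

In a serial run the two numbers should be close; they diverge once several threads share the work.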

jimdempseyatthecove · ‎05-10-2016

Chris,

Experiment with something like this:

program main
      use ifport
      integer, parameter :: n1=10000, n2=5000
      real, allocatable, dimension(:,:) :: x1,x2, x1x2
      integer, allocatable, dimension(:,:) :: idx
      real, allocatable, dimension(:,:,:) :: x3, x4
      real :: stime, ftime
      integer :: i,j,k

    allocate(x1(n1,n2) ,x2(n1,n2), x1x2(n1,n2), idx(n1,n2))
    allocate(x3(n1,n2,4), x4(n1,n2,4))
      do j = 1, n2
        do i = 1, n1
          x1(i,j) = (2.*rand()-1.)
          x2(i,j) = (2.*rand()-1.)
        enddo
      enddo
      x3 =0.
      x4 =0.

      call cpu_time(stime)
      where((x1.gt.0.).and.(x2.gt.0))
       x3(:,:,1) = x3(:,:,1) + x1*x2
      elsewhere((x1.lt.0.).and.(x2.gt.0))
       x3(:,:,2) = x3(:,:,2) + x1*x2
      elsewhere((x1.lt.0.).and.(x2.lt.0))
       x3(:,:,3) = x3(:,:,3) + x1*x2
      elsewhere((x1.gt.0.).and.(x2.lt.0))
       x3(:,:,4) = x3(:,:,4) + x1*x2
      end where
      call cpu_time(ftime)
      write(*,*) 'time ', ftime-stime

      call cpu_time(stime)
      do j=1,n2
        do i=1,n1
          if ((x1(i,j).gt.0.).and.(x2(i,j).gt.0)) then
            x4(i,j,1) = x4(i,j,1) + x1(i,j)*x2(i,j)
          elseif ((x1(i,j).lt.0.).and.(x2(i,j).gt.0)) then
            x4(i,j,2) = x4(i,j,2) + x1(i,j)*x2(i,j)
          elseif ((x1(i,j).lt.0.).and.(x2(i,j).lt.0)) then
            x4(i,j,3) = x4(i,j,3) + x1(i,j)*x2(i,j)
          elseif ((x1(i,j).gt.0.).and.(x2(i,j).lt.0)) then
            x4(i,j,4) = x4(i,j,4) + x1(i,j)*x2(i,j)
          endif
        enddo
      enddo
      call cpu_time(ftime)
      write(*,*) 'time2 ', ftime-stime
      call cpu_time(stime)
      x1x2 = x1*x2
      idx = (int(sign(1.0,x1)+sign(2.0,x2))+5)/2
      ! *** note the quadrant indexes differ
      !            int +5 /2
      ! ge ge  1+2  3   8  4
      ! lt ge -1+2  1   6  3
      ! lt lt -1-2 -3   2  1
      ! ge lt  1-2 -1   4  2

      do j=1,n2
        do i=1,n1
        x4(i,j,idx(i,j)) = x4(i,j,idx(i,j)) + x1x2(i,j)
        enddo
      enddo
      call cpu_time(ftime)
      write(*,*) 'time3 ', ftime-stime

end program

 time   0.9204060
 time2   0.4056025
 time3   0.2496016

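That is a further gain: the DO/IF loop (time2) takes about 62.5% longer than the pre-indexed loop (time3), since 0.4056/0.2496 ≈ 1.625.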

My system did not have AVX2 (nor AVX512). I did not test on my KNC.

As noted in the comments, the quadrant indices are different. Those indices were arbitrary to begin with, so redefining them shouldn't be much of an issue.

You will need to run some verification on the results. Note that when either or both of the x1 and x2 cells are zero, the product is 0.0 and the accumulation into x4 adds 0.0. This calculation is faster than performing the series of IF tests with a conditional assignment.
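A possible verification, sketched under assumptions: fill two separate accumulators (a for the IF chain, b for the sign-based index; both names are hypothetical), then compare them through the quadrant remapping read off the comment table (IF quadrants 1, 2, 3, 4 land in slots 4, 3, 1, 2). The array sizes and the use of random_number are also assumptions.

```fortran
program verify_quadrants
  implicit none
  integer, parameter :: n1 = 400, n2 = 300
  ! map(q): slot the sign-based index uses for IF-quadrant q
  integer, parameter :: map(4) = [4, 3, 1, 2]
  real, allocatable :: x1(:,:), x2(:,:), a(:,:,:), b(:,:,:)
  integer :: i, j, k, q
  logical :: ok

  allocate(x1(n1,n2), x2(n1,n2), a(n1,n2,4), b(n1,n2,4))
  call random_number(x1)
  call random_number(x2)
  x1 = 2.*x1 - 1.
  x2 = 2.*x2 - 1.
  a = 0.
  b = 0.

  do j = 1, n2
    do i = 1, n1
      ! Reference: the original IF chain
      if ((x1(i,j) > 0.) .and. (x2(i,j) > 0.)) then
        a(i,j,1) = a(i,j,1) + x1(i,j)*x2(i,j)
      else if ((x1(i,j) < 0.) .and. (x2(i,j) > 0.)) then
        a(i,j,2) = a(i,j,2) + x1(i,j)*x2(i,j)
      else if ((x1(i,j) < 0.) .and. (x2(i,j) < 0.)) then
        a(i,j,3) = a(i,j,3) + x1(i,j)*x2(i,j)
      else if ((x1(i,j) > 0.) .and. (x2(i,j) < 0.)) then
        a(i,j,4) = a(i,j,4) + x1(i,j)*x2(i,j)
      end if
      ! Candidate: sign-based quadrant index (1..4)
      k = (int(sign(1.0, x1(i,j)) + sign(2.0, x2(i,j))) + 5) / 2
      b(i,j,k) = b(i,j,k) + x1(i,j)*x2(i,j)
    end do
  end do

  ! Compare slot by slot through the remapping
  ok = .true.
  do q = 1, 4
    if (any(a(:,:,q) /= b(:,:,map(q)))) ok = .false.
  end do
  write(*,*) 'results match: ', ok
end program verify_quadrants
```

Cells where x1 or x2 is exactly zero are skipped by the IF chain and add a zero product in the indexed version, so both leave the accumulator at zero and the comparison is expected to pass.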

One of the hidden problems you have with "where" is the consumption of excessive stack for the temporary arrays. Also, it is not directly usable with OpenMP, although you can use it in OpenMP if you manually slice up the arrays. The do loop formats are easily adaptable to OpenMP.
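As a rough sketch of that adaptation (not a tuned implementation), the pre-indexed loop can be split across threads over the column index j. The program name, array sizes, and the scalar k replacing the idx array are assumptions. Each (i,j) cell is written by exactly one iteration, so no reduction or locking should be needed.

```fortran
program omp_quadrants
  implicit none
  integer, parameter :: n1 = 1000, n2 = 500   ! placeholder sizes
  real, allocatable :: x1(:,:), x2(:,:), x4(:,:,:)
  integer :: i, j, k

  allocate(x1(n1,n2), x2(n1,n2), x4(n1,n2,4))
  call random_number(x1)
  call random_number(x2)
  x1 = 2.*x1 - 1.
  x2 = 2.*x2 - 1.
  x4 = 0.

  ! Columns j are distributed over the threads; i and k are per-thread.
  ! Without an OpenMP compiler option the directives are plain comments
  ! and the loop runs serially.
  !$omp parallel do private(i, k)
  do j = 1, n2
    do i = 1, n1
      k = (int(sign(1.0, x1(i,j)) + sign(2.0, x2(i,j))) + 5) / 2
      x4(i,j,k) = x4(i,j,k) + x1(i,j)*x2(i,j)
    end do
  end do
  !$omp end parallel do

  write(*,*) 'quadrant sums ', (sum(x4(:,:,k)), k = 1, 4)
end program omp_quadrants
```

Build with the compiler's OpenMP option (for example -fopenmp with gfortran) to enable the directives. The same outer loop over j is where a WHERE construct applied to column slices such as x1(:,j) could be placed if you prefer to keep that form.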

Jim Dempsey

Chris_S_7 · ‎05-10-2016

First of all, Thanks.

Mr Dempsey,

I ran some tests on the changes you suggested and found good and bad performance under different conditions. I compiled with several optimization levels and realized that the optimization level (-O#) wasn't actually doing much. In fact, -O3 made it slower. Overall, I think my results match yours. However, with processor-specific optimization (-xHost), the do-if method seems to outperform the pre-indexing method (although not by much). I am very curious about what the compiler is doing here.

I left some other info below.

Thanks for the help!

ifort -O3 -xHost test2.f90 
 time   0.4919240    
 time2   0.2159681    
 time3   0.2829571    

ifort -O3 test2.f90
 time    1.215815    
 time2   0.5139220    
 time3   0.3489470 

ifort -O2 test2.f90
 time    1.143826    
 time2   0.5239201    
 time3   0.2969561 

ifort  test2.f90
 time    1.144826    
 time2   0.5249200    
 time3   0.2969551 

ifort -xHost test2.f90 
 time   0.6429019    
 time2   0.2169669    
 time3   0.2899551

ifort version 14.0.3

Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz

TimP · ‎05-10-2016

If the compiler performs some AVX vectorization under xHost, you can read about it in the files produced by -qopt-report4.

jimdempseyatthecove · ‎05-10-2016

This will give better L1/L2 cache access patterns:

program main
      use ifport
      integer, parameter :: n1=10000, n2=5000
      !dir$ attributes align: 64:: x1,x2, x1x2, idx, x3, x4
      real, allocatable, dimension(:,:) :: x1, x2
      real, allocatable, dimension(:) :: x1x2
      integer(1), allocatable, dimension(:) :: idx
      real, allocatable, dimension(:,:,:) :: x3, x4
      real :: stime, ftime
      integer :: i,j,k

    allocate(x1(n1,n2) ,x2(n1,n2), x1x2(n1), idx(n1))
    allocate(x3(n1,n2,4), x4(n1,n2,4))
      do j = 1, n2
        do i = 1, n1
          x1(i,j) = (2.*rand()-1.)
          x2(i,j) = (2.*rand()-1.)
        enddo
      enddo
      x3 =0.
      x4 =0.

      call cpu_time(stime)
      where((x1.gt.0.).and.(x2.gt.0))
       x3(:,:,1) = x3(:,:,1) + x1*x2
      elsewhere((x1.lt.0.).and.(x2.gt.0))
       x3(:,:,2) = x3(:,:,2) + x1*x2
      elsewhere((x1.lt.0.).and.(x2.lt.0))
       x3(:,:,3) = x3(:,:,3) + x1*x2
      elsewhere((x1.gt.0.).and.(x2.lt.0))
       x3(:,:,4) = x3(:,:,4) + x1*x2
      end where
      call cpu_time(ftime)
      write(*,*) 'time ', ftime-stime

      call cpu_time(stime)
      do j=1,n2
        do i=1,n1
          if ((x1(i,j).gt.0.).and.(x2(i,j).gt.0)) then
            x4(i,j,1) = x4(i,j,1) + x1(i,j)*x2(i,j)
          elseif ((x1(i,j).lt.0.).and.(x2(i,j).gt.0)) then
            x4(i,j,2) = x4(i,j,2) + x1(i,j)*x2(i,j)
          elseif ((x1(i,j).lt.0.).and.(x2(i,j).lt.0)) then
            x4(i,j,3) = x4(i,j,3) + x1(i,j)*x2(i,j)
          elseif ((x1(i,j).gt.0.).and.(x2(i,j).lt.0)) then
            x4(i,j,4) = x4(i,j,4) + x1(i,j)*x2(i,j)
          endif
        enddo
      enddo
      call cpu_time(ftime)
      write(*,*) 'time2 ', ftime-stime
      call cpu_time(stime)
      ! *** note the quadrant indexes differ
      !            int +5 /2
      ! ge ge  1+2  3   8  4
      ! lt ge -1+2  1   6  3
      ! lt lt -1-2 -3   2  1
      ! ge lt  1-2 -1   4  2

      do j=1,n2
          x1x2 = x1(:,j)*x2(:,j)
          idx = (int(sign(1.0,x1(:,j))+sign(2.0,x2(:,j)))+5)/2
        do i=1,n1
          x4(i,j,idx(i)) = x4(i,j,idx(i)) + x1x2(i)
        enddo
      enddo
      call cpu_time(ftime)
      write(*,*) 'time3 ', ftime-stime

end program

Results:

 time   0.9204059
 time2   0.4212027
 time3   0.1560011

Jim Dempsey

Where vs do & if