
## Where vs do & if

Hi everyone,

This is my first post. Hurray!!

Today I was testing a conditional average of a 2-dimensional field based on quadrant location. Looking for a way to speed it up, I found the WHERE construct, which I thought would be faster than an IF inside a nested do-loop (through vectorization or something else). It turns out that WHERE is a little slower than the IF. I was wondering if I am doing something wrong, or if it will be faster only if I use OpenMP. Below is the code I used for this testing.

```
program main
  use ifport
  integer, parameter :: n1=10000, n2=5000
  real, dimension(n1,n2) :: x1,x2
  real, dimension(n1,n2,4) :: x3, x4
  real :: stime, ftime
  integer :: i,j,k

  ! fill both fields with random values in [-1, 1)
  do j = 1, n2
    do i = 1, n1
      x1(i,j) = (2.*rand()-1.)
      x2(i,j) = (2.*rand()-1.)
    enddo
  enddo
  x3 =0.
  x4 =0.

  ! test 1: WHERE / ELSEWHERE, one accumulator slice per quadrant
  call cpu_time(stime)
  where((x1.gt.0.).and.(x2.gt.0))
    x3(:,:,1) = x3(:,:,1) + x1*x2
  elsewhere((x1.lt.0.).and.(x2.gt.0))
    x3(:,:,2) = x3(:,:,2) + x1*x2
  elsewhere((x1.lt.0.).and.(x2.lt.0))
    x3(:,:,3) = x3(:,:,3) + x1*x2
  elsewhere((x1.gt.0.).and.(x2.lt.0))
    x3(:,:,4) = x3(:,:,4) + x1*x2
  end where
  call cpu_time(ftime)
  write(*,*) 'time ', ftime-stime

  ! test 2: nested do-loop with an IF chain
  call cpu_time(stime)
  do j=1,n2
    do i=1,n1
      if ((x1(i,j).gt.0.).and.(x2(i,j).gt.0)) then
        x4(i,j,1) = x4(i,j,1) + x1(i,j)*x2(i,j)
      elseif ((x1(i,j).lt.0.).and.(x2(i,j).gt.0)) then
        x4(i,j,2) = x4(i,j,2) + x1(i,j)*x2(i,j)
      elseif ((x1(i,j).lt.0.).and.(x2(i,j).lt.0)) then
        x4(i,j,3) = x4(i,j,3) + x1(i,j)*x2(i,j)
      elseif ((x1(i,j).gt.0.).and.(x2(i,j).lt.0)) then
        x4(i,j,4) = x4(i,j,4) + x1(i,j)*x2(i,j)
      endif
    enddo
  enddo
  call cpu_time(ftime)
  write(*,*) 'time2 ', ftime-stime

end program
```

time   0.1079841
time2   9.8984957E-02


## This test doesn't prevent the

This test doesn't prevent the compiler from taking shortcuts, as it can see that you never use any results.

One might expect the where..elsewhere... to be slower in this case, as it implies generation of a bunch of logical arrays.  If the results are so close, the compiler must have done a fair job of eliminating these.
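To make the "bunch of logical arrays" concrete, here is a minimal sketch of what a WHERE/ELSEWHERE construct means semantically: each masked ELSEWHERE applies only where all earlier masks were false and its own mask is true. The array names (`a`, `b`, `w`, `m`, `mask1`, `mask2`) and the tiny sample data are made up for illustration; a compiler is free to fuse or eliminate these temporaries, so this shows the meaning, not necessarily the generated code.

```fortran
program where_masks
  implicit none
  integer, parameter :: n = 8
  real    :: a(n), b(n), p(n)
  real    :: w(n,2), m(n,2)
  logical :: mask1(n), mask2(n)

  ! hypothetical sample data covering several sign combinations
  a = [ 1., -1., -1.,  1.,  2., -2.,  0.,  3.]
  b = [ 1.,  1., -1., -1.,  2.,  2.,  1.,  0.]
  p = a*b
  w = 0.
  m = 0.

  ! form 1: WHERE / ELSEWHERE as in the original test
  where (a > 0. .and. b > 0.)
    w(:,1) = w(:,1) + p
  elsewhere (a < 0. .and. b > 0.)
    w(:,2) = w(:,2) + p
  end where

  ! form 2: the same update written with explicit logical mask arrays
  mask1 = a > 0. .and. b > 0.
  mask2 = (.not. mask1) .and. (a < 0. .and. b > 0.)
  where (mask1) m(:,1) = m(:,1) + p
  where (mask2) m(:,2) = m(:,2) + p

  ! both forms must produce identical accumulators
  if (any(w /= m)) error stop 'mask forms disagree'
  print *, 'mask forms agree'
end program where_masks
```

With four branches, as in the original test, the construct needs a mask for each branch plus the running "not yet matched" mask, all of full array size unless the compiler manages to optimize them away.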

Your conditions seem sufficiently complicated to kill hope of vectorization, but in that context, elsewhere tends to be inefficient.

If you are trying to minimize cpu time, OpenMP is not the answer, as it generally trades use of more cores and total cpu time for the expectation of reduced elapsed time.
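A minimal sketch of the two points above, under some assumptions: it prints a checksum so the results are actually used, and it reports both CPU time (`cpu_time`) and elapsed time (`system_clock`) so the two can be compared. The program name, the smaller array sizes, and the use of the standard `random_number` intrinsic in place of `rand()` from `ifport` are choices made for this illustration, not part of the original test.

```fortran
program timing_sketch
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
  integer, parameter :: n1 = 2000, n2 = 1000   ! smaller than the original test
  real, allocatable :: x1(:,:), x2(:,:), acc(:,:,:)
  real :: c0, c1
  integer(int64) :: t0, t1, rate
  integer :: i, j

  allocate(x1(n1,n2), x2(n1,n2), acc(n1,n2,4))
  call random_number(x1); x1 = 2.*x1 - 1.      ! values in [-1, 1)
  call random_number(x2); x2 = 2.*x2 - 1.
  acc = 0.

  call cpu_time(c0)                 ! CPU time, summed over all threads
  call system_clock(t0, rate)       ! wall-clock ticks and tick rate
  do j = 1, n2
    do i = 1, n1
      if (x1(i,j) > 0. .and. x2(i,j) > 0.) then
        acc(i,j,1) = acc(i,j,1) + x1(i,j)*x2(i,j)
      else if (x1(i,j) < 0. .and. x2(i,j) > 0.) then
        acc(i,j,2) = acc(i,j,2) + x1(i,j)*x2(i,j)
      else if (x1(i,j) < 0. .and. x2(i,j) < 0.) then
        acc(i,j,3) = acc(i,j,3) + x1(i,j)*x2(i,j)
      else if (x1(i,j) > 0. .and. x2(i,j) < 0.) then
        acc(i,j,4) = acc(i,j,4) + x1(i,j)*x2(i,j)
      end if
    end do
  end do
  call system_clock(t1)
  call cpu_time(c1)

  print *, 'cpu s     ', c1 - c0
  print *, 'elapsed s ', real(t1 - t0)/real(rate)
  ! printing a value derived from acc keeps the loop from being discarded
  print *, 'checksum  ', sum(abs(acc))
end program timing_sketch
```

In a serial run the two times should be close; with OpenMP the CPU time would typically stay the same or grow while the elapsed time drops.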


## Chris,

Chris,

Experiment with something like this:

```
program main
use ifport
integer, parameter :: n1=10000, n2=5000
real, allocatable, dimension(:,:) :: x1,x2, x1x2
integer, allocatable, dimension(:,:) :: idx
real, allocatable, dimension(:,:,:) :: x3, x4
real :: stime, ftime
integer :: i,j,k

allocate(x1(n1,n2) ,x2(n1,n2), x1x2(n1,n2), idx(n1,n2))
allocate(x3(n1,n2,4), x4(n1,n2,4))
do j = 1, n2
do i = 1, n1
x1(i,j) = (2.*rand()-1.)
x2(i,j) = (2.*rand()-1.)
enddo
enddo
x3 =0.
x4 =0.

call cpu_time(stime)
where((x1.gt.0.).and.(x2.gt.0))
x3(:,:,1) = x3(:,:,1) + x1*x2
elsewhere((x1.lt.0.).and.(x2.gt.0))
x3(:,:,2) = x3(:,:,2) + x1*x2
elsewhere((x1.lt.0.).and.(x2.lt.0))
x3(:,:,3) = x3(:,:,3) + x1*x2
elsewhere((x1.gt.0.).and.(x2.lt.0))
x3(:,:,4) = x3(:,:,4) + x1*x2
end where
call cpu_time(ftime)
write(*,*) 'time ', ftime-stime

call cpu_time(stime)
do j=1,n2
do i=1,n1
if ((x1(i,j).gt.0.).and.(x2(i,j).gt.0)) then
x4(i,j,1) = x4(i,j,1) + x1(i,j)*x2(i,j)
elseif ((x1(i,j).lt.0.).and.(x2(i,j).gt.0)) then
x4(i,j,2) = x4(i,j,2) + x1(i,j)*x2(i,j)
elseif ((x1(i,j).lt.0.).and.(x2(i,j).lt.0)) then
x4(i,j,3) = x4(i,j,3) + x1(i,j)*x2(i,j)
elseif ((x1(i,j).gt.0.).and.(x2(i,j).lt.0)) then
x4(i,j,4) = x4(i,j,4) + x1(i,j)*x2(i,j)
endif
enddo
enddo
call cpu_time(ftime)
write(*,*) 'time2 ', ftime-stime
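! test 3: precompute the product and a quadrant index, then accumulate
! (x4 still holds the time2 results here; reset it before comparing values)
call cpu_time(stime)
x1x2 = x1*x2
idx = (int(sign(1.0,x1)+sign(2.0,x2))+5)/2
! *** note: the quadrant indexes differ from the IF version
!   x1 x2   sum   int  +5  /2
!   ge ge   1+2    3    8   4
!   lt ge  -1+2    1    6   3
!   lt lt  -1-2   -3    2   1
!   ge lt   1-2   -1    4   2

do j=1,n2
  do i=1,n1
    x4(i,j,idx(i,j)) = x4(i,j,idx(i,j)) + x1x2(i,j)
  enddo
enddo
call cpu_time(ftime)
write(*,*) 'time3 ', ftime-stime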

end program
```
```
time    0.9204060
time2   0.4056025
time3   0.2496016
```

My system did not have AVX2 (nor AVX512). I did not test on my KNC.

As noted in the comments, the quadrant indexes are different. Those indices were arbitrary to begin with, so redefining them shouldn't be much of an issue.

You will need to run some verification on the results. Note, when either or both of the x1 and x2 cells are zero, the product will be 0.0 and the accumulation into x4 will add 0.0. This calculation is faster than performing the series of if tests with conditional set.
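As a starting point for that verification, the following sketch checks the index formula on a handful of hand-picked points, including the zero cases. The sample values, the `expected` table, and the `to_if` remapping array are illustrative additions; `to_if` simply translates the new index back to the slot numbering used by the IF chain (1 = gt/gt, 2 = lt/gt, 3 = lt/lt, 4 = gt/lt).

```fortran
program check_quadrant_index
  implicit none
  real    :: x1(6), x2(6)
  integer :: idx(6), expected(6)
  ! to_if(k) = slot used by the IF chain for new index k
  integer, parameter :: to_if(4) = [3, 4, 2, 1]

  ! one point per quadrant, then two points with a zero coordinate
  x1 = [ 1., -1., -1.,  1.,  0.,  1.]
  x2 = [ 1.,  1., -1., -1.,  1.,  0.]

  ! same expression as in the time3 section
  idx = (int(sign(1.0, x1) + sign(2.0, x2)) + 5)/2

  ! values from the comment table; a (positive) zero is treated as "ge"
  expected = [4, 3, 1, 2, 4, 4]
  if (any(idx /= expected)) error stop 'unexpected quadrant index'

  ! remapped, the first four points land in IF-chain slots 1..4
  if (any(to_if(idx(1:4)) /= [1, 2, 3, 4])) error stop 'remap mismatch'

  print *, 'index mapping verified: ', idx
end program check_quadrant_index
```

The zero cases land in slot 4, where the IF chain would skip them entirely, but since the product is 0.0 the accumulated values are unchanged.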

One of the hidden problems with "where" is the consumption of excessive stack for the temporary arrays. It is also not directly usable with OpenMP, although you can use it in OpenMP if you manually slice up the arrays. The do-loop formats are easily adaptable to OpenMP.
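A rough sketch of how the indexed do-loop might be adapted to OpenMP, assuming the outer `j` loop is the one to split across threads. The program name, the reduced array sizes, the scalar `q`, and the use of `random_number` are assumptions for this example. Each thread writes only to its own columns `j`, so no synchronization is needed. Without the compiler's OpenMP option (for example `-qopenmp` with recent ifort or `-fopenmp` with gfortran) the directives are plain comments and the code runs serially.

```fortran
program omp_index_loop
  implicit none
  integer, parameter :: n1 = 2000, n2 = 1000   ! reduced sizes for a quick run
  real, allocatable :: x1(:,:), x2(:,:), x4(:,:,:)
  integer :: i, j, q

  allocate(x1(n1,n2), x2(n1,n2), x4(n1,n2,4))
  call random_number(x1); x1 = 2.*x1 - 1.
  call random_number(x2); x2 = 2.*x2 - 1.
  x4 = 0.

  ! split the column loop across threads; i and q are per-thread scalars
  !$omp parallel do private(i, q)
  do j = 1, n2
    do i = 1, n1
      ! quadrant index 1..4, same expression as in the time3 section
      q = (int(sign(1.0, x1(i,j)) + sign(2.0, x2(i,j))) + 5)/2
      x4(i,j,q) = x4(i,j,q) + x1(i,j)*x2(i,j)
    end do
  end do
  !$omp end parallel do

  ! sanity check: exactly one slot per cell received the product
  if (any(abs(sum(x4, dim=3) - x1*x2) > 1.e-6)) error stop 'accumulation mismatch'
  print *, 'omp loop ok, checksum ', sum(abs(x4))
end program omp_index_loop
```

When timing a threaded version, `system_clock` is the more useful clock, since `cpu_time` generally reports time summed over all threads.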

Jim Dempsey


## First of all, Thanks.

First of all, Thanks.

Mr Dempsey,

I ran some tests on the changes you suggested and found good and bad performance under different conditions. I compiled with different optimization levels and realized that the optimization level (-O#) wasn't actually doing much; in fact, -O3 made it slower. Overall, I think the results match yours. Now, with processor-specific optimization (-xHost), the do-if method seems to surpass the pre-indexing (although not by much). I am very curious what the compiler is doing here.

I left some other info below.

Thanks for the help!

```
ifort -O3 -xHost test2.f90
time    0.4919240
time2   0.2159681
time3   0.2829571

ifort -O3 test2.f90
time    1.215815
time2   0.5139220
time3   0.3489470

ifort -O2 test2.f90
time    1.143826
time2   0.5239201
time3   0.2969561

ifort test2.f90
time    1.144826
time2   0.5249200
time3   0.2969551

ifort -xHost test2.f90
time    0.6429019
time2   0.2169669
time3   0.2899551
```

```
ifort version 14.0.3

Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
```


## If the compiler performs some

If the compiler performs some AVX vectorization under -xHost, you can read about it in the files produced by -qopt-report4.


## This will give better L1

This will give better L1/L2 cache access patterns:

```
program main
use ifport
integer, parameter :: n1=10000, n2=5000
!dir$ attributes align: 64:: x1,x2, x1x2, idx, x3, x4
real, allocatable, dimension(:,:) :: x1, x2
real, allocatable, dimension(:) :: x1x2
integer(1), allocatable, dimension(:) :: idx
real, allocatable, dimension(:,:,:) :: x3, x4
real :: stime, ftime
integer :: i,j,k

allocate(x1(n1,n2) ,x2(n1,n2), x1x2(n1), idx(n1))
allocate(x3(n1,n2,4), x4(n1,n2,4))
do j = 1, n2
do i = 1, n1
x1(i,j) = (2.*rand()-1.)
x2(i,j) = (2.*rand()-1.)
enddo
enddo
x3 =0.
x4 =0.

call cpu_time(stime)
where((x1.gt.0.).and.(x2.gt.0))
x3(:,:,1) = x3(:,:,1) + x1*x2
elsewhere((x1.lt.0.).and.(x2.gt.0))
x3(:,:,2) = x3(:,:,2) + x1*x2
elsewhere((x1.lt.0.).and.(x2.lt.0))
x3(:,:,3) = x3(:,:,3) + x1*x2
elsewhere((x1.gt.0.).and.(x2.lt.0))
x3(:,:,4) = x3(:,:,4) + x1*x2
end where
call cpu_time(ftime)
write(*,*) 'time ', ftime-stime

call cpu_time(stime)
do j=1,n2
do i=1,n1
if ((x1(i,j).gt.0.).and.(x2(i,j).gt.0)) then
x4(i,j,1) = x4(i,j,1) + x1(i,j)*x2(i,j)
elseif ((x1(i,j).lt.0.).and.(x2(i,j).gt.0)) then
x4(i,j,2) = x4(i,j,2) + x1(i,j)*x2(i,j)
elseif ((x1(i,j).lt.0.).and.(x2(i,j).lt.0)) then
x4(i,j,3) = x4(i,j,3) + x1(i,j)*x2(i,j)
elseif ((x1(i,j).gt.0.).and.(x2(i,j).lt.0)) then
x4(i,j,4) = x4(i,j,4) + x1(i,j)*x2(i,j)
endif
enddo
enddo
call cpu_time(ftime)
write(*,*) 'time2 ', ftime-stime
! test 3: same indexing idea, but one column at a time so that the
! x1x2 and idx work arrays (length n1) stay in cache
call cpu_time(stime)
! *** note: the quadrant indexes differ from the IF version
!   x1 x2   sum   int  +5  /2
!   ge ge   1+2    3    8   4
!   lt ge  -1+2    1    6   3
!   lt lt  -1-2   -3    2   1
!   ge lt   1-2   -1    4   2

do j=1,n2
  x1x2 = x1(:,j)*x2(:,j)
  idx = (int(sign(1.0,x1(:,j))+sign(2.0,x2(:,j)))+5)/2
  do i=1,n1
    x4(i,j,idx(i)) = x4(i,j,idx(i)) + x1x2(i)
  enddo
enddo
call cpu_time(ftime)
write(*,*) 'time3 ', ftime-stime

end program
```

Results:

```
time    0.9204059
time2   0.4212027
time3   0.1560011
```

Jim Dempsey