Hi everyone,
This is my first post. Hurray!!
Today I was testing a conditional average of a 2-dimensional field based on quadrant location. Looking for a way to speed it up, I found the WHERE construct, which I thought would be faster than an IF inside a nested do-loop (through vectorization or something else). It turns out that WHERE is a little slower than the IF version. I was wondering whether I am doing something wrong, or whether it will be faster only if I use OpenMP. Below is the code I used for this test.
Thanks in advance.
program main
  use ifport
  integer, parameter :: n1=10000, n2=5000
  real, dimension(n1,n2) :: x1,x2
  real, dimension(n1,n2,4) :: x3, x4
  real :: stime, ftime
  integer :: i,j,k

  do j = 1, n2
    do i = 1, n1
      x1(i,j) = (2.*rand()-1.)
      x2(i,j) = (2.*rand()-1.)
    enddo
  enddo
  x3 =0.
  x4 =0.

  call cpu_time(stime)
  where((x1.gt.0.).and.(x2.gt.0))
    x3(:,:,1) = x3(:,:,1) + x1*x2
  elsewhere((x1.lt.0.).and.(x2.gt.0))
    x3(:,:,2) = x3(:,:,2) + x1*x2
  elsewhere((x1.lt.0.).and.(x2.lt.0))
    x3(:,:,3) = x3(:,:,3) + x1*x2
  elsewhere((x1.gt.0.).and.(x2.lt.0))
    x3(:,:,4) = x3(:,:,4) + x1*x2
  end where
  call cpu_time(ftime)
  write(*,*) 'time ', ftime-stime

  call cpu_time(stime)
  do j=1,n2
    do i=1,n1
      if ((x1(i,j).gt.0.).and.(x2(i,j).gt.0)) then
        x4(i,j,1) = x4(i,j,1) + x1(i,j)*x2(i,j)
      elseif ((x1(i,j).lt.0.).and.(x2(i,j).gt.0)) then
        x4(i,j,2) = x4(i,j,2) + x1(i,j)*x2(i,j)
      elseif ((x1(i,j).lt.0.).and.(x2(i,j).lt.0)) then
        x4(i,j,3) = x4(i,j,3) + x1(i,j)*x2(i,j)
      elseif ((x1(i,j).gt.0.).and.(x2(i,j).lt.0)) then
        x4(i,j,4) = x4(i,j,4) + x1(i,j)*x2(i,j)
      endif
    enddo
  enddo
  call cpu_time(ftime)
  write(*,*) 'time2 ', ftime-stime
end program
time 0.1079841
time2 9.8984957E-02
This test doesn't prevent the compiler from taking shortcuts, as it can see that you never use any results.
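One way to keep the compiler from discarding the work is to print something that depends on every element of the result after the timed region. The following is only a minimal sketch of that idea, not the original test: it assumes smaller array sizes for a quick run, uses the standard random_number intrinsic instead of ifport's rand(), and the program name is made up for illustration.

```fortran
program checksum_sketch
  implicit none
  ! Assumption: smaller sizes than in the thread, so the sketch runs quickly
  integer, parameter :: n1 = 1000, n2 = 500
  real, allocatable :: x1(:,:), x2(:,:), x4(:,:,:)
  real :: stime, ftime
  integer :: i, j

  allocate(x1(n1,n2), x2(n1,n2), x4(n1,n2,4))
  ! Standard intrinsic used here in place of ifport's rand()
  call random_number(x1)
  call random_number(x2)
  x1 = 2.0*x1 - 1.0
  x2 = 2.0*x2 - 1.0
  x4 = 0.0

  call cpu_time(stime)
  do j = 1, n2
    do i = 1, n1
      if (x1(i,j) > 0.0 .and. x2(i,j) > 0.0) then
        x4(i,j,1) = x4(i,j,1) + x1(i,j)*x2(i,j)
      else if (x1(i,j) < 0.0 .and. x2(i,j) > 0.0) then
        x4(i,j,2) = x4(i,j,2) + x1(i,j)*x2(i,j)
      else if (x1(i,j) < 0.0 .and. x2(i,j) < 0.0) then
        x4(i,j,3) = x4(i,j,3) + x1(i,j)*x2(i,j)
      else if (x1(i,j) > 0.0 .and. x2(i,j) < 0.0) then
        x4(i,j,4) = x4(i,j,4) + x1(i,j)*x2(i,j)
      end if
    end do
  end do
  call cpu_time(ftime)

  write(*,*) 'time2    ', ftime - stime
  ! The sum depends on every element of x4, so the loop results are actually used
  write(*,*) 'checksum ', sum(x4)
end program checksum_sketch
```

Printing the same checksum for each variant also gives a cheap first check that the variants compute the same totals.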
One might expect the where..elsewhere... to be slower in this case, as it implies generation of a bunch of logical arrays. If the results are so close, the compiler must have done a fair job of eliminating these.
Your conditions seem complicated enough to kill any hope of vectorization, and in that context ELSEWHERE tends to be inefficient.
If you are trying to minimize cpu time, OpenMP is not the answer, as it generally trades use of more cores and total cpu time for the expectation of reduced elapsed time.
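To see the difference between CPU time and elapsed time, both can be measured around the same region: cpu_time reports processor time, while system_clock reports wall-clock time. This is a rough sketch with a made-up workload and program name. In a serial run the two numbers should be close. In a threaded run, cpu_time commonly reports time summed over all threads, but that behaviour is implementation dependent, so treat it as an assumption to check on your compiler.

```fortran
program timing_sketch
  use iso_fortran_env, only: int64
  implicit none
  integer, parameter :: n = 5000000      ! assumed workload size, for illustration only
  real, allocatable :: a(:)
  real :: c0, c1
  integer(int64) :: t0, t1, rate
  integer :: i

  allocate(a(n))
  call random_number(a)

  call system_clock(count=t0, count_rate=rate)   ! wall-clock ticks and ticks per second
  call cpu_time(c0)                              ! processor time in seconds
  do i = 1, n
    a(i) = sqrt(a(i)) + 1.0
  end do
  call cpu_time(c1)
  call system_clock(count=t1)

  write(*,*) 'cpu time     ', c1 - c0
  write(*,*) 'elapsed time ', real(t1 - t0) / real(rate)
  ! Use the result so the loop cannot be optimized away
  write(*,*) 'checksum     ', sum(a)
end program timing_sketch
```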
Chris,
Experiment with something like this:
program main
  use ifport
  integer, parameter :: n1=10000, n2=5000
  real, allocatable, dimension(:,:) :: x1,x2, x1x2
  integer, allocatable, dimension(:,:) :: idx
  real, allocatable, dimension(:,:,:) :: x3, x4
  real :: stime, ftime
  integer :: i,j,k

  allocate(x1(n1,n2) ,x2(n1,n2), x1x2(n1,n2), idx(n1,n2))
  allocate(x3(n1,n2,4), x4(n1,n2,4))
  do j = 1, n2
    do i = 1, n1
      x1(i,j) = (2.*rand()-1.)
      x2(i,j) = (2.*rand()-1.)
    enddo
  enddo
  x3 =0.
  x4 =0.

  call cpu_time(stime)
  where((x1.gt.0.).and.(x2.gt.0))
    x3(:,:,1) = x3(:,:,1) + x1*x2
  elsewhere((x1.lt.0.).and.(x2.gt.0))
    x3(:,:,2) = x3(:,:,2) + x1*x2
  elsewhere((x1.lt.0.).and.(x2.lt.0))
    x3(:,:,3) = x3(:,:,3) + x1*x2
  elsewhere((x1.gt.0.).and.(x2.lt.0))
    x3(:,:,4) = x3(:,:,4) + x1*x2
  end where
  call cpu_time(ftime)
  write(*,*) 'time ', ftime-stime

  call cpu_time(stime)
  do j=1,n2
    do i=1,n1
      if ((x1(i,j).gt.0.).and.(x2(i,j).gt.0)) then
        x4(i,j,1) = x4(i,j,1) + x1(i,j)*x2(i,j)
      elseif ((x1(i,j).lt.0.).and.(x2(i,j).gt.0)) then
        x4(i,j,2) = x4(i,j,2) + x1(i,j)*x2(i,j)
      elseif ((x1(i,j).lt.0.).and.(x2(i,j).lt.0)) then
        x4(i,j,3) = x4(i,j,3) + x1(i,j)*x2(i,j)
      elseif ((x1(i,j).gt.0.).and.(x2(i,j).lt.0)) then
        x4(i,j,4) = x4(i,j,4) + x1(i,j)*x2(i,j)
      endif
    enddo
  enddo
  call cpu_time(ftime)
  write(*,*) 'time2 ', ftime-stime

  call cpu_time(stime)
  x1x2 = x1*x2
  idx = (int(sign(1.0,x1)+sign(2.0,x2))+5)/2
  ! *** note the quadrant indexes differ
  !  x1 x2    sum   int  +5  /2
  !  ge ge    1+2     3   8   4
  !  lt ge   -1+2     1   6   3
  !  lt lt   -1-2    -3   2   1
  !  ge lt    1-2    -1   4   2
  do j=1,n2
    do i=1,n1
      x4(i,j,idx(i,j)) = x4(i,j,idx(i,j)) + x1x2(i,j)
    enddo
  enddo
  call cpu_time(ftime)
  write(*,*) 'time3 ', ftime-stime
end program
time 0.9204060
time2 0.4056025
time3 0.2496016
That is an additional 62.5%: the IF loop (time2) takes 62.5% longer than the indexed loop (time3).
My system did not have AVX2 (nor AVX512). I did not test on my KNC.
As noted in the comments, the quadrant indices are different. Those indices were arbitrary to begin with, so redefining them shouldn't be much of an issue.
You will need to run some verification on the results. Note, when either or both of the x1 and x2 cells are zero, the product will be 0.0 and the accumulation into x4 will add 0.0. This calculation is faster than performing the series of if tests with conditional set.
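As a starting point for that verification, the index formula can be checked on one sample point per quadrant against the values in the comment table. This is just a small sketch: the sample points and the program name are made up, and it only checks the mapping, not the full accumulation.

```fortran
program idx_check
  implicit none
  real :: xs(4), ys(4)
  integer :: expected(4), k, idx

  ! One sample point per quadrant, in the same order as the comment table:
  ! (ge,ge), (lt,ge), (lt,lt), (ge,lt)
  xs = [ 0.5, -0.5, -0.5,  0.5 ]
  ys = [ 0.5,  0.5, -0.5, -0.5 ]
  expected = [ 4, 3, 1, 2 ]

  do k = 1, 4
    ! Same expression as in the indexed loop, applied to scalars
    idx = (int(sign(1.0, xs(k)) + sign(2.0, ys(k))) + 5) / 2
    if (idx /= expected(k)) error stop 'unexpected quadrant index'
    write(*,*) xs(k), ys(k), idx
  end do

  ! A positive zero takes the "ge" branch of sign(), so (0.0, 0.5) maps to 4.
  ! Its product is 0.0, so the accumulation is unaffected.
  idx = (int(sign(1.0, 0.0) + sign(2.0, 0.5)) + 5) / 2
  if (idx /= 4) error stop 'unexpected index for zero input'
  write(*,*) 'all quadrant indices match the table'
end program idx_check
```

For the full arrays, comparing per-quadrant sums between the IF version and the indexed version (after remapping the quadrant numbers) would be the natural next check.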
One of the hidden problems with WHERE is the consumption of excessive stack for the temporary arrays. It is also not directly usable with OpenMP, although you can use it in OpenMP if you manually slice up the arrays. The do-loop formats are easily adaptable to OpenMP.
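To illustrate that last point, here is a rough sketch of the indexed do loop with an OpenMP directive on the outer loop. It is not tuned or timed. The array sizes, the scalar q, and the program name are assumptions for illustration, and random_number stands in for ifport's rand(). Each (i,j) element is updated by exactly one iteration, so the threads do not write to the same location.

```fortran
program omp_sketch
  implicit none
  ! Assumption: smaller sizes than in the thread, so the sketch runs quickly
  integer, parameter :: n1 = 1000, n2 = 500
  real, allocatable :: x1(:,:), x2(:,:), x4(:,:,:)
  integer :: i, j, q

  allocate(x1(n1,n2), x2(n1,n2), x4(n1,n2,4))
  call random_number(x1)
  call random_number(x2)
  x1 = 2.0*x1 - 1.0
  x2 = 2.0*x2 - 1.0
  x4 = 0.0

  ! Columns (j) are distributed over threads; i and q are per-thread scratch variables
  !$omp parallel do private(i, q)
  do j = 1, n2
    do i = 1, n1
      ! Same quadrant index formula as in the indexed loop above
      q = (int(sign(1.0, x1(i,j)) + sign(2.0, x2(i,j))) + 5) / 2
      x4(i,j,q) = x4(i,j,q) + x1(i,j)*x2(i,j)
    end do
  end do
  !$omp end parallel do

  ! Use the result so the loop cannot be optimized away
  write(*,*) 'checksum ', sum(x4)
end program omp_sketch
```

Build it with your compiler's OpenMP option (for example -fopenmp with gfortran). Without that option the directives are treated as comments and the loop runs serially.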
Jim Dempsey
First of all, thanks.
Mr Dempsey,
I ran some tests with the changes you suggested and found good and bad performance under different conditions. I compiled with several optimization levels and realized that the optimization level (-O#) wasn't actually doing much; in fact, -O3 made it slower. Overall, I think the results match yours. With processor-specific optimization (-xHost), however, the do-if method seems to outperform the pre-indexing (although not by much). I am very curious about what the compiler is doing here.
I left some other info below.
Thanks for the help!
ifort -O3 -xHost test2.f90
time 0.4919240
time2 0.2159681
time3 0.2829571

ifort -O3 test2.f90
time 1.215815
time2 0.5139220
time3 0.3489470

ifort -O2 test2.f90
time 1.143826
time2 0.5239201
time3 0.2969561

ifort test2.f90
time 1.144826
time2 0.5249200
time3 0.2969551

ifort -xHost test2.f90
time 0.6429019
time2 0.2169669
time3 0.2899551

ifort version 14.0.3
Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
If the compiler performs some AVX vectorization under xHost, you can read about it in the files produced by -qopt-report4.
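This version will give better L1/L2 cache access patterns. It computes the product and the quadrant index one column at a time, in work arrays of length n1: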
program main
  use ifport
  integer, parameter :: n1=10000, n2=5000
  !dir$ attributes align: 64:: x1,x2, x1x2, idx, x3, x4
  real, allocatable, dimension(:,:) :: x1, x2
  real, allocatable, dimension(:) :: x1x2
  integer(1), allocatable, dimension(:) :: idx
  real, allocatable, dimension(:,:,:) :: x3, x4
  real :: stime, ftime
  integer :: i,j,k

  allocate(x1(n1,n2) ,x2(n1,n2), x1x2(n1), idx(n1))
  allocate(x3(n1,n2,4), x4(n1,n2,4))
  do j = 1, n2
    do i = 1, n1
      x1(i,j) = (2.*rand()-1.)
      x2(i,j) = (2.*rand()-1.)
    enddo
  enddo
  x3 =0.
  x4 =0.

  call cpu_time(stime)
  where((x1.gt.0.).and.(x2.gt.0))
    x3(:,:,1) = x3(:,:,1) + x1*x2
  elsewhere((x1.lt.0.).and.(x2.gt.0))
    x3(:,:,2) = x3(:,:,2) + x1*x2
  elsewhere((x1.lt.0.).and.(x2.lt.0))
    x3(:,:,3) = x3(:,:,3) + x1*x2
  elsewhere((x1.gt.0.).and.(x2.lt.0))
    x3(:,:,4) = x3(:,:,4) + x1*x2
  end where
  call cpu_time(ftime)
  write(*,*) 'time ', ftime-stime

  call cpu_time(stime)
  do j=1,n2
    do i=1,n1
      if ((x1(i,j).gt.0.).and.(x2(i,j).gt.0)) then
        x4(i,j,1) = x4(i,j,1) + x1(i,j)*x2(i,j)
      elseif ((x1(i,j).lt.0.).and.(x2(i,j).gt.0)) then
        x4(i,j,2) = x4(i,j,2) + x1(i,j)*x2(i,j)
      elseif ((x1(i,j).lt.0.).and.(x2(i,j).lt.0)) then
        x4(i,j,3) = x4(i,j,3) + x1(i,j)*x2(i,j)
      elseif ((x1(i,j).gt.0.).and.(x2(i,j).lt.0)) then
        x4(i,j,4) = x4(i,j,4) + x1(i,j)*x2(i,j)
      endif
    enddo
  enddo
  call cpu_time(ftime)
  write(*,*) 'time2 ', ftime-stime

  call cpu_time(stime)
  ! *** note the quadrant indexes differ
  !  x1 x2    sum   int  +5  /2
  !  ge ge    1+2     3   8   4
  !  lt ge   -1+2     1   6   3
  !  lt lt   -1-2    -3   2   1
  !  ge lt    1-2    -1   4   2
  do j=1,n2
    x1x2 = x1(:,j)*x2(:,j)
    idx = (int(sign(1.0,x1(:,j))+sign(2.0,x2(:,j)))+5)/2
    do i=1,n1
      x4(i,j,idx(i)) = x4(i,j,idx(i)) + x1x2(i)
    enddo
  enddo
  call cpu_time(ftime)
  write(*,*) 'time3 ', ftime-stime
end program
Results:
time 0.9204059
time2 0.4212027
time3 0.1560011
Jim Dempsey
