Community
cancel
Showing results for
Did you mean:
Beginner
97 Views

## Poor speed in MIC

Dear All:

As learning purpose, i tried to code a program which find total number prime number for a given range. isprime function finds  if a number is prime or not. I added !\$omp declare simd to vectorize that function. I do not know why, program perform three times slower in intel phi than host.

Info:

Host: 16 sec

MIC: 43 sec

```
program prime_count
use omp_lib
implicit none
integer::min_limit,max_limit
integer::i
logical::prime
double precision::starttime,finishtime
count=0
min_limit=1
max_limit=100000000

!\$omp target device(0) map(to : min_limit,max_limit) map(tofrom:count )
!\$omp parallel private(i,prime) default(shared)

!\$omp master
starttime=omp_get_wtime()
!\$omp end master

!\$omp do simd reduction(+:count) schedule(dynamic)
do i=min_limit,max_limit
prime=isprime(i)
if(prime)then
count=count+1
end if
end do
!\$omp end do simd

!\$omp master
finishtime=omp_get_wtime()
!\$omp end master

!\$omp end parallel
!\$omp end target
print*,"Total prime number found: ",count
print*,"Total time elapsed: ",(finishtime-starttime)
contains

logical function isprime(num)
implicit none
!\$omp declare target
!\$omp declare simd(isprime)
integer,intent(in)::num
integer::i,iteration
isprime=.true.
if (num/=1 .and. num/=2)then
iteration=ceiling(sqrt(real(num)))
do i=2,iteration
if (mod(num,i)==0)then
isprime=.false.
exit
end if
end do
elseif (num==2)then
isprime=.true.
elseif (num==1)then
isprime=.false.
end if
end function isprime
end program prime_count```

7 Replies
Black Belt
97 Views

If your case can't take full advantage of proportionally more threads as well as vectorization on MIC, such an observation about performance is to be expected.

Among smaller tweaks you might investigate would be changing chunk size e.g. (dynamic,2) or schedule(auto), as well as optimizing number of threads.  MIC and host tend to require different choices.

Another would be whether you can get by with half precision sqrt, as sqrt is inefficient on MIC.

Black Belt
97 Views

I would also suggest making a specialized isprimeGT2 function where it is known that the input arg is .gt. 2.

Additionally, because you are using the known characteristic of primes testing of limiting the mod test to up to the square root of the number, one would expect that you also would use the knowledge that the primes are odd numbers. IOW start at first odd number at or above min_limit.

```      begin_limit = IOR(max(min_limit,3), 1) ! first odd number .ge. max(min_limit,3)
...
do i=begin_limit,max_limit,2 ! only test odd numbers .gt. 2
```

Then, with the foreknowledge that you are testing only odd numbers, change your do i=2,iteration to do i=3,iteration,2
IOW test only mod's of odd numbers.

As for vectorization, you should then consider adding a vector counter, which I will let you ponder as to how to write using the following hints:

partition the iteration space (noting you are doing every odd number), such that each thread, excepting the last thread has an even multiple of vector width number of odd numbers.

Initialize a vector (example n * 16) with the next sequence of odd numbers (some small but effective number that is a multiple of vector width).

Next, perform your do i=3,iteration,2 loop producing an output vector containing the mod(tempVector, i)

Next count the 0's in the output vector.

The above should provide you with sufficient hints to efficiently solve your problem.

Jim Dempsey

Beginner
97 Views

Tim, thanks for suggestions. Jim, i was really looking  for the second part of your reply regarding vectorization. I have a query, in 512 bit vector processor, it can handle 8 DP or 16 SP at a time, I would like know, how many function call can be managed by 512 bit vector processor for my case?  Another question: If 'x'  number  of functions call can be handle by vertor processor, then will schedule(dynamic,x) do proper vectorization?

--Masrul

Employee
97 Views

Masrul
Yes, the 512-bit vector unit can handle 8 double-precision or 16 single-precision operations at once.  schedule(dynamic,x) is an OpenMP thing - all vectorization will happen within the OpenMP parallel region, but not across regions.

Charles

Black Belt
97 Views

Masrul,

Here is a first cut at vectorization:

```!  isprime_vector.f90
module mod_isprime_vector
integer::min_limit,max_limit
double precision::starttime,finishtime

contains

logical function isprime(num)
implicit none
!\$omp declare target
integer,intent(in)::num
integer::i,iteration
isprime=.true.
if (num/=1 .and. num/=2)then
iteration=ceiling(sqrt(real(num)))
do i=2,iteration
if (mod(num,i)==0)then
isprime=.false.
exit
end if
end do
elseif (num==2)then
isprime=.true.
elseif (num==1)then
isprime=.false.
end if
end function isprime

logical function isprime_vector(num)
implicit none
!\$omp declare target
integer,intent(in)::num
integer::i,j,j_limit,iteration
integer, parameter :: element_size = sizeof(num)
integer, parameter :: cache_line_size = 64
integer, parameter :: vector_length = cache_line_size / element_size
!DIR\$ ATTRIBUTES ALIGN : 64 :: vector_num, vector_i, vector_mod
integer, dimension(vector_length) :: vector_num, vector_i, vector_i_advance, vector_mod

isprime_vector = .false.
if(num > 2) then
if(mod(num,2) == 0) return
vector_num = num
do i=1,vector_length
vector_i(i) = i*2+1 ! 3,5,7,...
end do

iteration=max(ceiling(sqrt(real(num))),3)
do i=3,iteration,2*vector_length
vector_mod = (vector_num / vector_i) * vector_i
vector_mod = vector_num - vector_mod
j_limit = min((num - vector_i(1))/2, vector_length)
do j=1,j_limit
if(vector_mod(j) .eq. 0) return
end do
end do
isprime_vector = .true.
else
isprime_vector = (num == 2)
endif
end function isprime_vector

end module mod_isprime_vector

program prog_isprime_vector
use mod_isprime_vector
implicit none
min_limit=1
max_limit=100000000

print *,'max_limit',max_limit
print *,'prime_count'
call prime_count
print *
print *,'prime_count_vector'
call prime_count_vector
print *

end program prog_isprime_vector

subroutine prime_count
use mod_isprime_vector
use omp_lib
implicit none
integer::i, prime

count=0
!*The following is not transferring data into the target
!*!\$omp target device(0) map(to : min_limit,max_limit) map(tofrom:count ) map(from : nthreads, starttime, finishtime)
!
!* Instead, had to use target update to transfer in the values
!\$omp target update device(0) to(min_limit,max_limit, count)
!* Then target the code to the device
!\$omp target device(0)

!\$omp parallel private(i,prime) default(shared) reduction(+:count)

!\$omp master
starttime=omp_get_wtime()
!\$omp end master

!\$omp do simd reduction(+:count) schedule(dynamic)
do i=min_limit,max_limit
prime=isprime(i)
if(prime)then
count=count+1
end if
end do
!\$omp end do simd

!\$omp master
finishtime=omp_get_wtime()
!\$omp end master

!\$omp end parallel
!\$omp end target
!* Added target update to extract data from device
print*,"Total prime number found: ",count
print*,"Total time elapsed: ",(finishtime-starttime)

end subroutine prime_count

subroutine prime_count_vector
use mod_isprime_vector
use omp_lib
implicit none
integer::i, j, prime
integer, parameter :: vector_length = 256   ! make multiple of cachline
!DIR\$ ATTRIBUTES ALIGN : 64 :: vector
integer :: vector(vector_length)

count=0
!*The following is not transferring data into the target
!*!\$omp target device(0) map(to : min_limit,max_limit) map(tofrom:count ) map(from : nthreads, starttime, finishtime)
!
!* Instead, had to use target update to transfer in the values
!\$omp target update device(0) to(min_limit,max_limit, count)
!* Then target the code to the device
!\$omp target device(0)

!\$omp parallel private(i,j,vector,prime) default(shared)

!\$omp master
print *,'master',min_limit,max_limit
starttime=omp_get_wtime()
!\$omp end master

!\$omp do schedule(dynamic,1) reduction(+:count)
do i=min_limit,max_limit
prime=isprime_vector(i)
if(prime)then
count=count+1
end if
end do
!\$omp end do

!\$omp master
finishtime=omp_get_wtime()
!\$omp end master

!\$omp end parallel
!\$omp end target
!* Added target update to extract data from device
print*,"Total prime number found: ",count
print*,"Total time elapsed: ",(finishtime-starttime)

end subroutine prime_count_vector
```

And results (Xeon Phi 5110P)

``` max_limit   100000000
prime_count
Total prime number found:      5761455
Total time elapsed:    46.4514100551605

prime_count_vector
master           1   100000000
Total prime number found:      5761455
Total time elapsed:    6.94675087928772
```

********** NOTES ***************

I am using IVF 16.0 update 1 on Windows 7 Pro x64

Look in the source code above for comments containing "!*"

Some of the !\$omp target clauses are not working as commented.
I had to substitute !\$omp target update in places.

If I double the vector_length, the results take: 6.97 seconds (within margin of error)
Halve: 8.31 seconds

Selecting vector_size to that which fits in one cache line provides for the better performance.

Jim Dempsey

Black Belt
97 Views

FWIW

```                !dec\$ if(.false.)
do j=1,j_limit
if(vector_mod(j) .eq. 0) return
end do
!dec\$ else
if(any(vector_mod(1:j_limit) == 0)) return
!dec\$ endif
```

Using ANY is slower at 9.53 seconds

Jim Dempsey

Beginner
97 Views

Jim,

Thank you very much for your painstaking effort for me. I will go through your code and will try to make sense how things work.

--Masrul