Poor speed in MIC

Masrul · ‎02-07-2016

Dear All:

As learning purpose, i tried to code a program which find total number prime number for a given range. isprime function finds if a number is prime or not. I added !$omp declare simd to vectorize that function. I do not know why, program perform three times slower in intel phi than host.

Info:

Host: 16 sec

MIC: 43 sec

 
   program prime_count
       use omp_lib
       implicit none
       integer::min_limit,max_limit
       integer::count,nthreads
       integer::i
       logical::prime
       double precision::starttime,finishtime
       count=0
       min_limit=1
       max_limit=100000000


       !$omp target device(0) map(to : min_limit,max_limit) map(tofrom:count ) 
       !$omp parallel private(i,prime) default(shared)

       !$omp master
       starttime=omp_get_wtime()
       nthreads=omp_get_num_threads()
       !$omp end master


       !$omp do simd reduction(+:count) schedule(dynamic)
       do i=min_limit,max_limit
           prime=isprime(i)
           if(prime)then
               count=count+1
           end if
       end do
       !$omp end do simd 

       !$omp master
       finishtime=omp_get_wtime()
       !$omp end master

       !$omp end parallel 
       !$omp end target
       print*,"Total threads:",nthreads
       print*,"Total prime number found: ",count
       print*,"Total time elapsed: ",(finishtime-starttime)
 contains

   logical function isprime(num)
       implicit none
       !$omp declare target
       !$omp declare simd(isprime) 
       integer,intent(in)::num
       integer::i,iteration
       isprime=.true.
       if (num/=1 .and. num/=2)then
           iteration=ceiling(sqrt(real(num)))
           do i=2,iteration
               if (mod(num,i)==0)then
                   isprime=.false.
                   exit
               end if
           end do
       elseif (num==2)then
           isprime=.true.
       elseif (num==1)then
           isprime=.false.
       end if
   end function isprime
   end program prime_count

TimP · ‎02-08-2016

If your case can't take full advantage of proportionally more threads as well as vectorization on MIC, such an observation about performance is to be expected.

Among smaller tweaks you might investigate would be changing chunk size e.g. (dynamic,2) or schedule(auto), as well as optimizing number of threads. MIC and host tend to require different choices.

Another would be whether you can get by with half precision sqrt, as sqrt is inefficient on MIC.

jimdempseyatthecove · ‎02-08-2016

I would also suggest making a specialized isprimeGT2 function where it is known that the input arg is .gt. 2.

Additionally, because you are using the known characteristic of primes testing of limiting the mod test to up to the square root of the number, one would expect that you also would use the knowledge that the primes are odd numbers. IOW start at first odd number at or above min_limit.

      begin_limit = IOR(max(min_limit,3), 1) ! first odd number .ge. max(min_limit,3)
...
      do i=begin_limit,max_limit,2 ! only test odd numbers .gt. 2

Then, with the foreknowledge that you are testing only odd numbers, change your do i=2,iteration to do i=3,iteration,2
IOW test only mod's of odd numbers.

As for vectorization, you should then consider adding a vector counter, which I will let you ponder as to how to write using the following hints:

partition the iteration space (noting you are doing every odd number), such that each thread, excepting the last thread has an even multiple of vector width number of odd numbers.

Initialize a vector (example n * 16) with the next sequence of odd numbers (some small but effective number that is a multiple of vector width).

Next, perform your do i=3,iteration,2 loop producing an output vector containing the mod(tempVector, i)

Next count the 0's in the output vector.

The above should provide you with sufficient hints to efficiently solve your problem.

Jim Dempsey

Masrul · ‎02-08-2016

Tim, thanks for suggestions. Jim, i was really looking for the second part of your reply regarding vectorization. I have a query, in 512 bit vector processor, it can handle 8 DP or 16 SP at a time, I would like know, how many function call can be managed by 512 bit vector processor for my case? Another question: If 'x' number of functions call can be handle by vertor processor, then will schedule(dynamic,x) do proper vectorization?

--Masrul

Charles_C_Intel1 · ‎02-09-2016

Masrul
Yes, the 512-bit vector unit can handle 8 double-precision or 16 single-precision operations at once. schedule(dynamic,x) is an OpenMP thing - all vectorization will happen within the OpenMP parallel region, but not across regions.

Charles

jimdempseyatthecove · ‎02-09-2016

Masrul,

Here is a first cut at vectorization:

!  isprime_vector.f90 
module mod_isprime_vector
    integer::min_limit,max_limit
    integer::count,nthreads
    double precision::starttime,finishtime
    
    contains
    
    logical function isprime(num)
        implicit none
        !$omp declare target
        integer,intent(in)::num
        integer::i,iteration
        isprime=.true.
        if (num/=1 .and. num/=2)then
            iteration=ceiling(sqrt(real(num)))
            do i=2,iteration
                if (mod(num,i)==0)then
                    isprime=.false.
                    exit
                end if
            end do
        elseif (num==2)then
            isprime=.true.
        elseif (num==1)then
            isprime=.false.
        end if
    end function isprime
    
    logical function isprime_vector(num)
        implicit none
        !$omp declare target
        integer,intent(in)::num
        integer::i,j,j_limit,iteration
        integer, parameter :: element_size = sizeof(num)
        integer, parameter :: cache_line_size = 64
        integer, parameter :: vector_length = cache_line_size / element_size
        !DIR$ ATTRIBUTES ALIGN : 64 :: vector_num, vector_i, vector_mod
        integer, dimension(vector_length) :: vector_num, vector_i, vector_i_advance, vector_mod
        
        isprime_vector = .false.
        if(num > 2) then
            if(mod(num,2) == 0) return
            vector_num = num
            do i=1,vector_length
                vector_i(i) = i*2+1 ! 3,5,7,...
                vector_i_advance(i) = vector_length * 2
            end do
        
            iteration=max(ceiling(sqrt(real(num))),3)
            do i=3,iteration,2*vector_length
                vector_mod = (vector_num / vector_i) * vector_i
                vector_mod = vector_num - vector_mod
                j_limit = min((num - vector_i(1))/2, vector_length)
                do j=1,j_limit
                    if(vector_mod(j) .eq. 0) return
                end do
                vector_i = vector_i + vector_i_advance
            end do
            isprime_vector = .true.
        else
            isprime_vector = (num == 2)
        endif
    end function isprime_vector

end module mod_isprime_vector
    
program prog_isprime_vector
    use mod_isprime_vector
    implicit none
    min_limit=1
    max_limit=100000000
        
    print *,'max_limit',max_limit
    print *,'prime_count'
    call prime_count
    print *
    print *,'prime_count_vector'
    call prime_count_vector
    print *
    
end program prog_isprime_vector

subroutine prime_count
    use mod_isprime_vector
    use omp_lib
    implicit none
    integer::i, prime
    
    count=0
    !*The following is not transferring data into the target
    !*!$omp target device(0) map(to : min_limit,max_limit) map(tofrom:count ) map(from : nthreads, starttime, finishtime)
    !
    !* Instead, had to use target update to transfer in the values
    !$omp target update device(0) to(min_limit,max_limit, count)
    !* Then target the code to the device
    !$omp target device(0)
    
    !$omp parallel private(i,prime) default(shared) reduction(+:count)

    !$omp master
    starttime=omp_get_wtime()
    nthreads=omp_get_num_threads()
    !$omp end master

    !$omp do simd reduction(+:count) schedule(dynamic)
    do i=min_limit,max_limit
        prime=isprime(i)
        if(prime)then
            count=count+1
        end if
    end do
    !$omp end do simd 

    !$omp master
    finishtime=omp_get_wtime()
    !$omp end master

    !$omp end parallel 
    !$omp end target
    !* Added target update to extract data from device
    !$omp target update device(0) from(nthreads,finishtime,starttime,count)
    print*,"Total threads:",nthreads
    print*,"Total prime number found: ",count
    print*,"Total time elapsed: ",(finishtime-starttime)

    end subroutine prime_count

    
subroutine prime_count_vector
    use mod_isprime_vector
    use omp_lib
    implicit none
    integer::i, j, prime
    integer, parameter :: vector_length = 256   ! make multiple of cachline
    !DIR$ ATTRIBUTES ALIGN : 64 :: vector
    integer :: vector(vector_length)
    
    count=0
    !*The following is not transferring data into the target
    !*!$omp target device(0) map(to : min_limit,max_limit) map(tofrom:count ) map(from : nthreads, starttime, finishtime)
    !
    !* Instead, had to use target update to transfer in the values
    !$omp target update device(0) to(min_limit,max_limit, count)
    !* Then target the code to the device
    !$omp target device(0)
    
    !$omp parallel private(i,j,vector,prime) default(shared)

    !$omp master
    print *,'master',min_limit,max_limit
    starttime=omp_get_wtime()
    nthreads=omp_get_num_threads()
    !$omp end master

    !$omp do schedule(dynamic,1) reduction(+:count)
    do i=min_limit,max_limit
        prime=isprime_vector(i)
        if(prime)then
            count=count+1
        end if
    end do
    !$omp end do 

    !$omp master
    finishtime=omp_get_wtime()
    !$omp end master

    !$omp end parallel 
    !$omp end target
    !* Added target update to extract data from device
    !$omp target update device(0) from(nthreads,finishtime,starttime,count)
    print*,"Total threads:",nthreads
    print*,"Total prime number found: ",count
    print*,"Total time elapsed: ",(finishtime-starttime)

  end subroutine prime_count_vector

And results (Xeon Phi 5110P)

 max_limit   100000000
 prime_count
 Total threads:         236
 Total prime number found:      5761455
 Total time elapsed:    46.4514100551605

 prime_count_vector
 master           1   100000000
 Total threads:         236
 Total prime number found:      5761455
 Total time elapsed:    6.94675087928772

********** NOTES ***************

I am using IVF 16.0 update 1 on Windows 7 Pro x64

Look in the source code above for comments containing "!*"

Some of the !$omp target clauses are not working as commented.
I had to substitute !$omp target update in places.

If I double the vector_length, the results take: 6.97 seconds (within margin of error)
Quadruple: 7.23 seconds
Halve: 8.31 seconds

Selecting vector_size to that which fits in one cache line provides for the better performance.

Jim Dempsey

jimdempseyatthecove · ‎02-09-2016

FWIW

                !dec$ if(.false.)
                do j=1,j_limit
                    if(vector_mod(j) .eq. 0) return
                end do
                !dec$ else
                if(any(vector_mod(1:j_limit) == 0)) return
                !dec$ endif

Using ANY is slower at 9.53 seconds

Jim Dempsey

Masrul · ‎02-09-2016

Jim,

Thank you very much for your painstaking effort for me. I will go through your code and will try to make sense how things work.

--Masrul