- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Dear All:
As learning purpose, i tried to code a program which find total number prime number for a given range. isprime function finds if a number is prime or not. I added !$omp declare simd to vectorize that function. I do not know why, program perform three times slower in intel phi than host.
Info:
Host: 16 sec
MIC: 43 sec
program prime_count use omp_lib implicit none integer::min_limit,max_limit integer::count,nthreads integer::i logical::prime double precision::starttime,finishtime count=0 min_limit=1 max_limit=100000000 !$omp target device(0) map(to : min_limit,max_limit) map(tofrom:count ) !$omp parallel private(i,prime) default(shared) !$omp master starttime=omp_get_wtime() nthreads=omp_get_num_threads() !$omp end master !$omp do simd reduction(+:count) schedule(dynamic) do i=min_limit,max_limit prime=isprime(i) if(prime)then count=count+1 end if end do !$omp end do simd !$omp master finishtime=omp_get_wtime() !$omp end master !$omp end parallel !$omp end target print*,"Total threads:",nthreads print*,"Total prime number found: ",count print*,"Total time elapsed: ",(finishtime-starttime) contains logical function isprime(num) implicit none !$omp declare target !$omp declare simd(isprime) integer,intent(in)::num integer::i,iteration isprime=.true. if (num/=1 .and. num/=2)then iteration=ceiling(sqrt(real(num))) do i=2,iteration if (mod(num,i)==0)then isprime=.false. exit end if end do elseif (num==2)then isprime=.true. elseif (num==1)then isprime=.false. end if end function isprime end program prime_count
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
If your case can't take full advantage of proportionally more threads as well as vectorization on MIC, such an observation about performance is to be expected.
Among smaller tweaks you might investigate would be changing chunk size e.g. (dynamic,2) or schedule(auto), as well as optimizing number of threads. MIC and host tend to require different choices.
Another would be whether you can get by with half precision sqrt, as sqrt is inefficient on MIC.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I would also suggest making a specialized isprimeGT2 function where it is known that the input arg is .gt. 2.
Additionally, because you are using the known characteristic of primes testing of limiting the mod test to up to the square root of the number, one would expect that you also would use the knowledge that the primes are odd numbers. IOW start at first odd number at or above min_limit.
begin_limit = IOR(max(min_limit,3), 1) ! first odd number .ge. max(min_limit,3) ... do i=begin_limit,max_limit,2 ! only test odd numbers .gt. 2
Then, with the foreknowledge that you are testing only odd numbers, change your do i=2,iteration to do i=3,iteration,2
IOW test only mod's of odd numbers.
As for vectorization, you should then consider adding a vector counter, which I will let you ponder as to how to write using the following hints:
partition the iteration space (noting you are doing every odd number), such that each thread, excepting the last thread has an even multiple of vector width number of odd numbers.
Initialize a vector (example n * 16) with the next sequence of odd numbers (some small but effective number that is a multiple of vector width).
Next, perform your do i=3,iteration,2 loop producing an output vector containing the mod(tempVector, i)
Next count the 0's in the output vector.
The above should provide you with sufficient hints to efficiently solve your problem.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Tim, thanks for suggestions. Jim, i was really looking for the second part of your reply regarding vectorization. I have a query, in 512 bit vector processor, it can handle 8 DP or 16 SP at a time, I would like know, how many function call can be managed by 512 bit vector processor for my case? Another question: If 'x' number of functions call can be handle by vertor processor, then will schedule(dynamic,x) do proper vectorization?
--Masrul
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Masrul
Yes, the 512-bit vector unit can handle 8 double-precision or 16 single-precision operations at once. schedule(dynamic,x) is an OpenMP thing - all vectorization will happen within the OpenMP parallel region, but not across regions.
Charles
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Masrul,
Here is a first cut at vectorization:
! isprime_vector.f90 module mod_isprime_vector integer::min_limit,max_limit integer::count,nthreads double precision::starttime,finishtime contains logical function isprime(num) implicit none !$omp declare target integer,intent(in)::num integer::i,iteration isprime=.true. if (num/=1 .and. num/=2)then iteration=ceiling(sqrt(real(num))) do i=2,iteration if (mod(num,i)==0)then isprime=.false. exit end if end do elseif (num==2)then isprime=.true. elseif (num==1)then isprime=.false. end if end function isprime logical function isprime_vector(num) implicit none !$omp declare target integer,intent(in)::num integer::i,j,j_limit,iteration integer, parameter :: element_size = sizeof(num) integer, parameter :: cache_line_size = 64 integer, parameter :: vector_length = cache_line_size / element_size !DIR$ ATTRIBUTES ALIGN : 64 :: vector_num, vector_i, vector_mod integer, dimension(vector_length) :: vector_num, vector_i, vector_i_advance, vector_mod isprime_vector = .false. if(num > 2) then if(mod(num,2) == 0) return vector_num = num do i=1,vector_length vector_i(i) = i*2+1 ! 3,5,7,... vector_i_advance(i) = vector_length * 2 end do iteration=max(ceiling(sqrt(real(num))),3) do i=3,iteration,2*vector_length vector_mod = (vector_num / vector_i) * vector_i vector_mod = vector_num - vector_mod j_limit = min((num - vector_i(1))/2, vector_length) do j=1,j_limit if(vector_mod(j) .eq. 0) return end do vector_i = vector_i + vector_i_advance end do isprime_vector = .true. else isprime_vector = (num == 2) endif end function isprime_vector end module mod_isprime_vector program prog_isprime_vector use mod_isprime_vector implicit none min_limit=1 max_limit=100000000 print *,'max_limit',max_limit print *,'prime_count' call prime_count print * print *,'prime_count_vector' call prime_count_vector print * end program prog_isprime_vector subroutine prime_count use mod_isprime_vector use omp_lib implicit none integer::i, prime count=0 !*The following is not transferring data into the target !*!$omp target device(0) map(to : min_limit,max_limit) map(tofrom:count ) map(from : nthreads, starttime, finishtime) ! !* Instead, had to use target update to transfer in the values !$omp target update device(0) to(min_limit,max_limit, count) !* Then target the code to the device !$omp target device(0) !$omp parallel private(i,prime) default(shared) reduction(+:count) !$omp master starttime=omp_get_wtime() nthreads=omp_get_num_threads() !$omp end master !$omp do simd reduction(+:count) schedule(dynamic) do i=min_limit,max_limit prime=isprime(i) if(prime)then count=count+1 end if end do !$omp end do simd !$omp master finishtime=omp_get_wtime() !$omp end master !$omp end parallel !$omp end target !* Added target update to extract data from device !$omp target update device(0) from(nthreads,finishtime,starttime,count) print*,"Total threads:",nthreads print*,"Total prime number found: ",count print*,"Total time elapsed: ",(finishtime-starttime) end subroutine prime_count subroutine prime_count_vector use mod_isprime_vector use omp_lib implicit none integer::i, j, prime integer, parameter :: vector_length = 256 ! make multiple of cachline !DIR$ ATTRIBUTES ALIGN : 64 :: vector integer :: vector(vector_length) count=0 !*The following is not transferring data into the target !*!$omp target device(0) map(to : min_limit,max_limit) map(tofrom:count ) map(from : nthreads, starttime, finishtime) ! !* Instead, had to use target update to transfer in the values !$omp target update device(0) to(min_limit,max_limit, count) !* Then target the code to the device !$omp target device(0) !$omp parallel private(i,j,vector,prime) default(shared) !$omp master print *,'master',min_limit,max_limit starttime=omp_get_wtime() nthreads=omp_get_num_threads() !$omp end master !$omp do schedule(dynamic,1) reduction(+:count) do i=min_limit,max_limit prime=isprime_vector(i) if(prime)then count=count+1 end if end do !$omp end do !$omp master finishtime=omp_get_wtime() !$omp end master !$omp end parallel !$omp end target !* Added target update to extract data from device !$omp target update device(0) from(nthreads,finishtime,starttime,count) print*,"Total threads:",nthreads print*,"Total prime number found: ",count print*,"Total time elapsed: ",(finishtime-starttime) end subroutine prime_count_vector
And results (Xeon Phi 5110P)
max_limit 100000000 prime_count Total threads: 236 Total prime number found: 5761455 Total time elapsed: 46.4514100551605 prime_count_vector master 1 100000000 Total threads: 236 Total prime number found: 5761455 Total time elapsed: 6.94675087928772
********** NOTES ***************
I am using IVF 16.0 update 1 on Windows 7 Pro x64
Look in the source code above for comments containing "!*"
Some of the !$omp target clauses are not working as commented.
I had to substitute !$omp target update in places.
If I double the vector_length, the results take: 6.97 seconds (within margin of error)
Quadruple: 7.23 seconds
Halve: 8.31 seconds
Selecting vector_size to that which fits in one cache line provides for the better performance.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
FWIW
!dec$ if(.false.) do j=1,j_limit if(vector_mod(j) .eq. 0) return end do !dec$ else if(any(vector_mod(1:j_limit) == 0)) return !dec$ endif
Using ANY is slower at 9.53 seconds
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Jim,
Thank you very much for your painstaking effort for me. I will go through your code and will try to make sense how things work.
--Masrul

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page