Dear Intel developers,
I need to vectorize the following code using Intel CS 2013:
subroutine mysubroutine(n, q)
   integer(long), intent(IN) :: n
   real(stnd), dimension(base_dim), intent(OUT) :: q
   integer(long) :: nk
   integer(long) :: sk, bk
   integer(long) :: npow
   real(stnd) :: x
   integer(long) :: i, j

   q = 0.0
   nk = 0
   do i = 1, base_dim
      x = logarithm( real(n + 1), real(base(i)) )
      npow = floor(x)
      sk = n
      do j = npow, 0, -1
         bk = base(i)**j
         nk = floor( real(sk) / real(bk) )
         sk = sk - nk * bk
         q(i) = q(i) + real(nk) / real(bk * base(i))
      end do
   end do
end subroutine mysubroutine
The compiler reports ANTI and FLOW dependences on sk, and ANTI and FLOW dependences on q.
Could you help me vectorize the inner loop? In particular, I have no idea how to resolve the sk dependence. Thanks in advance.
sk is written with sequential dependence (the value for the next iteration depends on the current one).
The same holds for q(i).
Depending on the parts you have removed, the compiler may prefer to optimize the outer loop. If base_dim were large enough, and you used consistently typed reals and integers, it might vectorize portions of the inner loop by interchanging the loops, so that a group of i values is processed in parallel SIMD lanes.
Your logarithm function would apparently need to be in a form that can be written inline in terms of standard math intrinsics.
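As a rough illustration of the last two points, here is a minimal sketch of the npow computation written with consistent kinds and only the standard log intrinsic. It assumes that logarithm(x, b) in the original returns the logarithm of x to base b, which the post does not show; the kinds long and stnd and the name npow_of are placeholders as well.

```fortran
module inline_log_sketch
   implicit none
   ! Assumed kinds; the original post does not show how long and stnd are defined.
   integer, parameter :: long = selected_int_kind(18)
   integer, parameter :: stnd = kind(1.0d0)
contains
   ! Hypothetical helper: floor(log_b(n + 1)) using only standard intrinsics.
   ! Every conversion names its kind, so no mixed-precision arithmetic remains.
   elemental function npow_of(n, b) result(npow)
      integer(long), intent(in) :: n, b
      integer(long) :: npow
      npow = floor(log(real(n + 1, stnd)) / log(real(b, stnd)), kind=long)
   end function npow_of
end module inline_log_sketch
```

Because the function is elemental, npow_of(n, base) can be applied to the whole base array in one expression. Note that when n + 1 is close to an exact power of b, the quotient of the two logarithms can round to either side of an integer; in the original loop an overestimate only costs one extra iteration with nk = 0.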
Is base_dim large enough for you to use a parallel loop?
Jim Dempsey
400 is large enough to justify vectorization, although it may be marginal on the MIC.
Post an example that can be compiled, and possibly try a current compiler. The compiler I was trying appears to distribute the outer loop inside the inner one in order to attempt that sort of vectorization.
I'd hope you were familiar enough with your algorithm to have your own ideas about how to interchange loops explicitly.
Tim,
I do not see how he can get vectorization of the inner loop, since each lane of the vector would potentially (likely) have a different trip count.
That said, if he convoluted the inner loop (or added additional loop nesting), he could potentially run all lanes of the vector, provided j can run into negative values .AND. a negative j presents a 0.0 to the summation (with no flow changes in the code).
This may be too hard to figure out, and it will rely on the compiler to make sense of the source code. This may be one of the cases where you hand write the code using intrinsics in C++.
Jim Dempsey
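One possible reading of the suggestions above (interchange the loops and let lanes that have nothing to do add nothing) is sketched below. It is only a sketch under assumptions: the kinds long and stnd, base_dim = 4 and the contents of base are made-up stand-ins for definitions the original post does not show, the name mysubroutine_interchanged is hypothetical, and an integer loop replaces the logarithm call. It uses a masked update rather than a literally negative j.

```fortran
module interchange_sketch
   implicit none
   ! Assumed kinds and data; the original post does not show these definitions.
   integer, parameter :: long = selected_int_kind(18)
   integer, parameter :: stnd = kind(1.0d0)
   integer, parameter :: base_dim = 4
   integer(long), parameter :: base(base_dim) = [2_long, 3_long, 5_long, 7_long]
contains
   subroutine mysubroutine_interchanged(n, q)
      integer(long), intent(in) :: n
      real(stnd), intent(out) :: q(base_dim)
      integer(long) :: sk(base_dim), bk(base_dim), nk
      integer :: npow(base_dim), i, j, jmax

      ! Highest digit position per base, found with integer arithmetic
      ! (stands in for the floor(logarithm(...)) call of the original).
      do i = 1, base_dim
         npow(i) = 0
         bk(i) = 1_long
         do while (bk(i) <= n / base(i))
            bk(i) = bk(i) * base(i)
            npow(i) = npow(i) + 1
         end do
      end do

      q = 0.0_stnd
      sk = n
      jmax = maxval(npow)

      ! Interchanged nest: j outside, i inside. Each i only touches its own
      ! sk(i), bk(i) and q(i), so the recurrences run along j, not along i.
      do j = jmax, 0, -1
         do i = 1, base_dim
            if (j <= npow(i)) then        ! lanes that have not started add nothing
               nk = sk(i) / bk(i)         ! integer division replaces floor(real/real)
               sk(i) = sk(i) - nk * bk(i)
               q(i) = q(i) + real(nk, stnd) / (real(bk(i), stnd) * real(base(i), stnd))
               bk(i) = bk(i) / base(i)    ! next lower power of base(i)
            end if
         end do
      end do
   end subroutine mysubroutine_interchanged
end module interchange_sketch
```

For n = 5 this gives q = 0.625, 7/9, 0.04 and 5/7 for bases 2, 3, 5 and 7. Whether a particular compiler actually vectorizes the inner i loop (the masked update and the integer division can both get in the way) is best checked in its vectorization report.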
What is the average value of npow?
IOW, what is the average trip count of the DO J loop?
If this is large enough, then making DO I= parallel might be worthwhile.
If npow is statistically small, you might be able to pre-compute the results and store into a multi-dimensioned array. Then replace the computation with an index calculation and just fetch the correct result.
Jim Dempsey
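A modest variant of the pre-computation idea, tabulating the powers and weights rather than the final results, might look like the sketch below. Everything named here is an assumption for illustration: the kinds, base_dim, the contents of base, the cap max_pow, and the names pw, wt, build_tables and mysubroutine_tabulated are not from the original post. The sketch is only valid while n < minval(base)**(max_pow + 1).

```fortran
module table_sketch
   implicit none
   ! Assumed kinds and data; the original post does not show these definitions.
   integer, parameter :: long = selected_int_kind(18)
   integer, parameter :: stnd = kind(1.0d0)
   integer, parameter :: base_dim = 4
   integer, parameter :: max_pow = 12            ! hypothetical cap on npow
   integer(long), parameter :: base(base_dim) = [2_long, 3_long, 5_long, 7_long]
   integer(long) :: pw(0:max_pow, base_dim)      ! pw(j,i) = base(i)**j
   real(stnd)    :: wt(0:max_pow, base_dim)      ! wt(j,i) = 1 / base(i)**(j+1)
contains
   ! Fill both tables once, before the first call to mysubroutine_tabulated.
   subroutine build_tables()
      integer :: i, j
      do i = 1, base_dim
         pw(0, i) = 1_long
         wt(0, i) = 1.0_stnd / real(base(i), stnd)
         do j = 1, max_pow
            pw(j, i) = pw(j - 1, i) * base(i)
            wt(j, i) = wt(j - 1, i) / real(base(i), stnd)
         end do
      end do
   end subroutine build_tables

   ! Same digit expansion as the original, with table lookups instead of
   ! base(i)**j and a fixed trip count for every base.
   subroutine mysubroutine_tabulated(n, q)
      integer(long), intent(in) :: n
      real(stnd), intent(out) :: q(base_dim)
      integer(long) :: sk, nk
      integer :: i, j
      do i = 1, base_dim
         sk = n
         q(i) = 0.0_stnd
         do j = max_pow, 0, -1
            nk = sk / pw(j, i)                   ! is 0 while pw(j,i) > sk
            sk = sk - nk * pw(j, i)
            q(i) = q(i) + real(nk, stnd) * wt(j, i)
         end do
      end do
   end subroutine mysubroutine_tabulated
end module table_sketch
```

The fixed trip count costs some wasted iterations for the larger bases, but it removes the exponentiation and the logarithm from the hot loop and gives every i the same amount of work.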
The average value of npow is 8.
