Hello everyone,
I am writing code for an n-body simulation that I would like to run on a Xeon Phi. The subroutine I am working on calculates the energy of the system via a pairwise summation. The O(n^2) algorithm looks like this: it is simply a double sum over all particle pairs, with np = number of particles. If the squared distance between two particles is less than rcut2, their pair energy is added.
subroutine nearest_int
  implicit none
  double precision :: dx,dy,dz
  double precision :: x1,y1,z1
  double precision :: x2,y2,z2
  double precision :: dr2,dr2i,dr6i,dr12i
  integer :: i,j
  integer :: T1,T2,clock_rate,clock_max

  potential = 0.0d0
  call system_clock(T1,clock_rate,clock_max)
!dir$ offload begin target(mic:0) in(position)
!$omp parallel do schedule(dynamic) reduction(+:potential) default(private),&
!$omp& shared(position,rcut2,np)
  do i = 1,np
    x1 = position(i)%x; y1 = position(i)%y; z1 = position(i)%z
!dir$ simd reduction(+:potential)
    do j = i+1,np
      x2 = position(j)%x; y2 = position(j)%y; z2 = position(j)%z
      dx = x2-x1
      dy = y2-y1
      dz = z2-z1
      dr2 = dx*dx + dy*dy + dz*dz
      if(dr2.lt.rcut2)then
        dr2i = 1.0d0/dr2
        dr6i = dr2i*dr2i*dr2i
        dr12i = dr6i*dr6i
        potential = potential + 4.0d0*(dr12i-dr6i)
      endif
    enddo
  enddo
!$omp end parallel do
!dir$ end offload
  call system_clock(T2,clock_rate,clock_max)
  print*,'elapsed time nint:',real(T2-T1)/real(clock_rate),potential
end subroutine nearest_int
Here, position is an array of structures as follows
type atom
  double precision :: x,y,z
end type atom
type(atom), allocatable :: position(:)
Now when I run my code using the O(n^2) algorithm, the vectorization intensity is 6.51, which is good given that gather/scatter operations are being applied. Screenshots of the VTune summary are attached below as n2-1.png and n2-2.png. Since O(n^2) scales poorly to larger systems, an O(N) algorithm is preferred. To get there, we store in an array the neighbors of each particle (those within 1.2*rcut, to be exact) along with how many neighbors each particle has. The O(n^2) algorithm then transforms into the following, which uses indices into the position array.
subroutine nearest_int
  implicit none
  double precision :: dx,dy,dz
  double precision :: x1,y1,z1
  double precision :: x2,y2,z2
  double precision :: dr2,dr2i,dr6i,dr12i
  integer :: i,j
  integer :: T1,T2,clock_rate,clock_max
  integer :: neigh

  potential = 0.0d0
  call system_clock(T1,clock_rate,clock_max)
!dir$ offload begin target(mic:0) in(position,vlistl,numneigh)
!$omp parallel do schedule(dynamic) reduction(+:potential) default(private),&
!$omp& shared(position,neigh_alloc,vlistl,numneigh,rcut2,np)
  do i = 1,np
    x1 = position(i)%x; y1 = position(i)%y; z1 = position(i)%z
!dir$ simd reduction(+:potential)
    do j = 1,numneigh(i)
      neigh = vlistl(j + neigh_alloc*(i-1))
      x2 = position(neigh)%x; y2 = position(neigh)%y; z2 = position(neigh)%z
      dx = x2-x1
      dy = y2-y1
      dz = z2-z1
      dr2 = dx*dx + dy*dy + dz*dz
      if(dr2.lt.rcut2)then
        dr2i = 1.0d0/dr2
        dr6i = dr2i*dr2i*dr2i
        dr12i = dr6i*dr6i
        potential = potential + 4.0d0*(dr12i-dr6i)
      endif
    enddo
  enddo
!$omp end parallel do
!dir$ end offload
  call system_clock(T2,clock_rate,clock_max)
  print*,'elapsed time nint:',real(T2-T1)/real(clock_rate),potential
end subroutine nearest_int
In my code I allocated vlistl as follows
neigh_alloc = 500
allocate(vlistl(500*np))
Now, although -vec-report6 is telling me that the inner loop is indeed vectorized, I get a vectorization intensity of zero and horrible performance (barely beats serial, although that would make sense if the vectorization intensity is zero). The screenshots from the VTune analysis are given below in n-1.png and n-2.png. Here are my questions:
1. Why am I getting a vectorization intensity of zero inside a vectorized loop, and what can I do to improve my performance here?
2. If I can get this code to work on a MIC, I know to expect latency issues at line 27 (neigh = vlistl(j + neigh_alloc*(i-1))) in the O(N) algorithm. I would like to prefetch here. I know that the gather can bring in up to 8 pieces of data (I'm working in DP), on 16 cache lines. Can someone tell me the appropriate way to prefetch here to help mask the latency? I have fiddled with placing the following loop immediately after line 27, but it didn't change the performance:
do k = 0, 15
  call mm_prefetch(position(vlistl(j + neigh_alloc*(i-1) + k + 8))%x, 1)
enddo
3. I compiled with the -align array64byte flag, so I believe all arrays should be aligned on 64-byte boundaries. Does this mean that the arrays are also padded to a multiple of the cache line size? If not, how would I do this?
The simd pragma is supposedly getting the loop to vectorize. I sort my particles so that particles that are close in space are close in memory. I know the AOS structure isn't as good as SOA for vectorization; however, I was still getting good performance using AOS in the O(n^2) algorithm. Also, I know this is the data structure Intel has implemented in various software packages employing this algorithm (LAMMPS for example, although that is in C++ and not Fortran). The full code compiles with the command below (modules are attached). The subroutine of interest is the sole subroutine in module mod_force.f90. The numneigh and vlistl arrays are created in subroutine build_neighbor_n2 in mod_neighbor.f90 and are allocated in subroutine init_list in mod_neighbor.f90. All arrays are defined as globals in the module global.f90.
ifort -align array64byte -openmp global.f90 get_started.f90 mod_init_posit.f90 mod_neighbor.f90 mod_force.f90 MD.f90 -O3 -o new.out
Sorry for the long question/description. But I wanted to give a decent account of what I have tried already. Any help is appreciated.
Fair enough. Do you have any suggestions for my bigger problem of poor performance relative to serial execution? I modified the code for the MIC so that I allocate only once before timing, and the subroutine took 0.006 s on average. Serial execution takes about 0.038 s, giving me roughly a 7x speedup over serial. This leaves much to be desired, given the amount of effort I have put into this thing. Vectorization intensity and memory latency are the issues here, I am assuming. What is the vectorization intensity of zero telling me then? Also, do you have any suggestions for the proper prefetch method here? I have attached snapshots of the line-by-line code timing with and without prefetch. With my current prefetching strategy, the code runs on average in 6.5e-3 s vs. 6.0e-3 s without prefetching. Since memory latency appears to be one of the bottlenecks, I would expect some sort of prefetching to be beneficial. The book I am reading, "Intel Xeon Phi Coprocessor Architecture and Tools", says of this algorithm, without going into detail, that a big performance gain may be had using prefetch, and that "data structures such as position can be aligned to cache line boundaries and padded to multiples of the cache line size for performance gains." The prefetch seems to be performing worse, and I am not sure whether -align array64byte does the padding, or what I need to do to achieve it.
Your basic problem with vectorization is that your current data structure is organized as an Array Of Structures (type atom). Reorganizing into Structure Of Arrays format will facilitate vectorization.
type AtomCollection
  real, allocatable :: x(:)  ! allocated to nParticles
  real, allocatable :: y(:)
  real, allocatable :: z(:)
  ...                        ! remaining properties
end type AtomCollection
A second issue fighting with vectorization is that (as a means of reducing the number of operations) you are using the array vlistl, which appears to contain a vector of indices representing the neighbors of interest (correct me if I am wrong on this). This forces the compiler to perform a gather operation, assuming it can vectorize the code at all rather than running the inner loop in scalar mode. A better implementation would be to create a series of lists (index, x, y, z, ...), then use those. The compiler could then vectorize this. While you have 4x the data to create (over storing the index alone), you will also consume 8x fewer iterations in your inner loop (processing an 8-wide list of doubles).
The main stumbling block is that you are fixated on organizing the implementation the same way you view your abstraction. What you should be doing is viewing the "abstraction" solely as an abstraction, and not as an implementation.
Jim Dempsey
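Jim's packed-list suggestion can be sketched in C as follows (hypothetical names; the thread's code is Fortran, so this illustrates the layout only, not the author's implementation). Instead of storing only neighbor indices and gathering coordinates through them, the build step copies each neighbor's coordinates into contiguous per-particle arrays, so the inner energy loop becomes unit-stride:

```c
/* Packed neighbor list: for particle i, slot k holds neighbor k's index and
   a copy of its coordinates.  The energy loop then streams nx/ny/nz with
   unit stride instead of gathering through an index array. */
typedef struct {
    int    *idx;            /* neighbor indices (kept for list updates)   */
    double *nx, *ny, *nz;   /* packed copies of neighbor coordinates      */
    int    *numneigh;
    int     neigh_alloc;
} PackedList;

/* Refresh the packed coordinates from the master arrays (done once per
   step, amortized over the inner loop's many reads). */
void pack_coords(int np, const double *x, const double *y, const double *z,
                 PackedList *L)
{
    for (int i = 0; i < np; i++)
        for (int k = 0; k < L->numneigh[i]; k++) {
            int j = L->idx[i * L->neigh_alloc + k];
            L->nx[i * L->neigh_alloc + k] = x[j];
            L->ny[i * L->neigh_alloc + k] = y[j];
            L->nz[i * L->neigh_alloc + k] = z[j];
        }
}

double energy_packed(int np, double rcut2,
                     const double *x, const double *y, const double *z,
                     const PackedList *L)
{
    double pot = 0.0;
    for (int i = 0; i < np; i++) {
        int base = i * L->neigh_alloc;
        for (int k = 0; k < L->numneigh[i]; k++) {  /* unit-stride loads */
            double dx = L->nx[base + k] - x[i];
            double dy = L->ny[base + k] - y[i];
            double dz = L->nz[base + k] - z[i];
            double dr2 = dx*dx + dy*dy + dz*dz;
            if (dr2 < rcut2) {
                double dr2i = 1.0/dr2, dr6i = dr2i*dr2i*dr2i;
                pot += 4.0 * (dr6i*dr6i - dr6i);
            }
        }
    }
    return pot;
}
```

The trade-off Jim describes is visible here: pack_coords writes 4x the data per neighbor, but energy_packed needs no gather at all.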
Remember that the gather instruction on the current MIC architecture requires additional cycles for each cache line involved in the access. If your objective goes beyond achieving a vectorization report, and your inner loop is long enough to benefit from vectorization with alignment, you would consider packing those x,y,z components in linear arrays, as Jim mentioned.
If there is a problem with compiler generated prefetch in your current version, the "structure of arrays" might solve it without your having to look into it. As you say, your scheme of packing with memory locality may compensate for problems with prefetch. If you were interested in that, you might examine your VTune result hoping to see where missing prefetch might be impacting it, and whether your undisclosed prefetch usage is helping it.
Are you looking into whether the compiler has optimized your expression (dr12i-dr6i) ? I think the Fortran rules would allow optimizations such as dr6i*(dr6i-1), but I don't know why you would depend on that.
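The factorization Tim mentions replaces dr12i - dr6i with dr6i*(dr6i - 1); the two forms are algebraically identical, since dr12i is just dr6i squared. A quick C check of the algebra (values are arbitrary; this only illustrates the rewrite, not the compiler's behavior):

```c
/* Two algebraically equivalent forms of the Lennard-Jones pair energy.
   dr12i - dr6i == dr6i*(dr6i - 1) exactly in real arithmetic; in floating
   point they can differ by rounding in the last place only. */
double lj_expanded(double dr2) {
    double dr2i = 1.0 / dr2, dr6i = dr2i * dr2i * dr2i;
    double dr12i = dr6i * dr6i;
    return 4.0 * (dr12i - dr6i);
}

double lj_factored(double dr2) {
    double dr2i = 1.0 / dr2, dr6i = dr2i * dr2i * dr2i;
    return 4.0 * (dr6i * (dr6i - 1.0));
}
```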
As always, thank you for your help and responses. One of the reasons I initially elected AOS over SOA is that Intel still does not allow transferring structures that contain arrays to the Xeon Phi. By that I mean, when I try the following
type atomCollec
  double precision :: x(100000)
  double precision :: y(100000)
  double precision :: z(100000)
end type atomCollec
!dir$ attributes offload:mic :: SOAposit
type(atomCollec) :: SOAposit
!dir$ offload_transfer target(mic:0) in(SOAposit: alloc_if(.true.) free_if(.false.))
I get an error saying this type of data is not transferable. However, I can do it if I use the following
double precision :: x(100000)
double precision :: y(100000)
double precision :: z(100000)
I am not 100% clear what the difference between the SOA representation and the three separate arrays representation is. However, I switched to this and ran vtune, and here is the summary:
time: 0.00609 s
CPI: 13.341
vectorization intensity: 0.0
latency impact: 10655
I have checked a million times to make sure that I have the vectorization option checked in VTune. The poor vectorization aside, as you can see, the latency impact is extremely high, suggesting that the prefetching done by the compiler is not sufficient even with the SOA representation. The pattern
neigh = vlistl(j+neigh_alloc*(i-1))
x2 = x(neigh); y2 = y(neigh); z2 = z(neigh)
must not be getting picked up by the compiler. SHOC, an open-source code which uses this type of algorithm on a MIC, uses the original AOS structure and prefetches as follows
_mm_prefetch((char*)&position[neighList[i * maxNeighbors + j + 0 + 16]], 1);
_mm_prefetch((char*)&position[neighList[i * maxNeighbors + j + 1 + 16]], 1);
_mm_prefetch((char*)&position[neighList[i * maxNeighbors + j + 2 + 16]], 1);
_mm_prefetch((char*)&position[neighList[i * maxNeighbors + j + 3 + 16]], 1);
_mm_prefetch((char*)&position[neighList[i * maxNeighbors + j + 4 + 16]], 1);
_mm_prefetch((char*)&position[neighList[i * maxNeighbors + j + 5 + 16]], 1);
_mm_prefetch((char*)&position[neighList[i * maxNeighbors + j + 6 + 16]], 1);
_mm_prefetch((char*)&position[neighList[i * maxNeighbors + j + 7 + 16]], 1);
_mm_prefetch((char*)&position[neighList[i * maxNeighbors + j + 8 + 16]], 1);
_mm_prefetch((char*)&position[neighList[i * maxNeighbors + j + 9 + 16]], 1);
_mm_prefetch((char*)&position[neighList[i * maxNeighbors + j + 10 + 16]], 1);
_mm_prefetch((char*)&position[neighList[i * maxNeighbors + j + 11 + 16]], 1);
_mm_prefetch((char*)&position[neighList[i * maxNeighbors + j + 12 + 16]], 1);
_mm_prefetch((char*)&position[neighList[i * maxNeighbors + j + 13 + 16]], 1);
_mm_prefetch((char*)&position[neighList[i * maxNeighbors + j + 14 + 16]], 1);
_mm_prefetch((char*)&position[neighList[i * maxNeighbors + j + 15 + 16]], 1);
where their neighList is my vlistl. I originally tried this, but the compiler (I am using Fortran) would only compile if I did the following (note how I have the %x specified). Obviously, this prefetch would be done sixteen times, like above.
call mm_prefetch(position(vlistl((i-1) * neigh_alloc + j + 0 + 16))%x, 1)
This resulted in worse performance. I don't know if my having to specify the %x behaves differently than its C++ counterpart in SHOC. I am also not quite sure how to partition the prefetch calls between the three separate x, y, z arrays in the SOA case.
Unfortunately, the alternate algorithm option suggested by Tim wouldn't quite work, because it would require additional maintenance of the array in another section of the code, which would amount to the same total work as in this subroutine.
The lack of vectorization and prefetching is extremely perplexing.
Also, I should mention that the VTune analysis tells me that the most time consuming portion of the subroutine is the data load (shown below), by what looks to be a factor of 4 relative to the next most time consuming line:
x2 = x(neigh); y2 = y(neigh); z2 = z(neigh)
Have you referred to Rakesh's article on indirect prefetch for MIC?
No. The one I have been poring over is Intel's "management for optimal performance: alignment and prefetching."
In reference to the vectorization intensity: I have just noticed that when I run the vectorization report as
ifort -align array64byte -vec-report global.f90 mod_force.f90

I get
mod_force.f90(25): (col.8) remark: SIMD LOOP WAS VECTORIZED
mod_force.f90(25): (col.8) remark: *MIC* SIMD LOOP WAS VECTORIZED
However, when I analyze the vectorization report as
ifort -align array64byte -vec-report6 global.f90 mod_force.f90

I get
mod_force.f90(25): (col.8) remark: SIMD LOOP WAS VECTORIZED
....some alignment statements....
mod_force.f90(25): (col.8) remark: *mic* loop was not vectorized: vectorization possible but seems inefficient
mod_force.f90(25): (col.8) warning #13379: *mic* loop was not vectorized:
Is this indicating that the loop was actually not vectorized, although the lower number vectorization reports indicated that it was?
I'm curious which version of VTune is under discussion. I'm considering removing 2015 and reverting, e.g. to 2013 update 17, which worked well on MIC B0 with mpss 3.3 (though it only infrequently produced meaningful results for vectorization intensity). I've tried both 2015 updates 0 and 1; perhaps those are restricted to some recent combination of mpss and host OS or MIC hardware, although there's no warning to that effect.
If you're trying to optimize indirect access (even on host), the article I alluded to
https://software.intel.com/sites/default/files/managed/5d/f3/5.3-prefetching-on-mic-4.pdf
is worth reading.
Thank you for the reference. The version of vtune is the one stampede uses, which is indeed version 17. I am at a loss as to why the code is not vectorizing.
Hi Connor,
There are a number of factors that come into play when you analyze your application with Intel VTune Amplifier XE. Here are some knobs you could play with:
1) Application Duration Estimate: In the advanced project properties, please select the correct application duration estimate (command line switch: -target-duration-type). The application duration estimate helps the analyzer select the correct Sample After Value (SAV). Selecting an appropriate SAV is critical to getting statistically correct results.
2) Disabling Multiplexing: Intel Xeon Phi Coprocessor has a small number of performance monitoring counters. As a result, whenever the analyzer needs to collect a larger number of events, it multiplexes the events during the application run. This multiplexing of events can again result in statistically invalid results. If your application performance does not vary significantly between runs then you can disable multiplexing. To disable multiplexing please select "Allow multiple runs" in the advanced project properties (command line: -allow-multiple-runs)
You can read more about statistical validity of results as well as SAV in this article.
Lastly, the Vectorization Intensity metric has its own corner cases, one of which happens to be scatter/gather instructions. I have documented two such corner cases in this article.
I hope this helps on the VTune front.
-Sumedh
Sumedh, that did indeed work. I applied the suggestions and saw that the L1 compute intensity was no longer zero. Motivated by your point about the application duration estimate, instead of running the subroutine only five times, I ran it 1000 times. The vectorization intensity then came out as 6.926, with a CPI rate of 7.462. In the bottom-up section of VTune, when I am looking at the source code, I have a question about the vectorization intensity VTune reports for each line of code. For instance, in
x2 = position(neigh)%x; y2 = position(neigh)%y; z2 = position(neigh)%z
dx = x2-x1
dy = y2-y1
dz = z2-z1
dr2 = dx*dx + dy*dy + dz*dz
dr2i = 1.0d0/dr2
dr6i = dr2i*dr2i*dr2i
I would think that all the calculations I listed after the data load should be operating at 100% simd efficiency. However when I look at the vectorization intensity of these lines, I am seeing that I get a vectorization intensity of 3.56 for dx, 0.5 for dy, and 1.2 for dz. Since these are double precision numbers, I would expect these to be 8. Am I not interpreting this vectorization intensity correctly?
Connor,
The analyzer can run into issues when trying to link the source to the corresponding assembly because of compiler optimizations and such. So, I wouldn't blindly believe the vectorization intensity numbers for individual source lines. You would need to drill down into the assembly and verify that Intel VTune Amplifier XE is indeed correctly linking the source lines to assembly. In this particular case, I suspect that Intel VTune Amplifier XE is counting instructions from other source lines.
Fortran is (thankfully) not C++. In general, loops vectorize better in Fortran than C++ but there is a caveat. Some things that work well in C++ (like user defined types) make life more difficult in Fortran, at least as far as producing that efficient code Fortran is famous for.
You asked why you could use the alloc_if when you were using individual arrays but not if you were using a variable of user defined type containing those arrays. The simple answer is that a scalar variable of user defined type must be allocatable and bitwise copyable before you can use alloc_if. If you wanted to, you could try declaring SOAposit to be allocatable and then allocating it before you get to the offload directives. But there are simpler solutions.
Consider this code:
module globals
  type atomCollec
    double precision :: x(100000)
    double precision :: y(100000)
    double precision :: z(100000)
  end type atomCollec
  !dir$ attributes offload:mic :: SOAposit
  type(atomCollec) :: SOAposit
end module

program huh
  use globals
  do i=1,100000
    SOAposit%x(i) = i
    SOAposit%y(i) = i
    SOAposit%z(i) = i
  end do
  print *,"before ",SOAposit%z(6)
!dir$ offload_transfer target(mic:0) in(SOAposit)
!dir$ offload begin target(mic:0) out(SOAposit)
  do i=1,100000
    SOAposit%x(i) = i+1
    SOAposit%y(i) = i+1
    SOAposit%z(i) = i+1
  end do
!dir$ end offload
  print *,"after ",SOAposit%z(6)
end
In this case, SOAposit is a global with attribute offload:mic. It exists on the coprocessor from the time the process is created on the coprocessor until that process goes away. You still need to copy the data over but it is not necessary to allocate space for it because that is done by the variable declaration statement and because it is global, its value remains set between offload calls. I haven't played around with it enough to see if there are any "gotchas" that would cause the host and the coprocessor allocations to not be bitwise copyable, but I don't think you need to worry about that. (Others may contradict me on that.)
Personally, instead of the monolithic globals module, I would break it down into several modules by the purpose of the variables. (As a first pass, I might break it up at each point where you have a comment saying something like "these variables are for doing X".) Then, instead of going through and adding an attribute statement for each variable as needed, I would assign the offload attribute to the entire module for those modules that contain the variables needed on the coprocessor. This is mostly a matter of taste, but I think it might help with maintainability.
As far as the conflicting vectorization messages, -vec-report, by itself, only tells you what did vectorize, not what didn't. When you increase the report level, you get different information. Did the "*MIC* SIMD LOOP WAS VECTORIZED" comment actually disappear at the vec-report6 level? That surprises me. I would have expected:
mod_force.f90(25): (col.8) remark: SIMD LOOP WAS VECTORIZED
mod_force.f90(25): (col.8) remark: *MIC* SIMD LOOP WAS VECTORIZED
....some alignment statements....
mod_force.f90(25): (col.8) remark: *mic* loop was not vectorized: vectorization possible but seems inefficient
mod_force.f90(25): (col.8) warning #13379: *mic* loop was not vectorized:
What this would be saying is that the compiler generated both a vector and a scalar version of that loop. From the comments, it isn't really possible to determine which would be used when, but I suspect that it might be using the vector version when it thinks it knows enough about vlistl to know that the gather is worth it. (I believe there is a movement afoot among the developers to make the messages clearer, but don't quote me on that.) As you are using it here, however, I don't think the compiler has any confidence as to what the pattern of the gather would be and it worries that doing the gather might hurt more than it helps.
The _mm_prefetch is a C intrinsic. In C, it is inlined and efficient. In Fortran, you need to call it as a function or subroutine with all the overhead that entails. So, it is not surprising that using _mm_prefetch wasn't helpful. Good try though. There is not an equivalent for Fortran. In addition, if there is a way to make the code run well without having to explicitly specify the prefetch, that is a good thing. It increases the chances that your code will run well on future systems using the MIC architecture without the need to make changes - you will rely on the compiler to adjust for things like cache sizes.
I am trying to think of anything that could be done to vlistl to make the gather more efficient, but right now, all I can do is say that I'm with Jim on his approach - adding the x,y,z values to the vlistl - but I can understand your reasons for not wanting to do that.
Thank you so much for that response. That makes much more sense now. I have been pounding my head on a desk trying to determine why, although I could see via VTune that the time associated with the data load decreased when I prefetched, the total time was going up. The prefetch calls were killing the performance due to subroutine call overhead. I should have known it was a subroutine, since you have to write call. However, the books and example codes I have been looking at (all of which are C++) said to use prefetching, and I forgot to think about the difference between Fortran and C++.
Your SOA description did indeed work. Just to clarify, when it is global and I define it like
!dir$ attributes offload:mic :: SOAposit
type(atomCollec) :: SOAposit
I never have to worry about the MIC deallocating the array for the entire duration of the code run (which will be on host with offloads to MIC)? I am running with offload obviously, so I will start and end jobs on the MIC throughout the code run.
I have been tinkering with the code since my last post, and now I can't seem to reproduce that vec output. It tells me now
mod_force.f90(25): *MIC* SIMD loop was vectorized
...some alignment stuff...
mod_force.f90(25): *MIC* remainder loop was vectorized
I am pretty sure the issue earlier was due to the fact that the compiler doesn't know that the trip count of the inner loop is big enough to benefit from vectorization, i.e., line 25:
do j = 1,numneigh(i)
That must be why it's generating two versions of the loop.
I have a question regarding the alignment output. If I do the AOS format
type atomCollec
  double precision :: x,y,z
end type atomCollec
type(atomCollec) :: position(100000)
and compile with:
ifort -align array64byte
I am curious as to why the vectorization output is telling me
mod_force : *MIC* vectorization support: gather was generated for the variable global_mp_position: indirect access
Why is it not saying
mod_force : *MIC* vectorization support: gather was generated for the variable global_mp_position: indirect access, 64 bit indexed
Is the lack of the 64 bit indexed output telling me that the AOS isn't getting padded and aligned to a 64 byte boundary?
One other question: is there a nearest-integer function that can be vectorized? For instance, if I wanted to do something like
double precision :: box
dx = dx-box*ieee_rint(box*dx)
is there a function that can do this and be vectorized? I am checking compiler output for nint, anint, and ieee_rint, and it doesn't look like those can be.
nint and anint have the problem of not using the IEEE hardware rounding method (and are thus more complicated), with nint having the additional mixed-data-type problem. If ieee_rint doesn't generate vectorizable code, there is the ugly expression (expecting -assume protect_parens and the default rounding mode to be set)
((x + sign(1/epsilon(x),x)) - sign(1/epsilon(x),x))
which should produce the same result as ieee_rint in nearly all cases (but you should check that your results are as expected). If nint was working, you shouldn't hit the cases where this expression rounds prematurely to multiples of 2.
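The expression works because at magnitude 1/epsilon(x) (2^52 for double precision) the spacing between floating-point values is exactly 1, so the addition itself performs the round-to-nearest and the subtraction recovers the rounded value. A C sketch of the same trick, assuming the default round-to-nearest mode and no value-unsafe optimizations (illustrative, not the Fortran compiler's implementation):

```c
#include <float.h>
#include <math.h>

/* Round-to-nearest-even via the "magic constant" 1/DBL_EPSILON = 2^52:
   at magnitude 2^52 the ulp of a double is exactly 1.0, so x + 2^52 is
   rounded to an integer by the add itself, and subtracting 2^52 recovers
   the rounded value.  Only valid for |x| < 2^52, and only under the
   default rounding mode; volatile blocks algebraic simplification. */
static double rint_trick(double x) {
    volatile double m = copysign(1.0 / DBL_EPSILON, x);  /* +-2^52 */
    volatile double t = x + m;
    return t - m;
}
```

Like ieee_rint, ties go to even, so 2.5 rounds to 2.0 rather than the 3.0 that nint would give.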
Did you make a change in the usage of position(:) which you expected to eliminate the gather? You have been showing access with stride 3, which would be expected to compile to gather instructions. We were discussing already whether you wanted to check on how this was affecting prefetch.
conor p. wrote:
You're SOA description did indeed work. Just to clarify, when it is global and I define it like
!dir$ attributes offload:mic :: SOAposit
type(atomCollec) :: SOAposit
I never have to worry about the MIC deallocating the array for the entire duration of the code run (which will be on host with offloads to MIC)? I am running with offload obviously, so I will start and end jobs on the MIC throughout the code run.
Yes, if the memory is allocated as part of the global variable declaration, you never have to worry about the MIC deallocating the array for the entire duration of the code run. If the declaration does not allocate the space (if you declare the variable as allocatable or use a pointer type), you will want to use the alloc_if and free_if.
conor p. wrote:
I have a question regarding the alignment output. If I do the AOS format
type atomCollec
  double precision :: x,y,z
end type atomCollec
type(atomCollec) :: position(100000)
and compile with:
ifort -align array64byte
I am curious as to why the vectorization output is telling me
mod_force : *MIC* vectorization support: gather was generated for the variable global_mp_position: indirect access
Why is it not saying
mod_force : *MIC* vectorization support: gather was generated for the variable global_mp_position: indirect access, 64 bit indexed
Is the lack of the 64 bit indexed output telling me that the AOS isn't getting padded and aligned to a 64 byte boundary?
There is no padding. The first element of the array is aligned. The remaining elements are not.
conor p. wrote:
One other question. Is there a nearest integer function that can be vectorized. For instance, if I wanted to do something like
double precision :: box
dx = dx-box*ieee_rint(box*dx)
is there a function that can do this and be vectorized? I am checking compiler output for nint, anint, and ieee_rint, and it doesn't look like those can be.
I will do some more investigation, but for the particular example you give, I defer to Tim's answer.
So if -align array64byte only aligns the first element of the array, what would I have to do to pad the data structure
type atomCollec
  double precision :: x,y,z
end type atomCollec
or the neighbor list
integer :: vlistl(256*np)
so that they are multiples of the cache line size? Could this be why I am not seeing the 64 bit indexed, and just seeing indirect access?
Tim, I ran some performance tests using the AOS and SOA data structure formats. Although the SOA format did indeed give better vectorization, as suspected, the data load time associated with it seemed to be worse than with the AOS format. This led to slightly worse performance. So the code looks like
do i = 1,np
  x1 = position(i)%x; y1 = position(i)%y; z1 = position(i)%z
  do j = 1,numneigh(i)
    neigh = vlistl(j + neigh_alloc*(i-1))
    x2 = position(neigh)%x; y2 = position(neigh)%y; z2 = position(neigh)%z
    ....compute stuff....
  enddo
enddo
Looking up the index neigh must be what's giving the indirect access, which makes sense. I am just trying to figure out why the 64 bit indexed isn't showing up.
As to the nearest-integer function, the compiler did not generate any warnings saying it wasn't vectorized. However, when I ran it in VTune, there was no vectorization intensity shown for those lines. Then again, one of the lessons I have taken away from this thread is to be very suspicious of VTune's vectorization report.
