Colleagues,
This is meant to be a summary of one coding group's experience with three aspects of Fortran programming (vectorization, parallelization, and Coarray Fortran) and to elicit other opinions, experiences, and insights.
Background: Our group provides commercially available software for the building design and construction industries. Our most extensive code is, essentially, an elaborate radiative transfer analysis of a building. Input is user-produced CAD along with supporting data that describes building equipment. The essential computational tasks involve computational geometry, radiative transfer analysis, setting up and solving systems of equations, and so on: the typical tasks of most large-scale engineering analysis systems. Over the past 10 years we have generated, and now modify/maintain/update, about 250,000 lines of code.
Vectorization: This has proved to be (for our work, at least) the most important and efficacious optimization technique -- by far. The analysis of even a modest-sized project involves tens of millions of dot-products, cross-products, and geometric bounds checks. Good practice of a decade ago had data arranged so that, say, the x,y,z Cartesian coordinates of a vertex were contiguous in memory: x:y:z. That is best for an individual dot-product or cross-product. Now it is best to arrange arrays so that all the x coordinates are contiguous: x1:x2:x3: . . .:xn, and similarly for y and z -- or at least to maintain a duplicate data set with the coordinates arranged so. The processing of the x-part (and likewise the y- and z-parts) of a large set of dot-products is then vectorizable:
DotProd(1:N) = CoorA(1:N,1)*CoorB(1:N,1)+CoorA(1:N,2)*CoorB(1:N,2)+CoorA(1:N,3)*CoorB(1:N,3)
where CoorA(1:N,1) holds the x-coordinates of all N surfaces, and so on. We have found the speed-up to be larger than that expected from just the use of SIMD (4, in our case); evidently, memory is (much) better used/accessed in this way. In general, we have found this to be (much) faster even when some of the dot-products produced are not used or are inappropriate. That is, it is better to throw away some of the vectorized results than to go to the trouble of not computing them. We have found the speed-up to be even greater for the cross-product-intensive part of our code.
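For the cross-products under the same layout, the analogous statements (CrossX, CrossY, and CrossZ are merely illustrative names for the result arrays) would look like:
CrossX(1:N) = CoorA(1:N,2)*CoorB(1:N,3) - CoorA(1:N,3)*CoorB(1:N,2)
CrossY(1:N) = CoorA(1:N,3)*CoorB(1:N,1) - CoorA(1:N,1)*CoorB(1:N,3)
CrossZ(1:N) = CoorA(1:N,1)*CoorB(1:N,2) - CoorA(1:N,2)*CoorB(1:N,1)
Each statement sweeps all N surfaces at unit stride, so each vectorizes just as the dot-product does.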
We have found that axis-aligned bounding-box checking is another important opportunity for vectorization. Here InOut is a vector of integers:
InOut(1:N) = merge(1, 0, Coor(1:N,1) < BoxMaxX) * merge(1, 0, Coor(1:N,2) < BoxMaxY) * merge(1, 0, Coor(1:N,3) < BoxMaxZ)
The check against the bounding-box minimum coordinates can be (and often is) concatenated onto the Max check. In general (that is, statistically) we find this to be considerably faster than an explicit, early-out loop that checks x, then y, then z. Obviously, if any of the checks fail, the value of InOut for that surface will be zero. In this regard, we have been looking for an efficient way to pack the zeros out of a long vector -- without success so far. The intrinsic PACK routine is hopelessly slow. We also wonder (we have made no investigation yet) whether such results are better stored in vectors of smaller element size: 2-byte integers, or 1-byte integers. If the results are later operated on repeatedly, and SIMD is used, then instead of 4-at-a-time, the testing/evaluating/checking can be done 8-at-a-time, or in even larger clumps.
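For reference, a minimal sequential compaction loop of the kind we benchmark PACK against would look like this (Idx and NKept are illustrative names; Idx must be dimensioned to hold up to N entries):
NKept = 0
DO i = 1, N
    IF (InOut(i) /= 0) THEN
        NKept = NKept + 1
        Idx(NKept) = i        ! record the index of the surviving surface
    END IF
END DO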
All this is obvious. But what is important (to us, at least) is that in general, in practice (statistically, for most projects), the speed-up is significant and worth the significant and widespread changes in code required. This is an important consideration for those dealing with valuable legacy code. And to some extent it requires a different type of thought (maybe even different algorithms) when generating new code. We imagine that as the SIMD registers get larger, these effects will be even more pronounced.
Parallelization: We have found that in general, and for our code, parallelization by threading is essentially useless. (Our team jokes that parallelization/OpenMP isn't a false promise, it's a cruel hoax.) To be sure, there is lots of evidence that there are many cases where sharing work among multiple threads is very efficacious. But we find, almost always, that the overhead involved completely swamps whatever gain there might be. Some of this is due to the nature of what we are computing. There are very, very few places in our analysis where the work to be done is "tight"; that is, expressible or accomplishable with just a few operations and so just a few lines of code -- as when one multiplies matrices or manipulates 10^8 pixels in an image. In general, the work to be done in our code is elaborate, and so the work necessary to establish threads is also elaborate. If, for example, we have 10^4 surfaces, then we have 10^8 occlusion analyses to do (can one surface "see" another?). There might be 10^3 potential blocking surfaces to check, with each check requiring a relatively elaborate analysis. By the time we back out of the nested loops far enough to prevent overhead/setup time from being prohibitive, it proves better (by far) to use the Coarray Fortran paradigm. We are particularly interested in others' experience (and advice!) in this regard.
Having written that, I should add that there are some (very few) times when threading is efficacious, as in matrix multiplication. By the way, if you are interested in a crystal-clear, practical, detailed exposition of how such a task can be handled, we suggest you view the series of videos that Jim Dempsey (a frequent and important contributor to this forum) has produced. You can find the link at his web site.
In general, we have found that evaluations of various optimization techniques that use matrix multiplication are not useful, because they are NOT indicative of what is required for scientific/engineering work that involves repeated use of an elaborate or lengthy process. I don't mean to sound silly, but we no longer pay attention to claims (or evaluations) that involve matrix multiplication. The problem is, in many ways, trivial and not sufficiently indicative. The difficult and expensive work is setting up the matrices or system of equations, not multiplying the matrices or solving the system.
Coarray Fortran: We have had considerable success with this. Very considerable. Our approach does not focus on the data shared between images (the coarrays), but rather on the opportunity to have multiple instances of (very nearly) identical code working on pieces of very large problems. We note the following. The most difficult part of making effective use of multiple images is predicting the workload. We have had to spend considerable time developing quick, effective ways to predict the work and so generate more-or-less even workloads for each image -- in our case, for example, simple functions involving surface area, orientation, square of separating distance, and so on. This turns out to be important (and non-trivial), since it doesn't help to have 1 or 2 of the images doing all the heavy lifting. In this regard, we have found it useful to have a non-coarray Fortran program do an initial analysis and determine the workload, and then have it launch a coarray Fortran program that establishes multiple images and performs the work.
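As a minimal sketch of the image-side partitioning (an equal-count split is shown for brevity, and all names are illustrative; our production split weights each surface by the predicted cost instead):
program image_work
    implicit none
    integer, parameter :: nSurfaces = 10000
    integer :: me, nimg, chunk, first, last
    me   = this_image()
    nimg = num_images()
    ! Equal-count block partition; the real code would weight each
    ! surface by the predicted cost function instead.
    chunk = (nSurfaces + nimg - 1) / nimg
    first = (me - 1)*chunk + 1
    last  = min(me*chunk, nSurfaces)
    print '(a,i0,a,i0,a,i0)', 'image ', me, ': surfaces ', first, ' to ', last
end program image_work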
As Steve Lionel has mentioned several times, the implementation of coarrays in the Intel Fortran compiler is a work in progress, and aspects of it will improve over time. For the present, we find the communication between images using coarrays directly to be too slow; communication using files is faster. (We were surprised, too.) This may change. Currently, we limit communication between images through coarrays to the start and end of the work to be done, and each image writes its result to a file. The "launcher" Fortran program (having waited for all images to finish) then gathers the results into a single, neat package.
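As a rough sketch of the per-image output step (the file-naming scheme and the payload here are illustrative only):
character(len=32) :: fname
integer :: u
real :: PartialResults(1000)           ! illustrative payload
write(fname, '(a,i0,a)') 'result_', this_image(), '.dat'
open(newunit=u, file=trim(fname), form='unformatted', access='stream')
write(u) PartialResults                ! each image writes only its share
close(u)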
We have found it important to limit the number of images to the number of physical cores present on the host machine. Using the virtual cores in addition to the physical ones generally slows the overall process. And so, setting the appropriate number-of-images environment variable (FOR_COARRAY_NUM_IMAGES, in the Intel implementation) is very important, since we have found the slowing effect can be considerable. Several months ago, Steve provided a routine that can be called from Fortran that returns this information about a host.
We strongly suspect that Coarray Fortran will be our team's most significant investment in optimizing our engineering code in the future.
Perhaps I should apologize for such a long post, but it is a very interesting subject and we are interested in others' experiences and findings.
David
Fortran PACK is useful on Intel(r) Xeon Phi(tm) on account of specific hardware support. As you say, it's likely to be slower than sequential DO loop code on Xeon.
The advantage of short data types would be in reducing cache and memory use, so it's highly problem-dependent; no advantage is to be expected unless cache is the limitation.
OpenMP parallelization may not be useful if it is introduced at the expense of vectorization, although that was the way it was touted when Intel originally began introducing multi-thread hardware support. For it to be interesting in concert with vectorization, the problem-size threshold is higher than for vectorization alone, e.g. outer parallel loop count > 1000 together with inner vector loop count > 1000 (2000 x 2000 for Intel(r) Xeon Phi(tm)).
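A minimal sketch of the loop shape that can pay off (names are illustrative; the point is a threaded outer loop wrapped around a unit-stride inner loop left to the vectorizer):
!$OMP PARALLEL DO PRIVATE(i)
DO j = 1, nCols            ! threaded outer loop; worthwhile only for large counts
    DO i = 1, nRows        ! unit stride in Fortran, so the compiler can vectorize
        C(i,j) = A(i,j) + B(i,j)
    END DO
END DO
!$OMP END PARALLEL DO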
Matrix multiplication is well known as a unique situation in which the memory bandwidth usage may be negligible, and great advantage can be taken by expert low level coding. By contrast, I see many problems where the best we can do with the VTune analysis is to see 2 to 3 uses of each array element per memory access. As a result, there may be many cases where the speedup by using both threaded parallel and vectorization is only 2-3 times what is possible with either optimization alone.
David:
What hardware and what OS are you using?
Thank you for the kind reference.
From one perspective I see no functional difference between using coarrays and using OpenMP. I am going to make an intuitive leap and assume that your optimization strategy was: the optimization strategies used for vectorization are also to be used for OpenMP. (What is good for the goose is good for the gander.)
Effectively meaning: look for hot spots and apply a technique there (reorganizing the data layout, as you did by swapping the position of the X, Y, Z index), then apply the optimization at the hot spot.
As you have seen (experienced), this is not necessarily the strategy to use with OpenMP. In fact, it is most often the most unproductive use of OpenMP.
Take a step back and look at what you did to adapt (convert) your application to use coarrays. Effectively, you reached in and pulled the parallelization out from the inner hot loops all the way up to the PROGRAM level. IOW, the compartmentalization decision of work distribution is made at (or near) program start.
When you optimize for OpenMP, you should follow a similar set of rules. Principally, when necessary, reorganize your data and your thinking such that you can lift your parallelization from the inner hot loops to the highest level that proves (or is anticipated to be) optimal, given the number of hardware threads.
OpenMP has an advantage over coarrays in that you have greater flexibility in choosing where and how to perform the work compartmentalization. There is a learning curve in how to use this advantage. As you have found out, OpenMP can be used ineffectively as well.
Although I do not suggest you program this way, you could conceptually view coarrays as:
PROGRAM
    real :: array(nPoints, 3, 0:nThreads-1)
!$OMP PARALLEL PRIVATE(iThread)
    iThread = omp_get_thread_num()
    ...
    array(I, J, iThread) = ...
    ...
!$OMP END PARALLEL
END PROGRAM
*** I strongly suggest not doing the above; it is meant only as a conceptualization relating coarrays to OpenMP. This is not the place to say why it would be a bad choice.
In OpenMP, your likely strategy might be
real :: array(nPoints, 3, nSurfaces)
...
!$OMP PARALLEL DO PRIVATE(mySurface, otherSurface)
DO mySurface = 1, nSurfaces
    DO otherSurface = 1, nSurfaces
        CALL doWork(mySurface, otherSurface)
        ...
    END DO
END DO
!$OMP END PARALLEL DO
The above has similarities to coarrays; however, the partitioning is different. The sub-range of surfaces per thread is determined at the point of the !$OMP PARALLEL DO, rather than at the start of the program. Additionally, you effectively get what would be direct inter-image access (the inner DO loop) without the messaging overhead of coarrays (MPI).
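For uneven per-surface work, a dynamic schedule (a standard OpenMP clause; the chunk size here is only illustrative) can do at run time the load balancing your coarray launcher now does with a prediction pass:
!$OMP PARALLEL DO SCHEDULE(DYNAMIC,8) PRIVATE(otherSurface)
DO mySurface = 1, nSurfaces
    DO otherSurface = 1, nSurfaces
        CALL doWork(mySurface, otherSurface)
    END DO
END DO
!$OMP END PARALLEL DO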
Give OpenMP another look.
Jim Dempsey
