I am trying to structure my code to take advantage of vectorized operations. I have an angular quadrature of order N and, in most cases, the angular quadrature is symmetric (the number of ordinates in the positive direction is equal to the number of ordinates in the negative direction). I want to gracefully handle the case where the ordinates might be asymmetric.
If the ordinate set is symmetric, I could have a data structure of the form Z(N/2,D,F), where F is the number of faces and D is the direction. A common operation in my code is
j = Z(:,Positive_Direction,i) * scalar
... do some stuff with j
For an asymmetric set I can either be efficient in memory utilization and use Z(N,F), where the two directions have been "merged," or use Z(Np,D,F), where Np = MAX(# of ordinates in the positive or negative direction). The latter approach wastes some memory in the asymmetric case, but it keeps the code straightforward. The former approach would require array sections or a WHERE clause.
The last time I looked at the assembly code produced for array sections and WHERE clauses, temporary arrays were generated. I think it would be better to pay the price of the additional memory utilization and avoid the array temporaries, because a symmetric quadrature is the most common case. Is this a reasonable assessment of how the Intel compiler optimizes the code?
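[Editor's sketch of the padded layout described above, with hypothetical sizes; Np, Positive_Direction, Z, j, and scalar follow the names used in the post. Padding entries are zero-filled so a full-slice operation stays valid for an asymmetric set.]

```fortran
program layout_sketch
  implicit none
  integer, parameter :: Np = 8         ! max(# positive, # negative ordinates)
  integer, parameter :: Ndir = 2       ! two directions
  integer, parameter :: F = 6          ! number of faces
  integer, parameter :: Positive_Direction = 1
  real(8) :: Z(Np, Ndir, F)
  real(8) :: j(Np)
  real(8) :: scalar
  integer :: i

  Z = 0.0d0        ! zero-fill; padded entries are harmless in later sums
  scalar = 2.0d0
  i = 3

  ! The common operation: a contiguous, stride-1 slice in the first
  ! dimension, which the compiler can vectorize without a temporary.
  j = Z(:, Positive_Direction, i) * scalar
end program layout_sketch
```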
>> j = Z(:,Positive_Direction,i) * scalar
Should vectorize assuming compiled with appropriate options and assuming the range expressed by : is larger than a handful.
Typically the vectorization introduces some overhead: determining the alignment of the first element, and determining whether the number of (remaining) elements is a multiple of the vector width (2 or 4, depending on whether the data are REAL(8) or REAL(4)). This overhead is paid back only if the number of elements is roughly 8 or more.
If you can ensure that Z(:,:,:) is always vector aligned, then you can use a compiler directive to induce vectorization (hopefully) without the alignment overhead.
It doesn't hurt to check the disassembly window to see what was produced.
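[Editor's sketch of the directive approach, assuming the Intel Fortran directives !DIR$ ATTRIBUTES ALIGN and !DIR$ VECTOR ALIGNED; the names are illustrative, not from the thread.]

```fortran
module aligned_sketch
  implicit none
  ! Force the array onto a 16-byte boundary (SSE vector width).
!DIR$ ATTRIBUTES ALIGN : 16 :: Z
  real(8) :: Z(8, 2, 6)
contains
  subroutine scale_face(src, dst, scalar, n)
    integer, intent(in)  :: n
    real(8), intent(in)  :: src(n), scalar
    real(8), intent(out) :: dst(n)
    integer :: k
    ! Assert to the compiler that src/dst are vector aligned, so it
    ! may skip the runtime alignment peel loop. The caller must
    ! actually guarantee this, or results are undefined.
!DIR$ VECTOR ALIGNED
    do k = 1, n
       dst(k) = src(k) * scalar
    end do
  end subroutine scale_face
end module aligned_sketch
```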
Jim Dempsey
>> Typically the vectorization introduces some overhead: determining the alignment of the first element, and determining whether the number of (remaining) elements is a multiple of the vector width (2 or 4, depending on whether the data are REAL(8) or REAL(4)). This overhead is paid back only if the number of elements is roughly 8 or more.
>> If you can ensure that Z(:,:,:) is always vector aligned, then you can use a compiler directive to induce vectorization (hopefully) without the alignment overhead.
I thought much of the floating-point improvement in the new Intel processors was due to vectorization (SSE3). If I understand correctly, if there are fewer than ~8 elements then vectorized code will be slower than executing the instructions in a scalar loop? Because I am using REAL(8), I need to keep the vector length a multiple of 2 to reduce the overhead, correct?
I forgot to mention that there is another issue that can interfere with vectorization.
In Fortran, an array section can have a descriptor containing a stride. Given
real :: A(100)
the section A(1:100:10) describes 10 elements: A(1), A(11), A(21), ..., A(91).
If a subroutine declares an array argument as assumed-shape,
subroutine sub(A)
real :: A(:)
then the compiler might not know the stride of the array at compile time. Not knowing the stride makes it difficult to vectorize the code.
You can assist vectorization by declaring your array arguments in a manner that is explicit as to the bounds and stride.
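[Editor's sketch contrasting the two declaration styles; the subroutine names are illustrative. An assumed-shape dummy may legally receive a strided section, so the compiler must handle a runtime stride; an explicit-shape dummy is always contiguous with unit stride.]

```fortran
! Assumed-shape: may receive A(1:100:10), so the stride is a
! runtime quantity and vectorization is harder.
subroutine scale_assumed(A, s)
  implicit none
  real(8), intent(inout) :: A(:)
  real(8), intent(in)    :: s
  A = A * s
end subroutine scale_assumed

! Explicit-shape: the dummy is known to be contiguous (unit stride),
! so the compiler can vectorize without a runtime stride check.
subroutine scale_explicit(n, A, s)
  implicit none
  integer, intent(in)    :: n
  real(8), intent(inout) :: A(n)
  real(8), intent(in)    :: s
  A = A * s
end subroutine scale_explicit
```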
Jim Dempsey
Here is a description of the most time-consuming part of my code (more complicated than the simple example presented earlier). I have to sweep across the spatial mesh for each ordinate in the angular quadrature. The sweeping has to be done in "flow order," e.g. in XYZ geometry, when sweeping from the [back, left, bottom] to the [front, right, top], only the positive direction cosines are used. The angular quadratures range from low order (e.g. 4 ordinates per octant) to high order (32 to 128 ordinates per octant). If a first-moment spatial quadrature is used, then each ordinate has two values (zeroth and first moments). In XYZ geometry there are three inflow faces and three outflow faces. Thus, in XYZ geometry using 4 ordinates per octant and a zeroth-moment spatial quadrature, each cell is represented by an inflow vector and an outflow vector of length 12. The transport matrix is a sparse 12 by 12 matrix.
The way I have written the code now, I copy the relevant ordinates from an array containing all the ordinates for each cell face into a smaller vector that contains the inflows for the current cell. I then construct the transport matrix and compute the outflows for the cell using a MATMUL. The outflows are then copied to the larger array.
Because the transport matrix is sparse, the Sparse BLAS routines in MKL might be the best approach (if I recall correctly, the Sparse BLAS can be faster than MATMUL for matrices around 10 by 10). Alternatively, because of the copying from and to the larger array, coupled with the sparseness of the transport matrix, I am also considering doing the arithmetic directly in the code and avoiding the MATMUL or a Sparse BLAS call (it would fit in a FORALL).
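[Editor's sketch of the "do the arithmetic directly" alternative: a minimal compressed-sparse-row (CSR) matrix-vector product for the small transport matrix. The storage names (val, colidx, rowptr) are hypothetical, not from the thread or MKL.]

```fortran
! y = A*x for an n-by-n sparse matrix A stored in CSR form:
!   val(k)    = k-th nonzero value
!   colidx(k) = column of the k-th nonzero
!   rowptr(r) = index in val/colidx where row r starts
subroutine csr_matvec(n, rowptr, colidx, val, x, y)
  implicit none
  integer, intent(in)  :: n
  integer, intent(in)  :: rowptr(n+1), colidx(*)
  real(8), intent(in)  :: val(*), x(n)
  real(8), intent(out) :: y(n)
  integer :: row, k
  do row = 1, n
     y(row) = 0.0d0
     do k = rowptr(row), rowptr(row+1) - 1
        y(row) = y(row) + val(k) * x(colidx(k))
     end do
  end do
end subroutine csr_matvec
```

For a fixed 12-by-12 sparsity pattern, the inner loops could even be unrolled by hand, which avoids both the MATMUL call overhead and the multiplications by structural zeros.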
The bottom line is I'm trying to understand how the compiler vectorizes and the overhead cost.
You might find the code will run better without copying the data to/from the transport matrix. The copy operations will eat up any savings from using a vectorized MATMUL.
Here is another suggestion:
Your geometries are unlikely to change during a run. Why not keep the data internally in a format usable by MATMUL and other functions, and only perform the transformation from the internal format to XYZ format when you need that format? You may find that a conversion function(X,Y,Z) producing a simple index into the transformation array requires less computation than copying to/from a transport matrix.
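[Editor's sketch of the suggested conversion function; the name cell_index and the column-major-style ordering are illustrative assumptions, not from the thread.]

```fortran
! Map (x, y, z) cell coordinates (1-based) to a single linear index
! into the internally formatted array, so the data can stay in the
! MATMUL-friendly layout and only be reindexed on demand.
pure integer function cell_index(x, y, z, nx, ny) result(idx)
  implicit none
  integer, intent(in) :: x, y, z   ! cell coordinates
  integer, intent(in) :: nx, ny    ! mesh extents in x and y
  idx = x + nx * ((y - 1) + ny * (z - 1))
end function cell_index
```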
Jim Dempsey