Vectorization: unaligned access to dynamic allocatable arrays

mguy44 · 05-11-2015

Hello, I want to improve the performance of my F90 application running on recent Intel Xeon processors. I use the -vec-report6 option and get messages like:

./SRC/flux.f(106): (col. 10) remark: vectorization support: reference cominc_mpi_mp_q1_ has unaligned access
./SRC/flux.f(112): (col. 10) remark: vectorization support: reference cominc_mpi_mp_q1_ has aligned access
./SRC/flux.f(113): (col. 10) remark: vectorization support: reference mod_flux_mp_fc_ has unaligned access
./SRC/flux.f(116): (col. 10) remark: vectorization support: reference mod_flux_mp_fc_ has unaligned access
./SRC/flux.f(116): (col. 10) remark: vectorization support: reference cominc_mpi_mp_q1_ has aligned access
./SRC/flux.f(117): (col. 10) remark: vectorization support: reference mod_flux_mp_fc_ has unaligned access
./SRC/flux.f(117): (col. 10) remark: vectorization support: unaligned access used inside loop body
./SRC/flux.f(105): (col. 7) remark: LOOP WAS VECTORIZED

The loop is the following:

      DO i = N1-3, N2+3
         R = Q1(i,5)
         RRq = one / Q1(i,5)
         Ux = Q1(i,1) * RRq
         Uy = Q1(i,2) * RRq
         Uz = Q1(i,3) * RRq
         Un = Ux * riX + Uy * riY + Uz * riZ
         P = Gm1 * (Q1(i,4) - half * R * (Ux*Ux + Uy*Uy + Uz*Uz) )
         Fc(i,1) = R * Ux * Un + P * riX
         Fc(i,2) = R * Uy * Un + P * riY
         Fc(i,3) = R * Uz * Un + P * riZ
         Fc(i,4) = (Q1(i,4) + P) * Un
         Fc(i,5) = R * Un
      END DO

Arrays Q1 and Fc are dynamically allocated at runtime. They are declared in modules, like this:

!dir$ attributes align:64 :: Fc, Q1
      Real(8), Dimension(:,:), Allocatable :: Fc
      Real(8), Dimension(:,:), Allocatable :: Q1

Allocation has the form

ALLOCATE (Fc(-2:Nmax+3,5), Q1(-2:Nmax+3,5), ...)

The compilation command is:

mpiifort -xHost -O3 -inline-forceinline -pad -opt-prefetch -mp1 -ftz -unroll-aggressive -132 -module ./obj_O3 -I./obj_O3 -I. -implicitnone -traceback -g -sox -fpp -vec-report6 -c ./SRC/flux.f -o./obj_O3/flux.o

I know that data alignment is important for vectorization, so I'd like to know whether something could be done to improve this. Thank you for your advice. Regards, Guy.

jimdempseyatthecove · 05-11-2015

Your allocation places Fc(-2, 1) at a byte location that is a multiple of 64 bytes.

Your DO loop begins at N1-3.

How is the compiler to know that N1-3 == -2 (always)?

Jim Dempsey
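The arithmetic behind that question can be sketched as below. This is only an illustration, assuming 8-byte Real(8) elements, a 64-byte boundary, and an array whose first element (index lb, here -2) is itself aligned. The module and function names are hypothetical and do not come from the application.

```fortran
module align_sketch
  implicit none
  integer, parameter :: elem_bytes  = 8    ! assumed storage size of one Real(8) element
  integer, parameter :: align_bytes = 64   ! boundary requested by the align:64 directive
contains

  ! Byte distance from the first element of a column (index lb) to element istart.
  pure integer function start_offset_bytes(istart, lb) result(off)
    integer, intent(in) :: istart, lb
    off = (istart - lb) * elem_bytes
  end function start_offset_bytes

  ! True when a loop starting at istart begins on an align_bytes boundary,
  ! assuming the element at index lb is itself aligned.
  pure logical function start_is_aligned(istart, lb) result(ok)
    integer, intent(in) :: istart, lb
    ok = modulo(start_offset_bytes(istart, lb), align_bytes) == 0
  end function start_is_aligned

end module align_sketch
```

With lb = -2, a loop starting at N1-3 = -2 has offset 0 and is aligned, one starting at -1 has offset 8 and is not, and one starting at 6 has offset 64 and is aligned again. Since N1 is only known at run time, the compiler cannot decide this at compile time.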

mguy44 · 05-11-2015

Hello Jim,

The compiler can't know this, but in fact it is always true.

Could a directive like

!DIR$ VECTOR ALIGNED

be useful in this case?

Regards,

Guy.

jimdempseyatthecove · 05-11-2015

The last line in the report output you show states "LOOP WAS VECTORIZED".

For loops like this, the compiler will (or can) generate both a vectorized and a non-vectorized version of the loop, as well as a section of code called the "peel" (for loops that look promising for vectorization). The peel processes the loop in scalar mode until the data is known to be aligned. The loop then continues with vectorized code, and finally, when a remainder is left over that does not fill a whole vector, remainder code processes it in scalar mode.
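Written out by hand, the peel / vector body / remainder split looks roughly like the sketch below. It assumes 8 doubles per vector and that the element at index lb is aligned. The compiler generates its own version of this, and the names here are hypothetical.

```fortran
module peel_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: vlen = 8   ! assumed: 8 doubles per 64-byte vector
contains

  ! Split the trip count of "do i = istart, iend" into peel, main and remainder
  ! parts, assuming the element at index lb sits on a vector boundary.
  pure subroutine split_trip_count(istart, iend, lb, npeel, nmain, nrem)
    integer, intent(in)  :: istart, iend, lb
    integer, intent(out) :: npeel, nmain, nrem
    integer :: ntotal
    ntotal = max(iend - istart + 1, 0)
    npeel  = min(modulo(lb - istart, vlen), ntotal)  ! scalar steps up to the next boundary
    nmain  = ((ntotal - npeel) / vlen) * vlen        ! iterations covered by whole vectors
    nrem   = ntotal - npeel - nmain                  ! leftover scalar iterations
  end subroutine split_trip_count

  ! Scale x(istart:iend) in three phases, mirroring the generated code structure.
  subroutine scale_three_phase(x, lb, istart, iend, factor)
    integer, intent(in)     :: lb, istart, iend
    real(dp), intent(inout) :: x(lb:)
    real(dp), intent(in)    :: factor
    integer :: i, k, npeel, nmain, nrem

    call split_trip_count(istart, iend, lb, npeel, nmain, nrem)

    ! Peel: scalar iterations until the index reaches a vector boundary.
    do i = istart, istart + npeel - 1
      x(i) = factor * x(i)
    end do

    ! Main body: one whole, aligned vector of vlen elements per step.
    do k = istart + npeel, istart + npeel + nmain - 1, vlen
      x(k:k+vlen-1) = factor * x(k:k+vlen-1)
    end do

    ! Remainder: scalar iterations that do not fill a whole vector.
    do i = istart + npeel + nmain, iend
      x(i) = factor * x(i)
    end do
  end subroutine scale_three_phase

end module peel_sketch
```

For an array declared x(-2:20) and a loop over 1..18, this gives 5 peel iterations (i = 1..5), one vector of 8 (i = 6..13) and 5 remainder iterations (i = 14..18).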

In looking at your allocation you have

ALLOCATE (Fc(-2:Nmax+3,5), Q1(-2:Nmax+3,5), ...)

When you allocate multi-dimensional arrays, it is advantageous to make sure the extent of the first dimension (in Fortran) is a multiple of the number of elements of that type that fit within a vector, so that every column starts on a vector boundary whenever the first one does. This may require adding padding to that dimension. Before you do this everywhere in your programs, it would be worthwhile to experiment with a loop such as the one above, with and without the padding. The peel code is fairly efficient; you may find that for large Nmax the peel overhead goes unnoticed.

Jim Dempsey
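Padding the first dimension could look like the sketch below. It assumes 8 doubles per 64 bytes and keeps the -2 lower bound from the allocation in the question. The module, function and subroutine names are hypothetical. The sketch only rounds up the extent; alignment of the first element still has to come from the align:64 attribute on the actual array declaration.

```fortran
module pad_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: vlen = 8   ! assumed: 8 doubles per 64 bytes
contains

  ! Smallest multiple of vlen that is greater than or equal to n (for n >= 1).
  pure integer function padded_extent(n) result(np)
    integer, intent(in) :: n
    np = ((n + vlen - 1) / vlen) * vlen
  end function padded_extent

  ! Allocate a(-2:ub, ncol) with ub >= nmax+3 and a first extent that is a
  ! multiple of vlen, so that all columns share the alignment of column 1.
  subroutine allocate_padded(a, nmax, ncol)
    real(dp), allocatable, intent(out) :: a(:,:)
    integer, intent(in) :: nmax, ncol
    integer :: lb, ub
    lb = -2
    ub = lb + padded_extent(nmax + 3 - lb + 1) - 1
    allocate (a(lb:ub, ncol))
    a = 0.0_dp   ! padding elements get a defined value
  end subroutine allocate_padded

end module pad_sketch
```

For Nmax = 100 the unpadded extent is 106 elements (848 bytes per column, not a multiple of 64). The padded extent is 112, so the array becomes a(-2:109, ncol) and each column is 896 bytes long.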

TimP · 05-12-2015

The VECTOR ALIGNED directive asserts that each rank of a multi-rank array is aligned, telling the compiler not to handle the case where it is not. If the loop is already reported as vectorized, you may not see an advantage from it.
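A minimal sketch of where the directive would go, assuming free-form source, a first dimension padded to a multiple of 8 doubles, and inner loops that always start at the first element of a column. The array and routine names are hypothetical. Compilers that do not recognize `!dir$` lines treat them as comments, so whether this helps has to be measured with the Intel compiler.

```fortran
module vector_aligned_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: vlen = 8   ! assumed: 8 doubles per 64-byte vector
  ! Same kind of alignment request as in the module declarations in the question.
  !dir$ attributes align:64 :: src, dst
  real(dp), allocatable :: src(:,:), dst(:,:)
contains

  ! Allocate both arrays with a first extent rounded up to a multiple of vlen,
  ! so that every column starts on a 64-byte boundary when the first one does.
  subroutine setup(n, ncol)
    integer, intent(in) :: n, ncol
    integer :: ld
    ld = ((n + vlen - 1) / vlen) * vlen
    if (allocated(src)) deallocate (src)
    if (allocated(dst)) deallocate (dst)
    allocate (src(ld, ncol), dst(ld, ncol))
    src = 1.0_dp
    dst = 0.0_dp
  end subroutine setup

  subroutine scale_all(factor)
    real(dp), intent(in) :: factor
    integer :: i, j
    do j = 1, size(src, 2)
      ! Promise to the compiler that every access in the next loop is aligned.
      ! If the promise is false, the program may fault at run time.
      !dir$ vector aligned
      do i = 1, size(src, 1)
        dst(i, j) = factor * src(i, j)
      end do
    end do
  end subroutine scale_all

end module vector_aligned_sketch
```

The assertion only holds here because the inner loop starts at the first element of each column and the column length is a multiple of 8. In the loop from the question, the start index N1-3 would also have to fall on a boundary for every call.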

mguy44 · 07-08-2015

Hello,

Thank you for all your advice.

I have written several versions of the application that differ in the order of the dimensions in the most important arrays. I have also changed the way their first dimension is declared, so that its extent is a multiple of 8 elements (double-precision reals).

I have now chosen the order that gives me the best performance, and everything works fine.

Regards,

Guy.