Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.
28555 Discussions

## "Best" code form for ensuring vectorization?

New Contributor I
2,667 Views

Hello,

I have a (probably basic and very easy) question corresponding to effective and efficient code vectorization in the Fortran compiler.

I have the following line of code, whose goal is to give the value of a vector a, in terms of 6 other vectors b, c, d, e, f and g (all vectors have a dimension of 256):

a(:) = b(:)*c(:) + d(:)*e(:) + f(:)*g(:)

When I build the code, I see that the optimization report mentions that this line of code has been vectorized.

My question is: should I pursue shaping my code differently to ensure the maximum possible vectorization speedup? For instance, would I get a better vectorized program if I defined vector a in 3 stages as follows:

a(:) = b(:)*c(:)

a(:) = a(:) + d(:)*e(:)

a(:) = a(:) + f(:)*g(:)

Also, does it make any difference if I write the command " a(:) = b(:)*c(:) ", instead of having an explicit do loop:

do i = 1,256

a(i) = b(i)*c(i) + d(i)*e(i) + f(i)*g(i)

end do

Many thanks in advance for the help.

24 Replies
Honored Contributor III
2,326 Views

Please don't use (:) to indicate a whole array. It complicates the compiler's analysis. Doctor, it hurts when I do this! - Doctor Fortran (stevelionel.com)

Honored Contributor III
2,301 Views

Steve, many releases ago, but after when realloc lhs was implemented, if you did not use (:) on the lhs .AND. lhs need not be reallocated, the compiler would generate code with always reallocate, thus a performance hit with the unnecessary reallocation and unnecessary memory fragmentation. The use of (:) should not complicate the compiler analysis as this is a clear indication to .NOT. reallocate.

Note, the compiler could (with appropriate runtime diagnostics enabled) generate an assert in the code to assure the size of the lhs matches the size of the expression on the rhs. This would have significantly less overhead as well as no impact on heap fragmentation (and no critical section on MT applicaitions).

Question: if an allocatable array is also attributed as TARGET, will this inhibit reallocation on lhs? (though it may permit first allocation on lhs).

IOW if one sees unnecessary reallocation, they could flag the array with TARGET to avoid the reallocation (though (:) would be preferable).

Jim Dempsey

New Contributor I
2,296 Views

Steve, Jim,

Thank you for the input.

The question I still had is: does it make any difference if I define the operation as "1-step":

a(:) = b(:)*c(:) + d(:)*e(:) + f(:)*g(:)

as opposed to 3-steps:

a(:) = b(:)*c(:)

a(:) = a(:) + d(:)*e(:)

a(:) = a(:) + f(:)*g(:)

I had another question. Right now, I am defining the size of the vectors through a PARAMETER definition:

integer, parameter:: Nvec = 256

real*8 a(Nvec), b(Nvec), c(Nvec), d(Nvec)

real*8 e(Nvec) f(Nvec), g(Nvec)

The reason is to be able (in the future) to try changing the size of the block that I can work on with the vectorized code and benchmark the performance (e.g., use 128- or 512- element block vectors etc.).  Is this coding approach "good" for the compiler (= does the compiler see the "fixed" size of the arrays during compilation time?). Or is there another way to define these "fixed-size arrays" to make the compiler's life easier?

Honored Contributor III
2,269 Views

Jim, you are correct that (:) on the LHS for an allocatable array will prevent the potential reallocation - I discussed that in the post I linked. (TARGET will not.) However, you are mistaken that the compiler "always" reallocated in older versions - it would always check the shapes, but it would be an error to force unnecessary reallocations.

Honored Contributor III
2,279 Views

With compiler optimizations enabled, it is very likely that it will generate the same vectorize code for:

``````1) a(:) = b(:)*c(:) + d(:)*e(:) + f(:)*g(:)
2) a = b*c + d*e + f*g
3) a(:) = b(:)*c(:)
a(:) = a(:) + d(:)*e(:)
a(:) = a(:) + f(:)*g(:)
4) do i=1, size(a)
a(i) = b(i)*c(i) + d(i)*e(i) + f(i)*g(i)
end do
5) do i=1, size(a)
a(i) = b(i)*c(i)
a(i) = a(i) + d(i)*e(i)
a(i) = a(i) + f(i)*g(i)
end do
6) a = b*c
a = a + d*e
a = a + f*g``````

This said, "very likely" might not be the case moreso for ifx as opposed to ifort as ifx is less mature than ifort.

formats 2, 4, 1, in that order, are more likely than 6, 5 and 3

Rather than speculate, write up some test code and run it with VTune with a resonable array size and iteration of operation on those array sizes.  Use /QxHOST

Also note, compiler optimizations are continuously improved. Thus what doesn't work today, may work later on.

You will have to consider writing code that prefers elegance or speed (with today's optimization capabilities).

With respect to optimizations, you may need to consider what you do with a(:) after it is filled in. It may be shortly later you will be performing a sqrt on it. In that case, it would be advisable to test placing the sqrt inside the loop verses outside the loop as well as double buffering tiles of the loop performing the dot_product along side the sqrt (one phase later).

Additionally, note, in the event that you are computing 1.0/sqrt(a) that most current CPUs have instructions that generate this result directly, but at differing precision. So this has to be factored in with your decision. IOW you have to consider:

elegance vs speed vs precision

Jim Dempsey

Honored Contributor III
2,265 Views

>>you are mistaken that the compiler "always" reallocated in older versions - it would always check the shapes

"always" was too strong of term for me to use. "It has been observed" would have been a better choice of words.

This forum has, on multiple occasions, had postings of unnecessary reallocation.

As to if this is currently true, I cannot say. In any event, the use of (:) on lhs (or either side for that matter) should not present a problem for the compiler writers.

There is one case where use of (:) on lhs and check size would be advisable. This would be where at least one (or more) of the arguments has assumed size. In this case the runtime check should be there (or available) as the sizes could not be determined at compile time.

Happy Easter

Jim Dempsey

Honored Contributor III
2,247 Views

With assumed-size, the size is not available at runtime either (absent some implementation-specific method of passing that info).

New Contributor I
2,061 Views

Steve and Jim,

I had a follow-up question regarding vectorization. I have been using the optimization diagnostics of the compiler, to gain insights on how "efficient" my pursuit of vectorization might be.

I have a set of vectors (each having very large dimension) defined (as ALLOCATABLE vectors) in a MODULE.

This module is then used in the subroutine that is meant to conduct vectorized computations. I basically define a PARAMETER Nvec1 = 128 (so the idea is to operate on chunks of vectors having a length of 128 - obviously, I can change this parameter at a later stage to see how an increase or decrease in the size of the vector chunks may affect performance).

I have the following declarations in my subroutine:

integer, parameter:: Nvec1 = 128

real*8 A1(Nvec1),A2(Nvec1),A3(Nvec1)
real*8 A4(Nvec1),A5(Nvec1),A6(Nvec1)

Now, I want to conduct computations using chunks of the much larger vectors (as I mentioned above).

The much larger vectors that I have in my MODULE are: dx1(), dy1(), dz1(), dx2(), dy2(), dz2(), nx(), ny(), nz()

By "much larger", I mean that they have much greater dimension than 128, and this dimension is computed at runtime; thus, as mentioned above, they are defined as ALLOCATABLE vectors in the MODULE.

Now, I have the following code in my subroutine:

Nvec = Nvec1

do J = 1,Nvec
A1(J) = dy1(iel+J) * nz(iel+J) - dz1(iel+J) * ny(iel+J)
A2(J) = dz1(iel+J) * nx(iel+J) - dx1(iel+J) * nz(iel+J)
A3(J) = dx1(iel+J) * ny(iel+J) - dy1(iel+J) * nx(iel+J)
A4(J) = dy2(iel+J) * nz(iel+J) - dz2(iel+J) * ny(iel+J)
A5(J) = dz2(iel+J) * nx(iel+J) - dx2(iel+J) * nz(iel+J)
A6(J) = dx2(iel+J) * ny(iel+J) - dy2(iel+J) * nx(iel+J)
end do !J

In the above code, iel is an integer number (which allows me to access specific target "chunks" from the much larger vectors).

The compiler optimization diagnostic gives me the following messages:

15300: LOOP WAS VECTORIZED (line:234, column:7)

15449: unmasked aligned unit stride stores: 6 (line:234, column:7)

It gives me other messages, too, but the one I am concerned about is the one referring to "unaligned unit stride loads". I am not sure why this would be happening, as the "large" vectors are "contiguously" stored in memory (assuming that this has to do with the issue).

I tried modifying my code as follows:

A1(1:Nvec) = dy1(iel+1:iel+Nvec) * nz(iel+1:iel+Nvec)  - dz1(iel+1:iel+Nvec) * ny(iel+1:iel+Nvec)

etc. (similar commands for A2, A3 etc.), and I still got similar diagnostics messages (about "unaligned unit stride loads").

My questions are as follows:

1) Will the specific "unaligned unit stride loads" issue turn out to have impact on the vectorization speedup (or it simply pertains to the "compilers prediction for speedup", and in actual computation it would not have impact)?

2) If the answer to question 1) is "yes", is there a way to resolve the issue? For example, to explicitly inform the compiler that the "large vectors" are indeed CONTIGUOUS in memory, when I declare them as allocatable arrays in the MODULE?

Moderator
2,046 Views

2 steps:

1) compile the code with -align array64byte
this takes care of making sure your data is in actuality aligned.

2) For each vector loop, if you have OMP SIMD directive (do you?) include the ALIGNED clause.  If you are not using OMP SIMD directives you should consider doing so. If you are not using OMP SIMD directives, you must tell the compiler AT EACH LOOP that the data is aligned.  If you don't want to learn OMP SIMD, you can read this.

CONTIGUOUS attribute can be used on dummy arguments IFF you can guarantee that the data is truly aligned.  That is, do not pass actual args that are array slices with strides.

THere is a really old article here on the Intel proprietary directives.

I would highly encourage you to read up on OpenMP SIMD and use that instead.  AND you do need compiler option -align array64byte.  You should be using that everywhere on all source files.  Read this older tutorial on OMP 4.5, chapter 3 on OMP SIMD.

Honored Contributor III
2,034 Views

The compiler (likely) doesn't have sufficient information as to the values of your offset iel are a multiple of 128.

You will have to address this in your code, and if that is insufficient, !DIR\$ the loop

!DIR\$ ASSUME (MOD(iel,128) == 1)

EDIT: OOPS Because +J starts at 1, use (MOD(iel,128) == 0)

This does require that iel is indeed a multiple of 128 biased by 1.

Note, if you intend to OpenMP this, then remove the iel+J and run J for the whole loop but use static scheduling with chunksize of 128 (or whatever you find is best).

Jim Dempsey

New Contributor I
2,025 Views

Ron and Jim, thank you for the help and advice.

Jim, I had a couple follow-up questions (and I apologize for them probably being very basic questions).

1. If I follow your approach, do I still need to use the -align arrray64byte option that Ron mentioned?
2. Is there a reference/documentation for me to consult for using OpenMP SIMD with FORTRAN? I believe Ron sent me a link to documentation which emphasizes C++.
3. Finally, is it possible for you to explain to me how an OpenMP SIMD (or openMP with static scheduling for a specific chunksize) would work for the specific code that I had? For instance, how would I have to modify the code for the following loop:

do J = 1,Nvec
A1(J) = dy1(iel+J) * nz(iel+J) - dz1(iel+J) * ny(iel+J)
end do !J

where A1() is a vector with Nvec values, while all vectors in the right-hand side have a much larger dimension.

Again, many thanks for all the advice.

Yannis

Honored Contributor III
2,007 Views

>>If I follow your approach, do I still need to use the -align arrray64byte option that Ron mentioned?

That will provide SIMD cache line alignment by default. In your case your (unstated) objective is to tile your data in larger 128 cells (unstated at to sizeof cell as to if it is 4 or 8 bytes).

!dir\$ attributes align: 1024 :: dy1, nz, dz1, ny

might be better (you may need to independently align the arrays).

However, tiling (to keep more data access in L1/L2 cache) generally works when there are two or more indices used in your expressions. Single thread code can benefit from tiling (as well as vectorization) as well as multiple threads working on different tiles.

With respect to multi-threading, in the perfect world (it ain't) all threads can do the same amount of work in the same amount of time. The easy form of tiling would be to evenly split the work amongst the threads. But this likely won't yield the best results. Until recently, other activities on the system (thread preemption) as well as adverse false sharing causing cache line eviction would cause additional latencies in completing a tile, the newer CPU's not only have Turbo Boost and thermal throttling but also "Performance Cores" and "Efficiency Cores" meaning average throughput per hardware thread is not equal.

In this (these) circumstance(s), a better strategy might be more tiles than threads together with dynamic scheduling.  As to what constitutes more, this is a tuning exercise.

The strategy here, is to try to keep all threads working, regardless of how fast or slow they work .AND. that they finish nearly at the same time. (no thread is waiting around when there is nothing left to do). Think of this as like having a bunch of kids, ages 3 to 20 shucking corn. You don't give them all the same number of ears of corn, they each fetch from the basket and shuck as best as they can.

Jim Dempsey

Employee
1,996 Views

There's a series of articles on vectorization starting here. Don't be afraid of the dates. Vectorization has been available in Intel Fortran compilers since 2015.

New Contributor I
1,973 Views

Thank you Barbara and Jim.

Jim: yes, I understand the general motivation regarding openMP use (i.e., having the threads working on equal loads, and avoid having idle threads etc.); my only questions pertain to using OpenMP together with SIMD for vectorization.

"Note, if you intend to OpenMP this, then remove the iel+J and run J for the whole loop but use static scheduling with chunksize of 128 (or whatever you find is best)."

Would it be possible for you to explain how this would be applied to the following loop, which involves operations with 128-element vectors and much larger vectors?

do J = 1,Nvec
A1(J) = dy1(iel+J) * nz(iel+J) - dz1(iel+J) * ny(iel+J)
end do !J

What I cannot understand is how I would "remove the iel+J and run J for the whole loop". Again, dy1, nz, dz1 and ny may have much larger dimension (let's thay 100,000) than A1 (whose dimension is fixed and equal to 128). So, if I remove the "iel+J" and try to run for the whole loop, how will the compiler know which specific (contiguous) block of 128 elements of the large vectors it has to use for finding the elements of A1()? I need to add/emphasize that A1() is an "intermediate" vector, which is used in subsequent computations in the routine.

I could define an A1() with the same dimension as the "large vectors" (at which case I can apply everything you said), but I would imagine that this will significantly increase the memory use, correct? The dimension of the "large" vectors may become as large as 100,000,000 (this is the "most exotic" upper-bound for the applications that I have in mind). So, if I need to define a temporary vector with 100 million values for this routine, I would require something close to 1GB of memory just for this vector. If on top of that we add the fact that I actually have 6 vectors A1(), A2(), ..., A6() which are used in subsequent computations.

Again, many thanks for all the help.

Honored Contributor III
1,944 Views

>>how I would "remove the iel+J and run J for the whole loop".

Well, you haven't shown the whole loop have you?

Something like:

``````do iel = 0, nThingies, 128
(code)
do J = 1,Nvec
A1(J) = dy1(iel+J) * nz(iel+J) - dz1(iel+J) * ny(iel+J)
end do !J
(code)
end do ! iel``````

The above is mere speculation.

You might want to consider:

``````integer :: NouterVec, Nvec ! read in somewhere
real, allocatable, dimension(:,:) :: dy1, nz, dz12, ny ! (Nvec,NouterVec)
real, allocatable, dimension(:) :: A
... ! read in array sizes
allocate(dy1(Nvec,NouterVec))
allocate(nz(Nvec,NouterVec))
allocate(dz12(Nvec,NouterVec))
allocate(ny(Nvec,NouterVec))
allocate(A(Nvec))
do i=1,NouterVec
(code)
A = dy1(:,i) * nz(:,i) - dz(:,i) * nz(:,i) ! fill A(1:Nvec)
(code)
end do``````

>> The dimension of the "large" vectors may become as large as 100,000,000 ...

>> If on top of that we add the fact that I actually have 6 vectors A1(), A2(), ..., A6()

So the real story is you have 100,000,000 or so collections of data points, each collection has 128 data points, you wish to investigate some interactions amongst six of the collections, which may be: a) together,  b) disbursed, c) permuted within the connection of collections.

``````... ! allocate, read in
... outer control loop defining I1, I2, I3, I4, I5, I6
A1 = dy1(:,I1) * nz(:,I1) - dz(:,I1) * nz(:,I1)
...
A6 = dy1(:,I6) * nz(:,I6) - dz(:,I6) * nz(:,I6)
... manipulate A1 through A6
... end outer loop``````

Please note with respect to vectors.

There are two vectors to be aware of:

a) Mathmatical vector (a list of properties)

b) The CPU core's SIMD vector (a small fixed set of like values to be computed with the same instructions).

The nature of your problem will define the Mathmatical vector. Your platform (CPU ISA) will define its SIMD vector. You, as the programmer, have to decide as to if you want to aid the compiler in improving performance through one or more of:

o data alignment (to cache line and/or page boundry)

o padding mathmatical vectors such that adjacent mathmatical vectors align to cache line

o tiling (partitioning work, e.g. A1, ... A6) to improve cach hit ratios

The usual art of programming.

Jim Dempsey

New Contributor I
1,933 Views

Thanks Jim.

Yes, I agree I did not share my code, but what had happened was that I was confused by the statement "remove the iel+J and run J" that you had mentioned, with regards to the SPECIFIC loop I sent you.

I simply thought that if we have the loop in question:

do J = 1,Nvec

A1(J) = dy1(iel+J) * nz(iel+J) - dz1(iel+J) * ny(iel+J)

end do !J

then the compiler would be able to understand that:

i) iel is a FIXED number every time this loop is running (no matter whether iel is set up by an external loop like the one you speculated or even a DO WHILE... loop),

ii) the elements of the "large arrays" dy1(), nz(), dz1() and ny() used in the specific loop are all contiguously stored in memory (I will henceforth try to use the terminology "arrays" to avoid confusion with what we mean by SIMD vector),

iii) the loop can be safely and "aggressively" vectorized.

At any rate, I think I will be able to take it from here and look into the problem. I agree with all your statements regarding general programming principles, and I certainly was not asking for a "general recipe which will be optimal for any possible problem at hand". My questions were aligned towards ensuring that I get the most out of what the compiler can automatically do in terms of vectorization.

Again, thanks for all the help.

Honored Contributor III
1,927 Views

FWIW, the newer CPUs "generally" perform unaligned loads as fast as aligned loads when data is sequentially read as in your Nvec loop. You might expect a couple of clock ticks at the start of the loop if that unaligned load is loading data across two cache lines.

Also, while the vectorization reports are good starting point, using VTune will expose any issues, as well as showing you where to focus your efforts.

Good luck in your performance optimizations.

Jim Dempsey

New Contributor I
1,778 Views
Jim and Ron,

I tried to follow your suggestions; however, my compiler still mentions lack of alignment.

I attach my code here (basically, a module where I define all the "larger" arrays, and then a routine which uses these larger arrays, working in pieces of 128 contiguous elements each).

!*************************************************************************************************  begin code

module modulevar1
save

real*8, allocatable, dimension(:):: nx,ny,nz
real*8, allocatable, dimension(:):: dx1,dx2,dy1,dy2,dz1,dz2
real*8, allocatable, dimension(:):: sN
integer dim1 ! dim1 is the dimension of the large arrays, which is initialized elsewhere in my program (not shown here)

!dir\$ attributes align:64 :: nx
!dir\$ attributes align:64 :: ny
!dir\$ attributes align:64 :: nz
!dir\$ attributes align:64 :: sN
!dir\$ attributes align:64 :: dx1
!dir\$ attributes align:64 :: dy1
!dir\$ attributes align:64 :: dz1
!dir\$ attributes align:64 :: dx2
!dir\$ attributes align:64 :: dy2
!dir\$ attributes align:64 :: dz2

end module modulevar1 !==========================================================

subroutine sub1
use modulevar1
implicit none

integer, parameter:: Nvec1 = 128

real*8 A1(Nvec1),A2(Nvec1),A3(Nvec1)
real*8 A4(Nvec1),A5(Nvec1),A6(Nvec1)

real*8 f(Nvec1,12)

!dir\$ attributes align:64 :: A1
!dir\$ attributes align:64 :: A2
!dir\$ attributes align:64 :: A3
!dir\$ attributes align:64 :: A4
!dir\$ attributes align:64 :: A5
!dir\$ attributes align:64 :: A6
!dir\$ attributes align:64 :: f

integer iel, icounter, igp, i1, i2, J, Nvec, ithr1, iel1, in1, k1
!DIR\$ ASSUME_ALIGNED A1: 64
!DIR\$ ASSUME_ALIGNED A2: 64
!DIR\$ ASSUME_ALIGNED A3: 64
!DIR\$ ASSUME_ALIGNED A4: 64
!DIR\$ ASSUME_ALIGNED A5: 64
!DIR\$ ASSUME_ALIGNED A6: 64
!DIR\$ ASSUME_ALIGNED nx: 64
!DIR\$ ASSUME_ALIGNED ny: 64
!DIR\$ ASSUME_ALIGNED nz: 64
!DIR\$ ASSUME_ALIGNED sN: 64
!DIR\$ ASSUME_ALIGNED dx1: 64
!DIR\$ ASSUME_ALIGNED dy1: 64
!DIR\$ ASSUME_ALIGNED dz1: 64
!DIR\$ ASSUME_ALIGNED dx2: 64
!DIR\$ ASSUME_ALIGNED dy2: 64
!DIR\$ ASSUME_ALIGNED dz2: 64
!DIR\$ ASSUME_ALIGNED f: 64

!DIR\$ ASSUME (mod(iel,Nvec1) .eq. 0)

iel = 0
do while(iel.lt.dim1)
Nvec = sim1 - iel
if(Nvec.gt.Nvec1) Nvec = Nvec1

!\$OMP PARALLEL DO SIMD
do J = 1,Nvec
A1(J) = dy1(iel+J)*nz(iel+J)-dz1(iel+J)*ny(iel+J)
A2(J) = dz1(iel+J)*nx(iel+J)-dx1(iel+J)*nz(iel+J)
A3(J) = dx1(iel+J)*ny(iel+J)-dy1(iel+J)*nx(iel+J)
A4(J) = dy2(iel+J)*nz(iel+J)-dz2(iel+J)*ny(iel+J)
A5(J) = dz2(iel+J)*nx(iel+J)-dx2(iel+J)*nz(iel+J)
A6(J) = dx2(iel+J)*ny(iel+J)-dy2(iel+J)*nx(iel+J)
end do !J
!\$OMP END PARALLEL DO SIMD

f = 0.d0

!\$OMP PARALLEL DO SIMD
do J = 1,Nvec
f(J,1) = f(J,1) - nx(iel+J)*SN(iel+J)
f(J,2) = f(J,2) - ny(iel+J)*SN(iel+J)
f(J,3) = f(J,3) - nz(iel+J)*SN(iel+J)
f(J,4) = f(J,4) - A1(J)*SN(iel+J)
f(J,5) = f(J,5) - A2(J)*SN(iel+J)
f(J,6) = f(J,6) - A3(J)*SN(iel+J)
f(J,7) = f(J,7) + nx(iel+J)*SN(iel+J)
f(J,8) = f(J,8) + ny(iel+J)*SN(iel+J)
f(J,9) = f(J,9) + nz(iel+J)*SN(iel+J)
f(J,10) = f(J,10) + A4(J)*SN(iel+J)
f(J,11) = f(J,11) + A5(J)*SN(iel+J)
f(J,12) = f(J,12) + A6(J)*SN(iel+J)
end do !J
!\$OMP END PARALLEL DO SIMD

iel = iel + Nvec

end do !do while

return
end subroutine sub1

!****************************************************************************************************** end of code

When I build my application with the diagnostic options that Ron suggested, I see that all the "smaller arrays" (the ones which are defined locally in my routine and have a dimension of Nvec1 = 128) are indeed mentioned as aligned.
For the "larger arrays" (the ones mentioned through components (ien+J) in my loops) I get messages like:

15389: vectorization support: reference DY1(IEL+J) has unaligned access

(similar messages are printed for all other "larger arrays" that I am trying to use in my routine, such as dx1(), dx2(), nx(), ny(), ... - the compiler diagnostics mention that all the locally defined arrays in the routine, including the 2D array f(:,:), have aligned access).

In light of the above, I wanted to ask whether I have a mistake somewhere in my code, which prevents the compiler from recognizing aligned access for the larger arrays.

Moderator
1,900 Views

you should try ifort and these options to get an optimization report:

-qopt-report=5 -qopt-report-phase=vec

then look for the <file>.optrpt file.  It will tell you what the compiler thinks of your loops and vectorization.

New Contributor I
1,781 Views

Jim, Ron,

I tried to implement all your suggestions, but my compiler still complains for lack of alignment. I provide my detailed code (which contains a module with definition of the "large arrays", and then a subroutine wherein I pursue vectorization:

!*************************************************************************************************************************************

module modulevar1                       !=========================================================

save

real*8, allocatable, dimension(:):: nx,ny,nz
real*8, allocatable, dimension(:):: dx1,dx2,dy1,dy2,dz1,dz2
real*8, allocatable, dimension(:):: sN

integer dim1       ! this is the dimension of the large arrays

!dir\$ attributes align:64 :: nx
!dir\$ attributes align:64 :: ny
!dir\$ attributes align:64 :: nz
!dir\$ attributes align:64 :: sN
!dir\$ attributes align:64 :: dx1
!dir\$ attributes align:64 :: dy1
!dir\$ attributes align:64 :: dz1
!dir\$ attributes align:64 :: dx2
!dir\$ attributes align:64 :: dy2
!dir\$ attributes align:64 :: dz2

end module modulevar1                !===================================================================

subroutine sub1                 !!=============================================== this is the routine

use modulevar1

implicit none

integer, parameter:: Nvec1 = 128

real*8 A1(Nvec1),A2(Nvec1),A3(Nvec1)
real*8 A4(Nvec1),A5(Nvec1),A6(Nvec1)

real*8 f(Nvec1,12)

!dir\$ attributes align:64 :: A1
!dir\$ attributes align:64 :: A2
!dir\$ attributes align:64 :: A3
!dir\$ attributes align:64 :: A4
!dir\$ attributes align:64 :: A5
!dir\$ attributes align:64 :: A6
!dir\$ attributes align:64 :: f

integer iel, icounter, igp, i1, i2, J, Nvec, ithr1, iel1, in1, k1

!DIR\$ ASSUME_ALIGNED A1: 64
!DIR\$ ASSUME_ALIGNED A2: 64
!DIR\$ ASSUME_ALIGNED A3: 64
!DIR\$ ASSUME_ALIGNED A4: 64
!DIR\$ ASSUME_ALIGNED A5: 64
!DIR\$ ASSUME_ALIGNED A6: 64
!DIR\$ ASSUME_ALIGNED nx: 64
!DIR\$ ASSUME_ALIGNED ny: 64
!DIR\$ ASSUME_ALIGNED nz: 64
!DIR\$ ASSUME_ALIGNED sN: 64
!DIR\$ ASSUME_ALIGNED dx1: 64
!DIR\$ ASSUME_ALIGNED dy1: 64
!DIR\$ ASSUME_ALIGNED dz1: 64
!DIR\$ ASSUME_ALIGNED dx2: 64
!DIR\$ ASSUME_ALIGNED dy2: 64
!DIR\$ ASSUME_ALIGNED dz2: 64
!DIR\$ ASSUME_ALIGNED f: 64

!DIR\$ ASSUME (mod(iel,Nvec1) .eq. 0)

iel = 0

do while(iel.lt.dim1)     ! note: dim1 is the dimension of the large arrays, i.e. dx1, dy1, nx, etc.
Nvec = dim1 - iel

if(Nvec.gt.Nvec1) Nvec = Nvec1

!\$OMP PARALLEL DO SIMD
do J = 1,Nvec
A1(J) = dy1(iel+J)*nz(iel+J)-dz1(iel+J)*ny(iel+J)
A2(J) = dz1(iel+J)*nx(iel+J)-dx1(iel+J)*nz(iel+J)
A3(J) = dx1(iel+J)*ny(iel+J)-dy1(iel+J)*nx(iel+J)
A4(J) = dy2(iel+J)*nz(iel+J)-dz2(iel+J)*ny(iel+J)
A5(J) = dz2(iel+J)*nx(iel+J)-dx2(iel+J)*nz(iel+J)
A6(J) = dx2(iel+J)*ny(iel+J)-dy2(iel+J)*nx(iel+J)
end do !J
!\$OMP END PARALLEL DO SIMD

f = 0.d0

!\$OMP PARALLEL DO SIMD
do J = 1,Nvec
f(J,1) = f(J,1) - nx(iel+J)*SN(iel+J)
f(J,2) = f(J,2) - ny(iel+J)*SN(iel+J)
f(J,3) = f(J,3) - nz(iel+J)*SN(iel+J)
f(J,4) = f(J,4) - A1(J)*SN(iel+J)
f(J,5) = f(J,5) - A2(J)*SN(iel+J)
f(J,6) = f(J,6) - A3(J)*SN(iel+J)
f(J,7) = f(J,7) + nx(iel+J)*SN(iel+J)
f(J,8) = f(J,8) + ny(iel+J)*SN(iel+J)
f(J,9) = f(J,9) + nz(iel+J)*SN(iel+J)
f(J,10) = f(J,10) + A4(J)*SN(iel+J)
f(J,11) = f(J,11) + A5(J)*SN(iel+J)
f(J,12) = f(J,12) + A6(J)*SN(iel+J)
end do !J
!\$OMP END PARALLEL DO SIMD

iel = iel + Nvec

end do ! do while

return

end subroutine sub1                        !============================= end of routine

!**************************************************************************************************************** end of code

When I run the compiler diagnostics in accordance with Ron's suggestion, I see that all the "small arrays", i.e. the arrays that are defined in the subroutine and have dimension equal to Nvec1 = 128, are mentioned as being aligned in the optimization report. The same applies for the 2-dimensional array f(:,:) (if you notice my code, you will see that each line including f(:,:) is set up so that I access contiguous elements of the specific array).

For all the larger arrays it seems that I do not get alignment; specifically, I get messages like:

15389: vectorization support: reference DY1(IEL+J) has unaligned access

(I obtain similar messages for all other "large arrays which are indexed through (IEL+J), that is, dx1, dx2, dy2, nx, ny, etc.)

Is there anything that I could change in the code that I sent you?

If not, then what I will do is create additional "temporary arrays", which will be taking "slices of the large arrays", and then will be used in the actual loops in lieu of the large arrays. For example, I will have an array nx1(Nvec1) (= an array with 128 components), and inside my do while.... loop, I will be having something like:

do J=1,Nvec1

nx1(J) = nx(iel+J)

end do !J

and then, in all my current loops that involve operations with nx(iel+J), I will simply be using nx1(J) instead (which is bound to be aligned).

Again, thank you for all your help.