ia64 "unaligned access" messages with SMP

burgel · 09-15-2006

I'm getting "unaligned access" messages on the console on an Itanium2 system (ifort 9.0 or ifort 9.1.033) from one particular subroutine, but only when it is compiled with -parallel or -openmp (for hand-coded directives). I get the messages whether I run with one thread or several. (I get very good speedup with two threads on a different system with comparable memory bandwidth, so I think the code itself is not the issue.) Is there a performance impact associated with these messages?

The "unaligned access" messages do NOT appear with a normal (unthreaded) compile, which runs this subroutine about 10 times faster (i.e., the -openmp build is about 10 times slower!). The whole code is compiled with "-align all".

The compile line for the particular subroutine is

ifort -openmp -ftz -align all -O3 -I./include -I/usr/local/netcdf/include -c advect.f90

(The unthreaded build uses the same line without -openmp.)

Any ideas?

-- Ted

Steven_L_Intel1 · 09-16-2006

There could be a performance impact. I'd ask that you send a buildable and runnable example to Intel Premier Support.

burgel · 09-20-2006

Definitely a performance impact. I also tested the code on a new MacPro with the ifort compiler, and it runs fine in parallel, so I guess this is an ia64-specific issue.

I am preparing a simplified code base to submit to Intel Support through our IT person (they'll only let us have one contact, so everything has to go through one person).

burgel · 09-26-2006

I figured out that part of the problem has to do with index expressions (such as nx-3+is) evaluated inside the parallel loop. For example, this loop gave unaligned access messages:

!$OMP PARALLEL DO DEFAULT(SHARED), PRIVATE(i,j,k,vv,im1,dir)
DO k = kmn,kmx
  DO j = jmn,jmx
    DO i = imn,imx

      vv = 0.5*(u(i,j,k) + u(i-is,j-js,k-ks))
      im1 = max(i-1,1)
      dir = sign(1.0,vv)

      IF( i .ge. 4 .and. i .le. nx-3+is ) THEN

        fx(i,j,k) = vv * ( f50 * (s(i, j,k) + s(i-1,j,k)) &
                         - f51 * (s(i+1,j,k) + s(i-2,j,k)) &
                         + f52 * (s(i+2,j,k) + s(i-3,j,k)) &
                         - f52 * (s(i+2,j,k) - s(i-3,j,k) &
                                - 5.0 * (s(i+1,j,k) - s(i-2,j,k)) &
                                + 10. * (s(i, j,k) - s(i-1,j,k)))*dir )

      ELSEIF( i .eq. 3 .or. i .eq. nx-2+is ) THEN

        fx(i,j,k) = vv * ( f30 * (s(i,j,k) + s(i-1,j,k)) &
                         - f31 * (s(i+1,j,k) + s(i-2,j,k)) &
                         + f31 * (s(i+1,j,k) - s(i-2,j,k) &
                                - 3.0 * (s(i,j,k)-s(i-1,j,k)))*dir )

      ELSE

        fx(i,j,k) = vv * 0.5 * (s(i,j,k) + s(im1,j,k))

      ENDIF

    ENDDO
  ENDDO
ENDDO

ENDIF

But this version, with the limits computed into i3, i2, and i1 before the parallel loop, does not:

i3 = nx-3+is
i2 = nx-2+is
i1 = nx-1+is
!$OMP PARALLEL DO DEFAULT(SHARED), PRIVATE(i,j,k,vv,im1,dir)
DO k = kmn,kmx
  DO j = jmn,jmx
    DO i = imn,imx

      vv = 0.5*(u(i,j,k) + u(i-is,j-js,k-ks))
      im1 = max(i-1,1)
      dir = sign(1.0,vv)

      IF( i .ge. 4 .and. i .le. i3 ) THEN

        fx(i,j,k) = vv * ( f50 * (s(i, j,k) + s(i-1,j,k)) &
                         - f51 * (s(i+1,j,k) + s(i-2,j,k)) &
                         + f52 * (s(i+2,j,k) + s(i-3,j,k)) &
                         - f52 * (s(i+2,j,k) - s(i-3,j,k) &
                                - 5.0 * (s(i+1,j,k) - s(i-2,j,k)) &
                                + 10. * (s(i, j,k) - s(i-1,j,k)))*dir )

      ELSEIF( i .eq. 3 .or. i .eq. i2 ) THEN

        fx(i,j,k) = vv * ( f30 * (s(i,j,k) + s(i-1,j,k)) &
                         - f31 * (s(i+1,j,k) + s(i-2,j,k)) &
                         + f31 * (s(i+1,j,k) - s(i-2,j,k) &
                                - 3.0 * (s(i,j,k)-s(i-1,j,k)))*dir )

      ELSEIF( i .eq. 2 .or. i .eq. i1 ) THEN

        fx(i,j,k) = vv * 0.5 * (s(i,j,k) + s(im1,j,k))

      ELSE

        fx(i,j,k) = 0.0

      ENDIF

    ENDDO
  ENDDO
ENDDO

ENDIF

There are some other loops that apparently have additional problems, so I'm still planning to submit code to Intel Premier Support, but I thought I'd take another crack at it first.

I had similar problems with loop limits that used Max or Min functions, such as "DO i=1,Min(nx-1+is,ix+1)", which ifort on Itanium couldn't seem to handle. But if I first set a temporary value, imx = Min(nx-1+is,ix+1), and then use DO i=1,imx, it works. Bizarre!
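
For reference, here is a minimal standalone sketch of that loop-limit workaround. The program name, the counters, and the values of nx, is, and ix are made up for illustration; the real ones come from the model setup. Both loops should run the same number of iterations, since Fortran fixes the trip count of a DO loop before the first iteration, so the only difference is in what the compiler is handed. This sketch has no OpenMP directives and is not expected to reproduce the messages by itself.

```fortran
! Sketch only: nx, is, ix are hypothetical values, not the real model sizes.
program loop_bound_sketch
  implicit none
  integer :: nx, is, ix
  integer :: i, imx
  integer :: count_inline, count_hoisted

  nx = 10
  is = 1
  ix = 7

  ! Form that gave trouble: Min() written directly in the DO statement.
  count_inline = 0
  do i = 1, min(nx-1+is, ix+1)
     count_inline = count_inline + 1
  end do

  ! Workaround: evaluate the limit once into a plain integer, then loop.
  imx = min(nx-1+is, ix+1)
  count_hoisted = 0
  do i = 1, imx
     count_hoisted = count_hoisted + 1
  end do

  ! The two counts should be identical.
  print *, count_inline, count_hoisted
end program loop_bound_sketch
```

The same pattern applies to the IF tests in the loop above: i3, i2, and i1 are computed once outside the parallel loop instead of as nx-3+is and so on inside it.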