Hi! I am learning Fortran to replace a performance-critical part of some Python code. While experimenting with array operations, I noticed that IFX and IFORT have very different behaviour with regard to alignment, and I cannot always force IFX to assume alignment. I would be happy to get advice on how to proceed: shall I switch to IFORT for now, or are there other alignment options available? The code will be used in a well-controlled environment, so I think I can ensure that the data will be aligned.
I used the following commands to compile with "Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2024.0.2 Build 20231213":
ifort /fast /align:array64byte /QxCOMMON-AVX512 /Qopt-report=3 example.f90
ifx /fast /align:array64byte /QxCOMMON-AVX512 /Qopt-report=3 example.f90
Here is a simple program which I used for testing:
program example
   implicit none
   integer, parameter :: dp = selected_real_kind(15)
   integer :: n, i
   real(dp), allocatable :: a(:)
   write(*,*) "enter number of elements in array a"
   read(*,*) n
   allocate(a(1:n))
   call random_number(a(1:size(a)))
   print *, LOC(a), MOD(LOC(a), 64) ! Check if the address of a is aligned
   do i = 1, size(a)
      a(i) = exp(a(i))
   end do
   a = a + 1.0
   a = log(a)
end program example
IFORT produces only aligned operations:
remark #15448: unmasked aligned unit stride loads: 1
remark #15449: unmasked aligned unit stride stores: 1
and IFX does not:
remark #15450: unmasked unaligned unit stride loads: 1
remark #15451: unmasked unaligned unit stride stores: 1
I then tried to change the IFX behaviour with additional hints:
program example
   implicit none
   integer, parameter :: dp = selected_real_kind(15)
   integer :: n, i
   real(dp), allocatable :: a(:)
   write(*,*) "enter number of elements in array a"
   read(*,*) n
   allocate(a(1:n))
   !DIR$ VECTOR ALIGNED
   !DIR$ ASSUME_ALIGNED a:64
   call random_number(a(1:size(a)))
   print *, LOC(a), MOD(LOC(a), 64) ! Check if the address of a is aligned
   !DIR$ VECTOR ALIGNED
   !DIR$ ASSUME_ALIGNED a:64
   do i = 1, size(a)
      a(i) = exp(a(i))
   end do
   !DIR$ VECTOR ALIGNED
   !DIR$ ASSUME_ALIGNED a:64
   a = a + 1.0
   !DIR$ VECTOR ALIGNED
   !DIR$ ASSUME_ALIGNED a:64
   a = log(a)
end program example
This helped for the random-number generation and for the explicit do loop, but the addition of a scalar to the array and log(a) remained unaligned operations.
We have always recommended that alignment is a two-step process:
1) Align the data by using -align arrayXXbyte for the compilation. This is needed for the compilation units (source files) where the allocations of your arrays occur. Generally you want to use this option on all source files, as it also aligns automatic arrays, array temporaries, etc. I apply it to all sources.
2) Alert the compiler at the sites where those arrays are used. This is necessary because the use sites may be in different compilation units from the allocation sites; that is, the arrays may be used in a separate source file from where they were allocated, and the compiler has no way to know whether or not those allocations are aligned.
For the second step you are using our proprietary directives. Your code will be more future-proof if you use the OMP syntax instead:
!$omp simd aligned (a:64)
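As a minimal sketch of how that could look on your loops (I believe the OpenMP SIMD directives are honored by default at -O2 and above; if not, add -qopenmp-simd, or /Qopenmp-simd on Windows). Note that array-syntax statements give a loop directive nothing to attach to, so the sketch also rewrites them as explicit loops:
!$omp simd aligned(a:64)
do i = 1, size(a)
   a(i) = exp(a(i))
end do
!$omp simd aligned(a:64)
do i = 1, size(a)
   a(i) = log(a(i) + 1.0_dp)   ! folds a = a + 1.0 and a = log(a) into one pass
end do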
Now, that said, it is important to remember a few things. First, although ifort and ifx use the same front end to parse your program, once the internal representation of your program (an AST) is created, it is handed off to the optimizers and code generators, and these are completely different in ifx and ifort. So you should not expect the vectorization or other optimizations to match.
I also noticed that you do a lot of operations on array 'a' but never use the results. ifx uses the LLVM optimizer, which has a very aggressive dead-code-elimination pass. Since you never print or reuse 'a' after all these expressions, it may decide that your whole program is a no-op and eliminate it! Add a 'print *, a' at the end of the program to prevent this.
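A cheap variation on that sink (just a sketch, not the only way): summing the array keeps every element live without flooding the console:
a = log(a)
print *, sum(a)   ! consuming the results prevents dead-code elimination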
As for the code generator choosing aligned vs. unaligned accesses: that is entirely at the discretion of the optimizer. It is using unit-stride accesses, and on modern processors there is little to no difference between aligned and unaligned loads/stores at stride 1. The code generator may simply have decided "no difference" and used the unaligned forms.
Next, you have -xCORE-AVX512. You know that is just a SUGGESTION to both compiler optimizers, right? It is not an imperative. Did you notice this:
remark #15305: vectorization support: vector length 4
This shows that although you SUGGESTED AVX512, it is using AVX2 instead. In many cases the overhead of loading larger vector registers is not profitable for low-trip-count loops. Also, CORE-AVX512 could mean an older processor where the power envelope of AVX512 causes the clock speed to throttle back, so the code would run slower than AVX2. The optimizer did what it thought would be better, but thanks for the suggestion that it ignored. If you want to push for AVX512, add
-mprefer-vector-width=512
That tells the optimizer that you are serious about wanting AVX512. It can still ignore you if it can determine the trip count, or it could multi-version the loop for different AVX levels and branch at runtime to the appropriate AVX code based on the actual trip count.
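For example, using the Linux option spellings (a sketch; on Windows the corresponding /Q-prefixed forms from your original command line apply):
ifx -O3 -align array64byte -xCORE-AVX512 -mprefer-vector-width=512 -qopt-report=3 example.f90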
With that, the opt-report shows:
remark #15305: vectorization support: vector length 8
at least for the explicit do loop, but not for this statement:
call random_number(a(1:size(a)))
A lot could be going on here. There is no way for the optimizer to know size(a) at compile time, and it may or may not inline random_number(), so there are many choices for how to handle this. It seems that, given the complexity of the random-number computation, it decided that the AVX512 version of random_number would be slower than the AVX2 version in the general case.
size(a) is a runtime call; the compiler cannot make any prediction about the size of 'a'. You will note that it creates remainder loops in all cases, since it has no way to know whether size(a) is 1 or 1000000000. It will branch directly to the remainder loop if size(a) is less than one vector length. If size(a) is large enough, the kernel loop runs in chunks of the vector length, and any leftover elements are handled by the remainder loop. Using a compile-time constant for the iteration count avoids all of this, BUT hard-coding parameters for the array sizes is really very bad coding practice. Still, a loop 'do i = 1, n' where n is a compile-time parameter cannot be beaten for efficiency. <sigh> Tradeoffs.
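A minimal sketch of that tradeoff (n here is a hypothetical compile-time constant, chosen purely for illustration):
program fixed_size
   implicit none
   integer, parameter :: dp = selected_real_kind(15)
   integer, parameter :: n = 1048576   ! trip count known at compile time
   real(dp), allocatable :: a(:)
   integer :: i
   allocate(a(n))
   call random_number(a)
   do i = 1, n            ! compiler can plan the kernel/remainder split up front
      a(i) = exp(a(i))
   end do
   print *, sum(a)        ! keep the results live
end program fixed_size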
What this really comes down to is that optimization is complex and often unpredictable at a high level. Heuristics also change over time. Take the aligned vs. unaligned choice: on older processors it made a huge difference; on today's processors, for unit stride, there is no difference and it doesn't matter. In the past, AVX2 often ran faster than AVX512; on today's processors the AVX512 overheads have been minimized as much as possible. For this reason, if you KNOW the target architecture, target it explicitly instead of using CORE-AVX512, e.g. -xSAPPHIRERAPIDS -mprefer-vector-width=512. Knowing the exact target lets the optimizer/vectorizer/code generator do the best it can for that target. CORE-AVX512 is generic and covers older processors too, so the compiler takes the safest subset of instructions and uses worst-case performance heuristics.
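So if, say, the well-controlled environment you mention were all Sapphire Rapids machines (an assumption just for illustration), the compile line would become something like:
ifx -O3 -align array64byte -xSAPPHIRERAPIDS -mprefer-vector-width=512 -qopt-report=3 example.f90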
Unfortunately, there is no easy "cookbook" for all of this. There are interplays between alignment, vectorization, target-processor instruction-set extensions, cost models for loads/stores and cache architectures, speed-stepping differences, etc.
Your code confuses me. After the allocate of "a" I would not expect the "starting boundary" of a to change; are you saying it does?
Ah, OK, I get it, thanks. I think IFX is lagging IFORT on code optimisation; getting it working correctly is first base, and that is getting there.
I guess it is for Intel staff to comment on your issue.
Hi Ron! Thank you for such a comprehensive answer!
What do you think of using !DIR$ VECTOR ALIGNED instead of !$OMP SIMD ALIGNED (a:64)?
As for vector length 4 vs. 8, I guess there is no way to know which will be better for me without timing the real application.