Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Optimized code and intrinsic functions for dense matrix/vector multiplication in Fortran

Ioannis_K_
New Contributor I

Hello,

 

I have a dense m by n matrix A, and two vectors, x and y (with dimensions n and m, respectively).

The value of m is between 10K and 10 Million, while the value of n is between 100 and 1000.

 

I want to compute the products A*x and transpose(A)*y.

What would be the best way to code this in Intel Fortran? Specifically, I have the following questions:

 

  1. I was thinking that the multiplication A*x must be something like this:

    z = 0      ! z is an m-component vector
    do ii = 1,n
       z(:) = z(:) + A(:,ii)*x(ii)
    end do !ii

     

    Is this the best way to program the do loop, and ensure that it will be vectorized?
  2. Also, can I use an OMP PARALLEL DO loop for the multiplication? If yes, does using OMP PARALLEL DO affect the best way to program the loop?
  3. What is the best way to program transpose(A)*y?
  4. Are there any MKL routines that I could use for the above operations, so that I can be sure the multiplication is as optimized (in terms of speed) as possible?

 

Many thanks in advance for the help.

 

 

jimdempseyatthecove
Honored Contributor III

How big is z(:)?

That statement should vectorize.

However, the expression on the right may create a temporary, generally on the stack, and if z(:) has a size of 10 million, you may experience a stack overflow.

 

If stack overflow occurs, create an inner loop

do jj=1,size(z)
  z(jj) = z(jj) + A(jj,ii) * x(ii)
end do

 

This should also vectorize. You can add OpenMP as well, but note that every iteration of the do ii loop updates all of z(:), so parallelize over rows (or blocks of rows) of z rather than over ii to avoid a data race.
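A sketch of one race-free way to combine vectorization and OpenMP (the sizes and the block count are illustrative, not a definitive implementation): each thread owns a contiguous block of rows of z, so no two threads touch the same elements and the column updates stay unit-stride:

```fortran
program matvec_omp
   implicit none
   integer, parameter :: m = 20000, n = 100
   real(8), allocatable :: A(:,:), x(:), z(:)
   integer :: ii, ib, j0, j1, nblk, blk

   allocate (A(m,n), x(n), z(m))
   call random_number(A)
   call random_number(x)

   nblk = 64                       ! number of row blocks (tuning choice)
   blk  = (m + nblk - 1) / nblk    ! rows per block

   !$omp parallel do private(ib, j0, j1, ii)
   do ib = 1, nblk
      j0 = (ib - 1)*blk + 1
      j1 = min(ib*blk, m)
      if (j0 > m) cycle
      z(j0:j1) = 0.0d0
      do ii = 1, n                 ! columns: each slice update is unit-stride
         z(j0:j1) = z(j0:j1) + A(j0:j1, ii)*x(ii)
      end do
   end do
   !$omp end parallel do

   ! sanity check against the intrinsic matmul
   if (maxval(abs(z - matmul(A, x))) > 1.0d-6) error stop 'mismatch'
   print *, 'A*x ok'
end program matvec_omp
```

The !$omp lines are plain comments when OpenMP is not enabled, so the same code also runs serially.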

 

MKL is very good for matrix operations (matmul, transpose, etc.), but for simple operations such as the loop above, it may be better to simply code it inline.
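For reference, the MKL/BLAS routine covering both products is dgemv, which takes a 'N'/'T' flag for the normal or transposed multiply. A minimal sketch, assuming you link against MKL (e.g. with -qmkl); the sizes are illustrative:

```fortran
program gemv_demo
   implicit none
   external :: dgemv
   integer, parameter :: m = 2000, n = 150
   real(8) :: A(m,n), x(n), y(m), z(m), v(n)

   call random_number(A)
   call random_number(x)
   call random_number(y)

   ! z = A*x            (trans = 'N')
   call dgemv('N', m, n, 1.0d0, A, m, x, 1, 0.0d0, z, 1)
   ! v = transpose(A)*y (trans = 'T'); threaded MKL parallelizes internally
   call dgemv('T', m, n, 1.0d0, A, m, y, 1, 0.0d0, v, 1)

   ! sanity checks against the intrinsics
   if (maxval(abs(z - matmul(A, x))) > 1.0d-6) error stop 'A*x mismatch'
   if (maxval(abs(v - matmul(transpose(A), y))) > 1.0d-6) error stop 'A**T*y mismatch'
   print *, 'dgemv ok'
end program gemv_demo
```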

 

Be aware of issues of oversubscription of OpenMP (or other) threaded applications together with the threaded MKL library.

 

Jim Dempsey

Ioannis_K_
New Contributor I

Jim,

 

z(:) has the same size as the number of rows in A(:,:). So yes, a worst-case scenario would have 10 million elements in z.

I intend to have z(:) as a module variable with the SAVE statement (as I need it in other parts of my code). The coefficient matrix A(:,:) and the vector x(:) are also module variables. That is, they are not temporary variables in the routine.

Since you mentioned it, I have benchmarked my code for cases where I use the stack or store everything on the heap (through the "-heap-arrays 0" option), and I did not see any difference in speed.

 

The only question that I have remaining is:

What is the optimal way to code the multiplication transpose(A)*y?

I store this result in a local vector variable v() (the maximum number of components in v() is 1000, so I do not imagine there would be any problem with stack overflow). My obvious approach would be to have a "transpose version" of my algorithm for A*x:

 

v = 0      ! v is an n-component vector
do ii = 1,m
   v(:) = v(:) + A(ii,:)*y(ii)
end do !ii

 

Will this be optimal? My main concern is that the above loop relies on a linear combination of the rows of A(:,:), and the components of a given row are not in neighboring memory locations....
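For comparison, here is a sketch of a column-oriented variant that avoids the strided row accesses entirely: each component of v is a dot product of y with one contiguous column of A (sizes are illustrative):

```fortran
program atx_demo
   implicit none
   integer, parameter :: m = 20000, n = 100
   real(8), allocatable :: A(:,:), y(:), v(:)
   integer :: jj

   allocate (A(m,n), y(m), v(n))
   call random_number(A)
   call random_number(y)

   !$omp parallel do private(jj)      ! optional: the columns are independent
   do jj = 1, n
      v(jj) = dot_product(A(:, jj), y)   ! A(:,jj) is contiguous (column-major)
   end do

   ! sanity check against the intrinsics
   if (maxval(abs(v - matmul(transpose(A), y))) > 1.0d-6) error stop 'mismatch'
   print *, 'transpose(A)*y ok'
end program atx_demo
```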

jimdempseyatthecove
Honored Contributor III

>>I intend to have z(:) as a module variable with the SAVE statement

Module variables are persistent and do not need to be attributed with SAVE.

 

RE: inline transpose (swapping indices)

If you only reference transpose(A) once, or very few times, this should be more efficient.

Note: if the target CPU has an instruction set with SIMD scatter/gather support, e.g.

AVX2 with FMA

AVX-512F (all of the recent CPUs supporting AVX-512 include the F variant)

AVX-512PF

AVX-512VL

then set the compiler optimization for either:

a) only supporting CPUs with scatter/gather (-x<code> or /Qx<code>, -march=<processor> or /arch=<processor>)

b) an alternate code path for CPUs with scatter/gather (-ax<code> or /Qax<code>)

 

Also, you may need to coax the compiler into emitting the gather SIMD instructions with !$omp simd / !$omp end simd directives.

Verify with VTune disassembly.
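A sketch of where such a directive could go for the row-oriented update (the stride-m loads from A are what the gather instructions would serve; the sizes are illustrative):

```fortran
program simd_gather_demo
   implicit none
   integer, parameter :: m = 5000, n = 100
   real(8), allocatable :: A(:,:), y(:), v(:)
   integer :: ii, jj

   allocate (A(m,n), y(m), v(n))
   call random_number(A)
   call random_number(y)
   v = 0.0d0

   do ii = 1, m
      !$omp simd
      do jj = 1, n
         v(jj) = v(jj) + A(ii, jj)*y(ii)   ! stride-m access to A -> gather
      end do
   end do

   ! sanity check against the intrinsics
   if (maxval(abs(v - matmul(transpose(A), y))) > 1.0d-6) error stop 'mismatch'
   print *, 'simd row-update ok'
end program simd_gather_demo
```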

 

Jim

 
