Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Performance penalty for types (type matrix versus arrays)

KNAnagnostopoulos
New Contributor I
1,460 Views

I define a very simple type for a matrix:

 type MyMatrix
   integer                  :: n = 0
   complex(dp), allocatable :: v(:,:)
 end type MyMatrix

Operations on this type then turn out to have a large overhead for small N x N matrices. For example, for matrix multiplication (which is what I am mostly interested in)
 

type(MyMatrix) ::  m1, m2, m3

complex(dp), allocatable ::    a1(:,:), a2(:,:), a3(:,:)

m3 = m1 * m2;  m3 = MyMatmul(m1,m2); a3 = MyMatmul(a2,a3)

 

for N=60 gives a performance penalty of about 1.6x. For the more complex matrix type in my real program, which also contains procedure components, the penalty can be up to 2.5x! The comparison is done using an interface that simply passes m1 % v and m2 % v to BLAS's zgemm. The operator(*) is overloaded to MyMatmul (same performance as the function call).
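For concreteness, a wrapper of the kind described above might look as follows. This is a minimal sketch based on the description, not the attached source; the module name, the interface details, and the assumption of square matrices are mine.

```fortran
! Hypothetical sketch of MyMatmul wrapping BLAS zgemm, with operator(*)
! overloaded to it. Assumes square n x n matrices; the attached code
! may differ.
module mymatrix_mod
  implicit none
  integer, parameter :: dp = kind(1.0d0)

  type MyMatrix
    integer                  :: n = 0
    complex(dp), allocatable :: v(:,:)
  end type MyMatrix

  interface operator(*)
    module procedure MyMatmul
  end interface

contains

  function MyMatmul(m1, m2) result(m3)
    type(MyMatrix), intent(in) :: m1, m2
    type(MyMatrix)             :: m3
    complex(dp), parameter :: one  = (1.0_dp, 0.0_dp)
    complex(dp), parameter :: zero = (0.0_dp, 0.0_dp)
    m3 % n = m1 % n
    allocate(m3 % v(m3 % n, m3 % n))
    ! BLAS: C := alpha*op(A)*op(B) + beta*C
    call zgemm('N', 'N', m1 % n, m2 % n, m1 % n, one, &
               m1 % v, m1 % n, m2 % v, m2 % n, zero, m3 % v, m3 % n)
  end function MyMatmul

end module mymatrix_mod
```

The allocation of m3 % v on every call is one plausible source of the small-N overhead being measured here.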

The penalty decreases for larger matrices, reaching 10% for N=1000 and 5% for N=4000. In my real program, the penalty remains at 10% up to N=16000.

The performance is compared using wall time.

Is this unavoidable?

-------------------------------------------------------------------------------------------------------------

I am attaching simplified source code that focuses on this particular problem.

Benchmarks on an i7-4790 CPU @ 3.60GHz, Ubuntu 18.04:

Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 19.0.5.281 Build 20190815

ifort -O3 -heap-arrays -xAVX  -parallel benchmarkMyMatrix.f90 -mkl=parallel 

export  OMP_NUM_THREADS=8
export  MKL_NUM_THREADS=8

and on the ARIS supercomputer (Intel Xeon E5-2680v2, https://hpc.grnet.gr/hardware/#hardware-overview )

compiler_version: Intel(R) Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 18.0.0.128 Build 20170811
compiler_options: -I/apps/compilers/intel/18.0.0/mkl/include -c -O3 -heap-arrays -xCORE-AVX-I -parallel -mkl=parallel

export  OMP_NUM_THREADS=20
export  MKL_NUM_THREADS=20

Indicative wall times for each operation (wall-time seconds per matrix multiplication, averaged over several thousand random matrices, with fewer samples for N=4000, on the i7 mentioned above):

N     MatmulMyMatrix  MatmulArray2  ratio
30    0.63192E-05     0.38982E-05   1.62
60    0.21967E-04     0.13776E-04   1.59
120   0.10932E-03     0.79098E-04   1.38
190   0.36823E-03     0.30644E-03   1.20
250   0.78471E-03     0.65853E-03   1.19
500   0.64193E-02     0.53616E-02   1.20
1000  0.45260E-01     0.41173E-01   1.10
2000  0.34830         0.32136       1.08
4000  2.5440          2.4220        1.05

 

23 Replies
FortranFan
Honored Contributor II

Konstantinos Anagnostopoulos wrote:

Some updates:

1. With ifort 19.1.1.217, I got rid of the segmentation faults for the parameterized derived types I reported above (ifort 19.0.5.281, -O3 -heap-arrays -xHost -parallel -mkl=parallel). The -g option remains as problematic as before.
2. I compared parameterized derived types (FortranFan's suggestion) against ordinary types, allocatable arrays, and statically allocated arrays. When the matrix multiplication is done via a subroutine, there is no noticeable time difference for N>=30.
3. With operator overloading, the simple type is faster than the parameterized type for, roughly, N<=1000.
4. For ordinary allocatable arrays, using a function or a subroutine for zgemm makes no difference, and the same holds for statically allocated arrays.
5. The intrinsic MATMUL is slower than zgemm for matrix-matrix multiplication for N < 300. The maximum overhead ranges from 50% to 80%, depending on processor and ifort version (see attached array_benchark.README).
6. Overloading multiplication for allocatable arrays (operator(.mm.), such that m3 = m1 .mm. m2) adds less than 10% overhead for N<=500 (desktop) and N<=2000 (supercomputer).
   The maximum overhead is 20% on the desktop (N=60) and 40% on the supercomputer (N=120) (see attached array_benchark.README).
7. The expected scaling (~N**3 for matrix-matrix, ~N**2 for matrix-vector) sets in for N>=N0, where N0=120-750 for matrix-matrix and N0=1000-2000 for matrix-vector, depending on compiler version and processor (see attached array_benchark.README).
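The .mm. operator from point 6 could be implemented roughly as follows. This is my sketch of what such an overload typically looks like, not the benchmarked code from array_benchark.README.

```fortran
! Hypothetical sketch of operator(.mm.) for plain complex allocatable
! arrays, dispatching to BLAS zgemm. The benchmarked version may differ.
module mm_op
  implicit none
  integer, parameter :: dp = kind(1.0d0)

  interface operator(.mm.)
    module procedure zgemm_mul
  end interface

contains

  function zgemm_mul(a, b) result(c)
    complex(dp), intent(in) :: a(:,:), b(:,:)
    complex(dp)             :: c(size(a,1), size(b,2))
    ! C := 1*A*B + 0*C
    call zgemm('N', 'N', size(a,1), size(b,2), size(a,2), &
               (1.0_dp, 0.0_dp), a, size(a,1), b, size(b,1), &
               (0.0_dp, 0.0_dp), c, size(c,1))
  end function zgemm_mul

end module mm_op
```

Usage is then simply `a3 = a1 .mm. a2`; the function result is an automatic array, so, unlike an explicit subroutine call, each use still pays for a temporary plus a copy on assignment, which is consistent with the small-N overhead reported in point 6.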

Thanks much for sharing the results.  But there are a couple of aspects you may want to look into:

  • For all N less than a certain threshold (perhaps around 500?), the results appear not to have anything to do with matrix operations; rather, the timing values might be dictated by your shuffle_array2 routine, which is mostly PRNG invocations and random accesses. You may want to look into this.
  • Your case (4) results appear somewhat inconsistent with what you show in Quote #14, e.g., at N=500, in the relative difference between columns (2) and (5), which are effectively what is shown in Quote #14. I wonder if this too has to do with your shuffle_array2 routine. You may want to look into this.
KNAnagnostopoulos
New Contributor I

FortranFan wrote:

Thanks much for sharing the results.  But there are a couple of aspects you may want to look into:

  • For all N less than a certain threshold (perhaps around 500?), the results appear not to have anything to do with matrix operations; rather, the timing values might be dictated by your shuffle_array2 routine, which is mostly PRNG invocations and random accesses. You may want to look into this.
  • Your case (4) results appear somewhat inconsistent with what you show in Quote #14, e.g., at N=500, in the relative difference between columns (2) and (5), which are effectively what is shown in Quote #14. I wonder if this too has to do with your shuffle_array2 routine. You may want to look into this.

 

The reason I introduced shuffle_array2 was my fear that a clever compiler would notice that (some of) the data is not changing and optimize the computation away, invalidating the results. I have not been very careful, but:

1. The shuffle_array2 routine changes only 5 random elements of the arrays, independently of N. The overhead of random_number is therefore independent of N, so if the times depend on N, that is not coming from random_number. The number of changes does not scale with N, but of course the stride of the accesses changes, since we touch random elements of the array, and that can introduce some N dependence. I would guess, however, that this costs less than the matrix multiplication itself.

But whatever the effect, for a given N it does not change the relative times of the methods.

2. The call to the wtime() routine also introduces some bias, since the smaller the matrix, the more often it is called (I fix the time of the calculation, not the number of calls). Do you think this can add a significant overhead? I can easily check that by restructuring the code. I have no idea how system_clock performs.

If 2. is significant, then maybe subtracting the time spent in shuffle_array2 will not be a good idea.
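For context, a wtime()-style wall-clock function is commonly implemented on top of the SYSTEM_CLOCK intrinsic along these lines (my assumption; the wtime in the attached code may be implemented differently):

```fortran
! A common SYSTEM_CLOCK-based wall-clock timer; the attached wtime()
! may differ. Uses int64 counters to avoid early wraparound.
function wtime() result(t)
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp)           :: t
  integer(int64)     :: count, rate
  call system_clock(count, rate)      ! ticks since an arbitrary epoch
  t = real(count, dp) / real(rate, dp)
end function wtime
```

A single SYSTEM_CLOCK call typically costs on the order of tens of nanoseconds, which is small compared even to the ~6 microseconds measured above for an N=30 multiplication, so the timer itself is unlikely to dominate.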

I can also update, e.g., the first 100 elements of the array via a cheap deterministic procedure (e.g. adding 1 to some elements, using the MOD function as a random-number-generator substitute), or remove the shuffling completely if you think my worries about a clever compiler are overblown. A very clever compiler could in principle also discover my trick of updating a small fixed portion of a large array and artificially reduce the effort of the calculation.
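To make the discussion concrete, the shuffle described in point 1 could be sketched as follows. This is my reconstruction from the description above (5 random elements, independent of N), not the attached routine.

```fortran
! Hypothetical sketch of shuffle_array2: perturb 5 random elements so
! the compiler cannot hoist the multiplication out of the timing loop.
! O(1) work per call, independent of N. The attached code may differ.
subroutine shuffle_array2(a)
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  complex(dp), intent(inout) :: a(:,:)
  real(dp) :: r(2), x(2)
  integer  :: k, i, j
  do k = 1, 5
    call random_number(r)             ! pick a random position
    i = 1 + int(r(1) * size(a, 1))
    j = 1 + int(r(2) * size(a, 2))
    call random_number(x)             ! overwrite it with a random value
    a(i, j) = cmplx(x(1), x(2), kind=dp)
  end do
end subroutine shuffle_array2
```

As noted above, the PRNG cost is fixed, but the five writes land at random addresses, so cache behavior can still introduce a mild N dependence.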

 

 

FortranFan
Honored Contributor II

Konstantinos Anagnostopoulos wrote:

The reason I introduced shuffle_array2 was my fear that a clever compiler would see that (some of) the data is not changing and invalidate the results. ..

Do you think (wtime) can put a significant overhead? ..

In your post in Quote #14, you posted results for N=500 with a simple example. Would it be possible for you to run the example from Quote #14 with a couple of smaller values of N, say 250 and 120, on the same platform where you ran your cases in Quote #21, and compare with columns (2), (4), and (5) that you show in Quote #21?

I personally don't think 'wtime' introduces significant overhead in this context, but other readers may have better input on this.

Thanks,

 
