<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic GEMM and First Touch Memory Allocation in Intel® Moderncode for Parallel Architectures</title>
    <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/GEMM-and-First-Touch-Memory-Allocation/m-p/1126885#M7633</link>
    <description>Forum thread: experiments with a Fortran OpenMP implementation of GEMM (C &amp;lt;- alpha*AB + beta*C) on a Sandy Bridge Xeon node, asking why first-touch NUMA placement of the matrices had little measurable performance impact.</description>
    <pubDate>Mon, 22 Jan 2018 18:48:48 GMT</pubDate>
    <dc:creator>Groth__Brandon</dc:creator>
    <dc:date>2018-01-22T18:48:48Z</dc:date>
    <item>
      <title>GEMM and First Touch Memory Allocation</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/GEMM-and-First-Touch-Memory-Allocation/m-p/1126885#M7633</link>
      <description>&lt;P&gt;&lt;SPAN style="font-size: 1em;"&gt;I am doing some experiments with various implementations of the GEMM operation C &amp;lt;- alpha*AB + beta*C. One question that came up is whether memory locality has any measurable impact on performance on a Sandy Bridge Xeon node. The following is one implementation of GEMM (I have many others, including MKL's SGEMM(...)):&lt;/SPAN&gt;&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;call RANDOM_NUMBER(A)
call RANDOM_NUMBER(B)
call RANDOM_NUMBER(C)

!$OMP PARALLEL DO SHARED(A,B,C) PRIVATE(i,j,l,sum)
do j = 1, n
  do i = 1, m
    sum = 0.0
    do l = 1, k
      sum = sum + A(i,l)*B(l,j)
    enddo
    C(i,j) = alpha*sum + beta*C(i,j)
  enddo
enddo
!$OMP END PARALLEL DO&lt;/PRE&gt;

&lt;P&gt;This code uses stride-1 memory access where possible, but that clearly isn't possible for matrix A. Because RANDOM_NUMBER is called outside any parallel region, all three matrices are first-touched by thread 0 and therefore placed in its NUMA node's memory. This should make memory access quite expensive, especially for threads on the other socket (i.e., threads 8-15 when running 16 threads).&lt;/P&gt;

&lt;P&gt;Some observations about how the loop works:&lt;/P&gt;

&lt;P&gt;1) Matrix A is needed in full by every thread, since it has no dependence on loop j (i.e., the loop being distributed across threads)&lt;/P&gt;

&lt;P&gt;2) Each thread needs only the columns of B and C that correspond to its iterations of loop j&lt;/P&gt;

&lt;P&gt;A potential solution using first touch with static scheduling could be this:&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;!$OMP PARALLEL SHARED(A,B,C,m,n) PRIVATE(i,j)

! Matrix A is needed on every thread, so we spread out rows of A evenly to all threads
!$OMP DO
do i = 1, m
    call RANDOM_NUMBER( A(i,:) )
end do
!$OMP END DO

! Only columns of B and C are needed on each thread
!$OMP DO
do j = 1, n
  call RANDOM_NUMBER( B(:,j) )
  call RANDOM_NUMBER( C(:,j) ) 
end do
!$OMP END DO

!$OMP END PARALLEL&lt;/PRE&gt;

&lt;P&gt;... Continue algorithm ...&lt;/P&gt;
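
&lt;P&gt;For the first touch to line up with the computation, the compute loop should use the same static schedule over j, so each thread works on the columns of B and C that it initialized. A sketch (the same loop as above, with the schedule made explicit):&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;!$OMP PARALLEL DO SHARED(A,B,C) PRIVATE(i,j,l,sum) SCHEDULE(STATIC)
do j = 1, n            ! same static distribution as the initialization loop
  do i = 1, m
    sum = 0.0
    do l = 1, k
      sum = sum + A(i,l)*B(l,j)
    enddo
    C(i,j) = alpha*sum + beta*C(i,j)
  enddo
enddo
!$OMP END PARALLEL DO&lt;/PRE&gt;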

&lt;P&gt;Giving the threads near-perfect locality for two of the three matrices hardly had any performance impact: I observed only a 1-2% difference in execution times, which is not conclusive evidence that the first-touch allocation is doing anything. Even when we tried the same thing with a pre-transposed copy of A, to get stride-1 access in the inner loop:&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;Atr = TRANSPOSE(A)   ! now use Atr(l,i) instead of A(i,l) in the do loop&lt;/PRE&gt;

&lt;P&gt;we got similar percentage differences in run times. The experiments used square matrices A, B, and C of size 5000x5000.&lt;/P&gt;
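
&lt;P&gt;To confirm where the pages actually landed, one can inspect the per-node memory of the running process (assuming Linux with the numactl package installed; the binary name here is hypothetical):&lt;/P&gt;

&lt;PRE class="brush:bash;"&gt;./gemm_test &amp;amp;      # hypothetical test binary, run in the background
numastat -p $!        # per-NUMA-node memory usage of that process&lt;/PRE&gt;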

&lt;P&gt;Does anyone have any insight as to what can be done?&lt;/P&gt;</description>
      <pubDate>Mon, 22 Jan 2018 18:48:48 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/GEMM-and-First-Touch-Memory-Allocation/m-p/1126885#M7633</guid>
      <dc:creator>Groth__Brandon</dc:creator>
      <dc:date>2018-01-22T18:48:48Z</dc:date>
    </item>
    <item>
      <title>You're probably getting</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/GEMM-and-First-Touch-Memory-Allocation/m-p/1126886#M7634</link>
      <description>&lt;P&gt;You're probably getting fairly effective use of cache, so it's not surprising that first touch doesn't have much effect. It may be difficult to get more than 50% of peak performance with straightforward source code, and first touch will have less relative effect if you don't use an optimized library version. You say it's clear you aren't getting a stride-1 inner loop, but normal compiler loop interchange can produce one, so you aren't actually preventing stride-1 access unless you set compiler options for that purpose.&lt;/P&gt;

&lt;P&gt;Because blocking strategies depend on matrix size, even an expert can't guess what you should do without that information. Even MKL now expects you to choose a different implementation for small matrices than the one for medium and large sizes, which selects blocking strategies according to problem size.&lt;/P&gt;
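
&lt;P&gt;As an illustration of blocking only, not a tuned implementation (the block size bs is an arbitrary choice, and C is assumed to have been pre-scaled by beta beforehand), the loop nest above could be tiled like this:&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;! Illustrative tiling sketch; bs is untuned, and C has been pre-scaled by beta.
bs = 64
!$OMP PARALLEL DO SHARED(A,B,C) PRIVATE(i,j,l,jj,ll,sum) SCHEDULE(STATIC)
do jj = 1, n, bs
  do ll = 1, k, bs
    do j = jj, min(jj+bs-1, n)
      do i = 1, m
        sum = 0.0
        do l = ll, min(ll+bs-1, k)
          sum = sum + A(i,l)*B(l,j)
        enddo
        C(i,j) = C(i,j) + alpha*sum   ! accumulate partial block products
      enddo
    enddo
  enddo
enddo
!$OMP END PARALLEL DO&lt;/PRE&gt;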

&lt;P&gt;The "modern code" strategy expects you to use appropriate libraries rather than spend your time reinventing wheels.&lt;/P&gt;</description>
      <pubDate>Mon, 22 Jan 2018 23:13:57 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/GEMM-and-First-Touch-Memory-Allocation/m-p/1126886#M7634</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2018-01-22T23:13:57Z</dc:date>
    </item>
    <item>
      <title>I have not seen much impact</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/GEMM-and-First-Touch-Memory-Allocation/m-p/1126887#M7635</link>
      <description>&lt;P&gt;I have not seen much impact from poor first-touch affinity on Sandy Bridge, but I have definitely seen it on Haswell and Skylake Xeon.&lt;/P&gt;

&lt;P&gt;On Linux, I have found the following command to be adequate to obtain good scaling on Haswell and Skylake Xeon:&lt;/P&gt;

&lt;BLOCKQUOTE&gt;
	&lt;P&gt;numactl --interleave=0,1 ./dgemm_test 5000 5000 5000&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
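
&lt;P&gt;Note that first-touch placement only helps if the threads stay where they first touched the data, so binding threads is a prerequisite. A sketch using the standard OpenMP 4.0 affinity environment variables:&lt;/P&gt;

&lt;PRE class="brush:bash;"&gt;# Pin threads so first-touch placement is stable across the run
export OMP_NUM_THREADS=16
export OMP_PLACES=cores
export OMP_PROC_BIND=spread
./dgemm_test 5000 5000 5000&lt;/PRE&gt;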

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 23 Jan 2018 23:45:32 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/GEMM-and-First-Touch-Memory-Allocation/m-p/1126887#M7635</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2018-01-23T23:45:32Z</dc:date>
    </item>
  </channel>
</rss>

