Intel® Fortran Compiler

Performance questions under Windows using 3 different architectures

François-Xavier
Beginner
Dear Intel Fortran Forum,

I don't know if this is the best place to post this reflection/question. If not, don't hesitate to say so.

-> I have downloaded the following test code (matrix multiplication):

[bash]C******************************************************************************
C FILE: omp_mm.f
C DESCRIPTION:  
C   OpenMp Example - Matrix Multiply - Fortran Version 
C   Demonstrates a matrix multiply using OpenMP. Threads share row iterations
C   according to a predefined chunk size.
C AUTHOR: Blaise Barney
C LAST REVISED: 1/5/04 Blaise Barney
C******************************************************************************

      PROGRAM MATMULT
      
      USE OMP_LIB

      INTEGER  NRA, NCA, NCB, TID, NTHREADS, I, J, K, CHUNK
C     number of rows in matrix A 
      PARAMETER (NRA=1000)
C     number of columns in matrix A
      PARAMETER (NCA=1000)
C     number of columns in matrix B
      PARAMETER (NCB=10000)

      REAL*8 A(NRA,NCA), B(NCA,NCB), C(NRA,NCB)
      REAL*8 YSTART, YSTOP, YINIT, YMULT
C
C     USER DIALOG
C
      WRITE(*,*) 'Number of threads? '
      READ(5,*) nthreads
      CALL omp_set_num_threads(nthreads)
C
      YSTART = omp_get_wtime()
C
C     Set loop iteration chunk size 
      CHUNK = 10
C
C     Spawn a parallel region explicitly scoping all variables
!$OMP PARALLEL SHARED(A,B,C,NTHREADS,CHUNK) PRIVATE(TID,I,J,K)
      TID = OMP_GET_THREAD_NUM()
      IF (TID .EQ. 0) THEN
        NTHREADS = OMP_GET_NUM_THREADS()
        PRINT *, 'Starting matrix multiple example with', NTHREADS,
     +           'threads'
        PRINT *, 'Initializing matrices'
      END IF

C     Initialize matrices
!$OMP DO SCHEDULE(STATIC, CHUNK)
      DO 30 I=1, NRA
        DO 30 J=1, NCA
          A(I,J) = (I-1)+(J-1)
  30  CONTINUE
!$OMP DO SCHEDULE(STATIC, CHUNK)
      DO 40 I=1, NCA
        DO 40 J=1, NCB
          B(I,J) = (I-1)*(J-1)
  40  CONTINUE
!$OMP DO SCHEDULE(STATIC, CHUNK)
      DO 50 I=1, NRA
        DO 50 J=1, NCB
          C(I,J) = 0
  50  CONTINUE
  
      YINIT= omp_get_wtime()

C     Do matrix multiply sharing iterations on outer loop
C     Display who does which iterations for demonstration purposes
!$OMP DO SCHEDULE(STATIC, CHUNK)
      DO 60 I=1, NRA
        DO 60 J=1, NCB
          DO 60 K=1, NCA
            C(I,J) = C(I,J) + A(I,K) * B(K,J)
  60  CONTINUE
  
      YMULT= omp_get_wtime()

C     End of parallel region 
!$OMP END PARALLEL
C
      YSTOP = omp_get_wtime()
C     Print results
      PRINT *, '******************************************************'
      PRINT *, 'Performances:'
      PRINT *, 'TOTAL:',YSTOP-YSTART,'sec.'
      PRINT *, 'INIT:', YINIT-YSTART,'sec.'
      PRINT *, 'PROC:', YMULT-YINIT,'sec.'
      PRINT *, '******************************************************'
      PRINT *, 'Done.'

      END
[/bash]


-> I ran the program on 1 thread on 3 different machines (Intel Fortran 11 compiler, Windows HPC Server 2008):

1: Dell Latitude E6500 / Intel Core 2 Duo P8400 (2 cores, 2.26 GHz)
2: HP BL460c G6 / 2x Intel Xeon E5530 (Nehalem, 4 cores, 2.4 GHz)
3: HP BL685c G7 / 4x AMD Opteron 6130 (Magny-Cours, 12 cores, 2.2 GHz)

-> Performance results (compiled with /Ox, otherwise default flags):

1: (Dell) 62.9 sec
2: (BL460C) 97 sec
3: (BL685C) 162 sec

-> I corrected loop 60 by interchanging the I (NRA) and J (NCB) loops:

[bash]      DO 60 J=1, NCB
        DO 60 I=1, NRA
          DO 60 K=1, NCA
            C(I,J) = C(I,J) + A(I,K) * B(K,J)
  60  CONTINUE[/bash]


-> Performance results are:

1: (BL460C) 8 sec
2: (Dell) 17 sec
3: (BL685C) 17 sec

These are my questions:

1: Before the correction, how can I explain these performance differences? Are these cache or memory issues?

2: After the correction, how can I explain the big gap (roughly 2x) between the AMD Magny-Cours and the Intel Nehalem, when there is only a 200 MHz clock difference?

I am not doing this in my spare time; this is really important for us to understand, since we are optimizing architectures for real-time simulators.

I thank you for your tips.

F-Xavier
mecej4
Honored Contributor III
Your DO 30, DO 40, and DO 50 loops will also benefit from having the order of the I and J loops interchanged, based on the principle that the leftmost subscript should vary fastest, so that memory/cache is accessed with unit stride.

These changes will likely reduce the run time by a factor of 3 to 4. On my PC, a Core 2 Duo E8400 running Linux x64 and IFort 11.1.073, I obtain a run time of 2.73 s with two threads.
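For illustration, a sketch of the first initialization loop with the interchange applied (the same change applies to loops 40 and 50):

[bash]C     Interchanged: I, the leftmost subscript of A, now varies in
C     the innermost loop, giving unit-stride memory access
!$OMP DO SCHEDULE(STATIC, CHUNK)
      DO 30 J=1, NCA
        DO 30 I=1, NRA
          A(I,J) = (I-1)+(J-1)
  30  CONTINUE[/bash]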
TimP
Honored Contributor III
Assuming you set -O3, the optimizations applied to the 2 inner loops of the matrix multiply are difficult to understand. Basically, allowing the compiler to make a stride-1 inner loop is required for efficient vectorization and for avoiding false sharing of cache lines among the threads. Under OpenMP, the compiler isn't allowed to second-guess you about which loops are parallelized. You might think that collapse(2) would help with the more useful cases of smaller NCB, but you would then need to do more to help the compiler get the best inner loop; see the sketch below. If you set -Qopt-report, you should get reports on loop vectorization and interchange.
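To illustrate, a collapse(2) sketch on the corrected loop order (not tested; collapse requires an OpenMP 3.0 compiler, which ifort 11.x is):

[bash]C     Sketch only: collapse the two outer loops so OpenMP distributes
C     the combined NCB*NRA iteration space. The innermost K loop is
C     then a dot-product reduction (stride 1 on B, strided on A), so
C     more work is still needed to get the best inner loop.
!$OMP DO SCHEDULE(STATIC, CHUNK) COLLAPSE(2)
      DO J=1, NCB
        DO I=1, NRA
          DO K=1, NCA
            C(I,J) = C(I,J) + A(I,K) * B(K,J)
          END DO
        END DO
      END DO[/bash]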
It's possible the compiler could find useful SSE4 optimizations if you allowed it on the appropriate Intel CPUs.
I doubt that schedule options other than default will be good for this, but you're entitled to experiment.
You must learn the usage of the KMP_AFFINITY environment variable. For an Intel CPU without HyperThreading enabled, SET KMP_AFFINITY=compact should work. With HyperThreading, you might start with KMP_AFFINITY=compact,1,verbose and set OMP_NUM_THREADS to the number of cores, to check whether you have correctly placed 1 thread on each core.
For the AMD machine, you need to understand the BIOS numbering of the cores. Supposing they are numbered in order, with 0-5 on one die of CPU 0, 6-11 on the other die of CPU 0, and so on, then it may be as simple as KMP_AFFINITY="proclist=[0-23],explicit"
It will be interesting to see if the docs for the next major compiler version are able to explain how to use KMP_AFFINITY on the AMD platform. Of course, that one is too new to have been dealt with for 11.1.
jimdempseyatthecove
Honored Contributor III

F-Xavier,

You might be interested in the .PDF file of a 5-part article I recently posted on the Intel Software Network Parallel Programming Community:

http://www.quickthreadprogramming.com/Superscalar_programming_101_parts_1-5.pdf

This article addresses various strategies for matrix multiply on several different platforms. At the bottom of the .PDF is an appendix (not available on the ISN website) that contains the source code for the test program (sans the Cilk++ program).

Note: although the best-performing tests were written for use with QuickThread, some reasonably good methods that do not use QuickThread are present in the test program. These you should be able to adapt for your purposes.

Notes:

The code is written in C++; any good programmer can convert it to Fortran, and I am sure you will have no trouble with this.

The QuickThread code used for the samples in the article did not incorporate tiling, so you will note that performance drops off precipitously once the array size exceeds the last-level cache (L3 or L2, as the case may be). The Cilk++ technique, on the other hand, did incorporate tiling, so eventually there was a crossover in performance shortly after the dataset size exceeded the cache size. I did not have the time to incorporate tiling into the QuickThread Parallel Tag Team Transpose method in the test program. As time permits I will incorporate tiling as well as different strategies.

The purpose of the article was not expressly to show you how to perform matrix multiplication in parallel.
Rather, the purpose was to show you how to coordinate cache-level sharing, principally between SMT (HT) siblings, which share the L1 and L2 caches on most processors. When you can coordinate the work performed by the HT siblings so that one HT thread reuses the data its sibling has already loaded into L1 and L2, you can experience a rather dramatic performance boost. Today's SSE instruction latencies are on the order of the L1 cache latency, where representative L1, L2, L3, and memory latencies are 4, 10, 56, 254 on one processor and 4, 9, 47, 226 on another. When you can avoid the L1 and L2 cache evictions by the HT siblings, the L1 SSE latencies might double (due to the single shared FPU resource), while the latencies to L2 and L3 will remain relatively the same.

The real advantage of the Parallel Tag Team approach is that one of the SMT siblings pays the price of reading from RAM (226) into L1, while the other SMT sibling gets access via L1 (4, or 8 if you wish).

Jim Dempsey

TimP
Honored Contributor III
As we've now pointed out most of what you were supposed to do yourself, if this is a homework problem:
The 4 loops on J appear eligible to be fused into a single parallel loop, making it clear that the array segments assigned to each thread can remain in cache; see the sketch below.
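A sketch of a partial fusion (assuming A is initialized in its own preceding !$OMP DO, whose implicit barrier guarantees A is complete before the multiply; the B and C columns a thread initializes are then the ones it consumes):

[bash]C     Fuse the B-init, C-init, and multiply loops over J, so column J
C     of B and C stays in the cache of the thread that owns it
!$OMP DO SCHEDULE(STATIC, CHUNK)
      DO J=1, NCB
        DO I=1, NCA
          B(I,J) = (I-1)*(J-1)
        END DO
        DO I=1, NRA
          C(I,J) = 0
        END DO
        DO K=1, NCA
          DO I=1, NRA
            C(I,J) = C(I,J) + A(I,K) * B(K,J)
          END DO
        END DO
      END DO[/bash]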
Rather than checking the thread id in every thread and then executing the single-thread code only on thread 0, it should be more efficient, as the number of threads increases, simply to use an !$omp single region; a sketch follows.
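For example (a sketch of a replacement for the TID test at the top of the parallel region; note that, unlike the IF test, END SINGLE carries an implicit barrier):

[bash]C     One thread (not necessarily thread 0) executes this block;
C     the others wait at the implicit barrier of END SINGLE
!$OMP SINGLE
      NTHREADS = OMP_GET_NUM_THREADS()
      PRINT *, 'Starting matrix multiply example with', NTHREADS,
     +         'threads'
      PRINT *, 'Initializing matrices'
!$OMP END SINGLE[/bash]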
François-Xavier
Beginner
Dear all,

I have read your posts carefully; as usual, they contain valuable information.

Regarding this "school" problem, you are right in what you say, and I agree with the proposed optimisations.

In fact, my key problem was that, in sequential runs, I noticed such a large difference between a very recent AMD CPU and an Intel CPU across all our industrial Fortran programs. This school problem follows the same pattern.

Thank you for the links, tips, and so on about OpenMP; some of them were already known to me, but some absolutely were not. Maybe I will soon ask you some more questions.

François-Xavier