Hi,
I found a very strange thing: adding more memory allocations leads to significantly more data load instructions and CPU_TIME.
I know this sounds weird, so I have posted the code below to illustrate my problem (I have tried my best to simplify it):
subroutine ARK2(region)

  USE ModGlobal
  USE ModDataStruct
  USE ModIO
  USE ModDerivBuildOps
  USE ModDeriv
  USE ModMetrics
  USE ModAdvection
  USE ModMPI

  Implicit None

  ! ... Incoming variables
  type(t_region), pointer :: region

  ! ... local variables
  integer :: rkStep, i, j, k, ng, ARK2_nStages, ImplicitFlag
  type(t_grid), pointer :: grid
  type(t_mixt), pointer :: state
  type(t_mixt_input), pointer :: input
  real(rfreal), pointer :: cv(:,:), dt(:), b_vec_exp(:), rhs_explicit(:,:,:)
  integer :: nGrids, nCells, nCv

  nGrids = region%nGrids

  if (rk_alloc .eqv. .true.) then
    do ng = 1, nGrids
      grid => region%grid(ng)
      input => grid%input
      call additive_RK_coeff(input, grid, ImplicitFlag)
    end do
    rk_alloc = .false.
    if (myrank == 0) write (*,'(A)') 'PlasComCM: ==> Using ARK2 time integration <=='
  end if

  grid => region%grid(1)
  state => region%state(1)
  input => grid%input
  ARK2_nStages = grid%ARK2_nStages

  ! ----------------------------
  ! ... memory allocation PART 1
  ! ----------------------------
  if (.not.allocated(state%rhs) .eqv. .true.) allocate(state%rhs(grid%nCells, input%nCv))
  if (.not.allocated(state%rhs_explicit) .eqv. .true.) allocate(state%rhs_explicit(grid%nCells, input%nCv, ARK2_nStages))
  if (.not.allocated(state%timeOld) .eqv. .true.) allocate(state%timeOld(grid%nCells))
  if (.not.allocated(state%cfl) .eqv. .true.) allocate(state%cfl(grid%nCells))
  if (.not.allocated(state%cvOld) .eqv. .true.) allocate(state%cvOld(grid%nCells,input%nCv))
  if (.not.allocated(state%dt) .eqv. .true.) allocate(state%dt(grid%nCells))
  if (.not.allocated(time_g) .eqv. .true.) allocate(time_g(grid%nCells))
  if (.not.allocated(timeOld_g) .eqv. .true.) allocate(timeOld_g(grid%nCells))
  if (.not.allocated(dt_g) .eqv. .true.) allocate(dt_g(grid%nCells))
  if (.not.allocated(rhs_explicit_g) .eqv. .true.) allocate(rhs_explicit_g(grid%nCells, input%nCv, ARK2_nStages))
  if (.not.allocated(state_rhs_g) .eqv. .true.) allocate(state_rhs_g(grid%nCells, input%nCv))
  if (.not.allocated(JAC_g) .eqv. .true.) allocate(JAC_g(grid%nCells))
  if (.not.allocated(cv_g) .eqv. .true.) allocate(cv_g(grid%nCells,input%nCv))
  if (.not.allocated(cvOld_g) .eqv. .true.) allocate(cvOld_g(grid%nCells,input%nCv))

  ! ----------------------------------------------------------------------------------
  ! ... memory allocation PART 2 (adding these 5 memory allocations leads to significantly
  ! ... more data load instructions and CPU_TIME for the loop in the bottom!!!)
  ! ----------------------------------------------------------------------------------
  if (.not.allocated(a_mat_exp_g) .eqv. .true.) allocate(a_mat_exp_g(ARK2_nStages,ARK2_nStages))
  if (.not.allocated(a_mat_imp_g) .eqv. .true.) allocate(a_mat_imp_g(ARK2_nStages,ARK2_nStages))
  if (.not.allocated(b_vec_exp_g) .eqv. .true.) allocate(b_vec_exp_g(ARK2_nStages))
  if (.not.allocated(b_vec_imp_g) .eqv. .true.) allocate(b_vec_imp_g(ARK2_nStages))
  if (.not.allocated(c_vec_g) .eqv. .true.) allocate(c_vec_g(ARK2_nStages))

  ! ... dereference pointers
  cv => state%cv
  dt => state%dt
  b_vec_exp => grid%ARK2_b_vec_exp
  rhs_explicit => state%rhs_explicit
  nCv = input%nCv
  nCells = grid%nCells

  ! ----------------------------------------------------------
  ! ... Adding memory allocation PART 2 leads to significantly
  ! ... more data load instructions and CPU_TIME for this loop!!!
  ! ----------------------------------------------------------
  do j = 1, ARK2_nStages
    do k = 1, nCv
      !DIR$ SIMD
      do i = 1, nCells
        cv(i,k) = cv(i,k) + dt(i) * b_vec_exp(j) * rhs_explicit(i,k,j)
      end do
    end do
  end do

end subroutine ARK2
As shown above, ARK2 contains two memory allocation parts and one loop. My findings are:
If I have only memory allocation PART 1 (without PART 2), the loop runs very fast with a small number of data load instructions;
If I have both memory allocation PART 1 and PART 2, the loop runs very slowly with a much larger number of data load instructions.
Using TAU and PAPI, I measured the CPU_TIME and memory behavior of the loop for these two cases; here are the results:
                  CPU_TIME (s)   Data load instructions   L1 cache hits   L2 cache hits   L3 cache hits   Main memory hits
Without PART 2    1.82           3.8E09                   87%             3%              10%             0%
With PART 2       13.24          5.6E10                   99%             1%              0%              0%
We can see that CPU_TIME increases more than 7 times and the number of data load instructions more than 10 times. From the cache usage results, it seems to me that PART 2 adds a lot of useless L1 cache hits. These results are very confusing to me, since PART 2 is just allocating some memory. How can it have such a huge influence on the performance of the loop? There seems to be a limit on the size of the memory that I can allocate before performance suffers.
I promise these results are repeatable. I compiled the code using ifort 13.1.0 with the -O2 and -xHost options plus SIMD directives, and I ran it on an Intel E5 (Sandy Bridge) processor with only one MPI process. I would truly appreciate any hints about what is going on here. Thanks for your time, help, and patience in reading this long story.
Best regards,
Wentao
Just to make things clearer:
(1) The very short loop near the top of ARK2 (the do ng = 1, nGrids setup loop) is negligible. I am profiling the triple-nested loop at the bottom of the subroutine.
(2) CPU_TIME increased more than 7 times (13.24 s / 1.82 s ≈ 7.3) and data load instructions increased more than 10 times (5.6E10 / 3.8E09 ≈ 14.7).
Thanks!
Best regards,
Wentao
Without being able to compile the code (lots of modules not present), it's almost impossible to speculate as to what is going on. The one thing I would suggest is that you verify the compiler did not evaporate the loop you're measuring in the "fast" case. It's not immediately evident that it can legally do so, but the rest of the program would be needed to see.
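For example, here is a minimal standalone sketch of that point (the program name, variable names, and array size are purely illustrative, not taken from the attached code): if nothing ever reads the updated array, an optimizer is free to drop the timed loop entirely, so a "fast" time can be meaningless. Summing and printing the result afterwards keeps the loop alive.

! Minimal sketch: if the result of a timed loop is never used, the
! compiler may delete the loop; consuming the result afterwards
! (here via sum + write) forces the work to actually be done.
program check_loop_kept
  implicit none
  integer, parameter :: rk = kind(1.0d0)
  integer, parameter :: n = 10000000           ! illustrative size only
  real(rk), allocatable :: a(:)
  real(rk) :: t0, t1, checksum
  integer :: i

  allocate(a(n))
  a = 1.0_rk

  call cpu_time(t0)
  do i = 1, n
     a(i) = a(i) * 1.000001_rk
  end do
  call cpu_time(t1)

  checksum = sum(a)                            ! consume the result
  write (*,'(A,F8.3,A,ES14.6)') 'loop time: ', t1 - t0, ' s, checksum: ', checksum
end program check_loop_kept

In the real code, writing out something like sum(cv) once after the measured loop would serve the same purpose.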
If you would like an actual analysis, you'll need to provide enough so that the subroutine can be compiled.
Hi Steve,
Many thanks for your reply. I measured the CPU_TIME for the whole program and found the same issue. The attached codelet_intel.zip contains the whole code. We have simplified it heavily, so the workflow is straightforward:
1. Compile and run the “slow” version (with memory allocation PART 2)
login1$ make -f Makefile.intel
login1$ time ./bin/plascomcm
real    0m13.796s
user    0m12.816s
sys     0m0.475s
2. Comment out the 5 memory allocations of PART 2 in subroutine ARK2, which is contained in ModRungeKutta.fpp.
(Please note that we should modify ModRungeKutta.fpp, rather than ModRungeKutta.f90.)
login1$ cd src
login1$ vi ModRungeKutta.fpp
! -------------------------------------------------------------------------------------
! ... memory allocation PART 2 (adding these 5 memory allocations leads to significantly
! ... more data loads and CPU_TIME for the loop in the bottom!!!)
! --------------------------------------------------------------------------------------
!if (.not.allocated(a_mat_exp_g) .eqv. .true.) allocate(a_mat_exp_g(ARK2_nStages,ARK2_nStages))
!if (.not.allocated(a_mat_imp_g) .eqv. .true.) allocate(a_mat_imp_g(ARK2_nStages,ARK2_nStages))
!if (.not.allocated(b_vec_exp_g) .eqv. .true.) allocate(b_vec_exp_g(ARK2_nStages))
!if (.not.allocated(b_vec_imp_g) .eqv. .true.) allocate(b_vec_imp_g(ARK2_nStages))
!if (.not.allocated(c_vec_g) .eqv. .true.) allocate(c_vec_g(ARK2_nStages))
3. Compile and run again; you will find that it now runs much faster.
login1$ cd ..
login1$ make -f Makefile.intel
login1$ time ./bin/plascomcm
real    0m2.785s
user    0m2.188s
sys     0m0.476s
If you want to check the assembly, you can use
login1$ make -f Makefile.intel clean
login1$ make -f Makefile.assembly
Thanks again for helping me with this. I am glad to provide any further information.
I would appreciate it if you can keep this code private.
Best regards,
Wentao
I got the ZIP and removed it from the forum. I will look at this next week.
Yes, I got it. Thanks.
I can reproduce this. It isn't the memory allocations, per se, it's that there's just a bit of extra code in the compilation that, for some reason, prevents the vectorizer from fully vectorizing the loop in question. I find that I can enable all five allocates but remove the other routine in this source and the loop will fully vectorize. I have sent this on to the developers as issue DPD200254646.
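If it is useful while that issue is open, one workaround that sometimes lets the vectorizer look at a hot loop in isolation is to move it into its own small subroutine with explicit-shape dummy arguments. This is only a sketch, and I have not verified that it sidesteps the problem for this particular code; the subroutine name is made up, and the real kind here is a stand-in for the rfreal kind used in the project:

! Sketch of a possible workaround: put the hot loop in its own
! subroutine so the compiler analyzes it independently of the
! allocation and setup code around it.
subroutine update_cv(nCells, nCv, nStages, cv, dt, b_vec_exp, rhs_explicit)
  implicit none
  integer, parameter :: rk = kind(1.0d0)       ! stand-in for rfreal
  integer, intent(in) :: nCells, nCv, nStages
  real(rk), intent(inout) :: cv(nCells, nCv)
  real(rk), intent(in)    :: dt(nCells), b_vec_exp(nStages)
  real(rk), intent(in)    :: rhs_explicit(nCells, nCv, nStages)
  integer :: i, j, k

  do j = 1, nStages
    do k = 1, nCv
      !DIR$ SIMD
      do i = 1, nCells
        cv(i,k) = cv(i,k) + dt(i) * b_vec_exp(j) * rhs_explicit(i,k,j)
      end do
    end do
  end do
end subroutine update_cv

ARK2 would then just call update_cv(nCells, nCv, ARK2_nStages, cv, dt, b_vec_exp, rhs_explicit) in place of the nested loop.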
Steve Lionel (Intel) wrote:
I can reproduce this. It isn't the memory allocations, per se, it's that there's just a bit of extra code in the compilation that, for some reason, prevents the vectorizer from fully vectorizing the loop in question. I find that I can enable all five allocates but remove the other routine in this source and the loop will fully vectorize. I have sent this on to the developers as issue DPD200254646.
Thanks!
