Hi,
I found a very strange thing: adding more memory allocations leads to significantly more data load instructions and CPU_TIME.
I know this sounds weird, so I have posted the code below to illustrate my problem (I have tried my best to simplify it):
subroutine ARK2(region)

  USE ModGlobal
  USE ModDataStruct
  USE ModIO
  USE ModDerivBuildOps
  USE ModDeriv
  USE ModMetrics
  USE ModAdvection
  USE ModMPI

  Implicit None

  ! ... Incoming variables
  type(t_region), pointer :: region

  ! ... local variables
  integer :: rkStep, i, j, k, ng, ARK2_nStages, ImplicitFlag
  type(t_grid), pointer :: grid
  type(t_mixt), pointer :: state
  type(t_mixt_input), pointer :: input
  real(rfreal), pointer :: cv(:,:), dt(:), b_vec_exp(:), rhs_explicit(:,:,:)
  integer :: nGrids, nCells, nCv

  nGrids = region%nGrids

  if (rk_alloc .eqv. .true.) then
    do ng = 1, nGrids
      grid => region%grid(ng)
      input => grid%input
      call additive_RK_coeff(input, grid, ImplicitFlag)
    end do
    rk_alloc = .false.
    if (myrank == 0) write (*,'(A)') 'PlasComCM: ==> Using ARK2 time integration <=='
  end if

  grid => region%grid(1)
  state => region%state(1)
  input => grid%input
  ARK2_nStages = grid%ARK2_nStages

  ! ----------------------------
  ! ... memory allocation PART 1
  ! ----------------------------
  if (.not.allocated(state%rhs) .eqv. .true.) allocate(state%rhs(grid%nCells, input%nCv))
  if (.not.allocated(state%rhs_explicit) .eqv. .true.) allocate(state%rhs_explicit(grid%nCells, input%nCv, ARK2_nStages))
  if (.not.allocated(state%timeOld) .eqv. .true.) allocate(state%timeOld(grid%nCells))
  if (.not.allocated(state%cfl) .eqv. .true.) allocate(state%cfl(grid%nCells))
  if (.not.allocated(state%cvOld) .eqv. .true.) allocate(state%cvOld(grid%nCells,input%nCv))
  if (.not.allocated(state%dt) .eqv. .true.) allocate(state%dt(grid%nCells))
  if (.not.allocated(time_g) .eqv. .true.) allocate(time_g(grid%nCells))
  if (.not.allocated(timeOld_g) .eqv. .true.) allocate(timeOld_g(grid%nCells))
  if (.not.allocated(dt_g) .eqv. .true.) allocate(dt_g(grid%nCells))
  if (.not.allocated(rhs_explicit_g) .eqv. .true.) allocate(rhs_explicit_g(grid%nCells, input%nCv, ARK2_nStages))
  if (.not.allocated(state_rhs_g) .eqv. .true.) allocate(state_rhs_g(grid%nCells, input%nCv))
  if (.not.allocated(JAC_g) .eqv. .true.) allocate(JAC_g(grid%nCells))
  if (.not.allocated(cv_g) .eqv. .true.) allocate(cv_g(grid%nCells,input%nCv))
  if (.not.allocated(cvOld_g) .eqv. .true.) allocate(cvOld_g(grid%nCells,input%nCv))

  ! ----------------------------------------------------------------------------------
  ! ... memory allocation PART 2 (adding these 5 memory allocations leads to significantly
  ! ... more data load instructions and CPU_TIME for the loop in the bottom!!!)
  ! ----------------------------------------------------------------------------------
  if (.not.allocated(a_mat_exp_g) .eqv. .true.) allocate(a_mat_exp_g(ARK2_nStages,ARK2_nStages))
  if (.not.allocated(a_mat_imp_g) .eqv. .true.) allocate(a_mat_imp_g(ARK2_nStages,ARK2_nStages))
  if (.not.allocated(b_vec_exp_g) .eqv. .true.) allocate(b_vec_exp_g(ARK2_nStages))
  if (.not.allocated(b_vec_imp_g) .eqv. .true.) allocate(b_vec_imp_g(ARK2_nStages))
  if (.not.allocated(c_vec_g) .eqv. .true.) allocate(c_vec_g(ARK2_nStages))

  ! ... dereference pointers
  cv => state%cv
  dt => state%dt
  b_vec_exp => grid%ARK2_b_vec_exp
  rhs_explicit => state%rhs_explicit
  nCv = input%nCv
  nCells = grid%nCells

  ! ----------------------------------------------------------
  ! ... Adding memory allocation PART 2 leads to significantly
  ! ... more data load instructions and CPU_TIME for this loop!!!
  ! ----------------------------------------------------------
  do j = 1, ARK2_nStages
    do k = 1, nCv
      !DIR$ SIMD
      do i = 1, nCells
        cv(i,k) = cv(i,k) + dt(i) * b_vec_exp(j) * rhs_explicit(i,k,j)
      end do
    end do
  end do

end subroutine ARK2
As shown above, ARK2 contains two memory allocation parts and one loop. My findings are:
If I have only memory allocation PART 1 (without PART 2), the loop runs very fast with a small number of data load instructions;
If I have both memory allocation PART 1 and PART 2, the loop runs very slowly with a much larger number of data load instructions.
Using TAU and PAPI, I measured the CPU_TIME and memory behavior of the loop for these two cases; here are the results:
                  CPU_TIME (s)   Data load instructions   L1 cache hits   L2 cache hits   L3 cache hits   Main memory hits
Without PART 2    1.82           3.8E09                   87%             3%              10%             0%
With PART 2       13.24          5.6E10                   99%             1%              0%              0%
We can see that CPU_TIME increases more than 7 times and the number of data load instructions more than 10 times. From the cache usage results, it seems to me that PART 2 adds a lot of useless L1 cache hits. These results are very confusing to me, since PART 2 is just allocating some memory. How can it have such a huge influence on the performance of the loop? There seems to be a limit on the size of the memory that I can allocate before performance suffers.
I promise these results are repeatable. I compiled the code using ifort 13.1.0 with the -O2 and -xHost options plus SIMD directives, and I ran it on an Intel E5 (Sandy Bridge) processor with only one MPI process. I would truly appreciate any hints about what is going on here. Thanks for your time, help, and patience in reading this long story.
Best regards,
Wentao
Just to make things clearer:
(1) The very short loop near the top of ARK2 (the do ng = 1, nGrids setup loop) is negligible. I am profiling the triple-nested loop at the bottom of the subroutine.
(2) CPU_TIME increased more than 7 times (13.24 s / 1.82 s ≈ 7.3) and data load instructions increased more than 10 times (5.6E10 / 3.8E09 ≈ 14.7).
Thanks!
Best regards,
Wentao
Without being able to compile the code (lots of modules not present), it's almost impossible to speculate as to what is going on. The one thing I would suggest is that you verify the compiler did not evaporate the loop you're measuring in the "fast" case. It's not immediately evident that it can legally do so, but the rest of the program would be needed to see.
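For example, here is a minimal standalone sketch of that point (the program name, variable names, and array size are purely illustrative, not taken from the attached code): if nothing ever reads the updated array, an optimizer is free to drop the timed loop entirely, so a "fast" time can be meaningless. Summing and printing the result afterwards keeps the loop alive.

! Minimal sketch: if the result of a timed loop is never used, the
! compiler may delete the loop; consuming the result afterwards
! (here via sum + write) forces the work to actually be done.
program check_loop_kept
  implicit none
  integer, parameter :: rk = kind(1.0d0)
  integer, parameter :: n = 10000000           ! illustrative size only
  real(rk), allocatable :: a(:)
  real(rk) :: t0, t1, checksum
  integer :: i

  allocate(a(n))
  a = 1.0_rk

  call cpu_time(t0)
  do i = 1, n
     a(i) = a(i) * 1.000001_rk
  end do
  call cpu_time(t1)

  checksum = sum(a)                            ! consume the result
  write (*,'(A,F8.3,A,ES14.6)') 'loop time: ', t1 - t0, ' s, checksum: ', checksum
end program check_loop_kept

In the real code, writing out something like sum(cv) once after the measured loop would serve the same purpose.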
If you would like an actual analysis, you'll need to provide enough so that the subroutine can be compiled.
Hi Steve,
Many thanks for your reply. I measured the CPU_TIME for the whole program and found the same issue. The attached codelet_intel.zip contains the whole code. We have simplified it heavily, so the workflow is straightforward:
1. Compile and run the “slow” version (with memory allocation PART 2)
login1$ make -f Makefile.intel
login1$ time ./bin/plascomcm
real    0m13.796s
user    0m12.816s
sys     0m0.475s
2. Comment out the 5 memory allocations of PART 2 in subroutine ARK2, which is contained in ModRungeKutta.fpp.
(Please note that we should modify ModRungeKutta.fpp, rather than ModRungeKutta.f90.)
login1$ cd src
login1$ vi ModRungeKutta.fpp
! -------------------------------------------------------------------------------------
! ... memory allocation PART 2 (adding these 5 memory allocations leads to significantly
! ... more data loads and CPU_TIME for the loop in the bottom!!!)
! --------------------------------------------------------------------------------------
!if (.not.allocated(a_mat_exp_g) .eqv. .true.) allocate(a_mat_exp_g(ARK2_nStages,ARK2_nStages))
!if (.not.allocated(a_mat_imp_g) .eqv. .true.) allocate(a_mat_imp_g(ARK2_nStages,ARK2_nStages))
!if (.not.allocated(b_vec_exp_g) .eqv. .true.) allocate(b_vec_exp_g(ARK2_nStages))
!if (.not.allocated(b_vec_imp_g) .eqv. .true.) allocate(b_vec_imp_g(ARK2_nStages))
!if (.not.allocated(c_vec_g) .eqv. .true.) allocate(c_vec_g(ARK2_nStages))
3. Compile and run again; you will find that it now runs much faster.
login1$ cd ..
login1$ make -f Makefile.intel
login1$ time ./bin/plascomcm
real    0m2.785s
user    0m2.188s
sys     0m0.476s
If you want to check the assembly, you can use
login1$ make -f Makefile.intel clean
login1$ make -f Makefile.assembly
Thanks again for helping me with this. I am glad to provide any further information.
I would appreciate it if you can keep this code private.
Best regards,
Wentao
I got the ZIP and removed it from the forum. I will look at this next week.
Yes, I got it. Thanks.
I can reproduce this. It isn't the memory allocations, per se, it's that there's just a bit of extra code in the compilation that, for some reason, prevents the vectorizer from fully vectorizing the loop in question. I find that I can enable all five allocates but remove the other routine in this source and the loop will fully vectorize. I have sent this on to the developers as issue DPD200254646.
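If it is useful while that issue is open, one workaround that sometimes lets the vectorizer look at a hot loop in isolation is to move it into its own small subroutine with explicit-shape dummy arguments. This is only a sketch, and I have not verified that it sidesteps the problem for this particular code; the subroutine name is made up, and the real kind here is a stand-in for the rfreal kind used in the project:

! Sketch of a possible workaround: put the hot loop in its own
! subroutine so the compiler analyzes it independently of the
! allocation and setup code around it.
subroutine update_cv(nCells, nCv, nStages, cv, dt, b_vec_exp, rhs_explicit)
  implicit none
  integer, parameter :: rk = kind(1.0d0)       ! stand-in for rfreal
  integer, intent(in) :: nCells, nCv, nStages
  real(rk), intent(inout) :: cv(nCells, nCv)
  real(rk), intent(in)    :: dt(nCells), b_vec_exp(nStages)
  real(rk), intent(in)    :: rhs_explicit(nCells, nCv, nStages)
  integer :: i, j, k

  do j = 1, nStages
    do k = 1, nCv
      !DIR$ SIMD
      do i = 1, nCells
        cv(i,k) = cv(i,k) + dt(i) * b_vec_exp(j) * rhs_explicit(i,k,j)
      end do
    end do
  end do
end subroutine update_cv

ARK2 would then just call update_cv(nCells, nCv, ARK2_nStages, cv, dt, b_vec_exp, rhs_explicit) in place of the nested loop.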
Steve Lionel (Intel) wrote:
I can reproduce this. It isn't the memory allocations, per se, it's that there's just a bit of extra code in the compilation that, for some reason, prevents the vectorizer from fully vectorizing the loop in question. I find that I can enable all five allocates but remove the other routine in this source and the loop will fully vectorize. I have sent this on to the developers as issue DPD200254646.
Thanks!
