3-fold performance decrease

haas__anthony · ‎02-18-2018

Hello,

We are developing a large-scale parallel (MPI) computational fluid dynamics (CFD) code. In August 2017, for a particular test case, the code ran 12,000 iterations in 30 minutes on a specific supercomputer with a specific number of procs. Currently, for the same test case (on the same supercomputer and with the same number of procs), the code only runs about 4,000 iterations in 30 minutes. I still have the old version of the code, so I was able to reproduce this difference. I generally compile the code with the flags listed below. Since the code has not changed dramatically since last August, I am wondering whether, for some reason, the optimization is being dropped (maybe some subroutines are too long, or ...). I am now compiling the code with the -opt-report flag, but I wanted to know if any of you have advice on how to proceed with this issue.

Thanks,

Anthony

-i4 -r8 -132 -O3 -g -cpp -I$(paramesh_dir)/headers

Johannes_Rieke · ‎02-20-2018

Hi Anthony,

You should provide more information: at least which compiler version you used in August 2017 and which one you use now (e.g. PSXE 2018 Update 1). Kernel and Intel runtime library versions, etc. might also help (are the recent Spectre/Meltdown mitigations active?).

Why do you compile with full debug information (-g)? Also, does -r8 imply that you mix single and double precision within the code? That might prevent the optimizer from getting the optimum out of the code.

If you have access to VTune, you could monitor the hot spots. Scrutinizing the optimization reports might also help to identify differences.

In general, no detailed answer can be provided.

jimdempseyatthecove · ‎02-20-2018

Johannes,

Full debug information is required if you intend to VTune the optimized code.

Anthony,

You might want to ask the supercomputer center if anything has changed since your last run. There are many variables:

Number of CPUs
Number of cores
Number of threads/core
Cache size
Clock Rate
How mpirun is distributing the workload
What else may be running on the same cluster.
...

Jim Dempsey

Johannes_Rieke · ‎02-20-2018

@Jim, yes, for sure for VTune full debug makes sense. For production code it produces overhead and would make no sense for getting out the last quantum of speed, am I right?

I would run VTune on a single node (ideally on a workstation, where you have full control over compiler and workload) with the old code base and the new code base to narrow down the reason for the performance difference, if the code changes are the reason at all.

I totally agree, Jim. There are tons of variables that can influence the performance of a run on a supercomputer and, even more importantly, that you normally cannot influence.

Best regards, Johannes

haas__anthony · ‎02-20-2018

Hello,

I just ran the code with perftools-lite on Cray. I am attaching the log file for the new version of the code (detailed_report_newcode.log) and for the old version of the code (detailed_report.log). You will see that in detailed_report_newcode.log there is an __intel_memset entry that does not show up in detailed_report.log. Lines 355-356 of detailed_report_newcode.log give the following info:

bitcart/src/navier_stokes_newest/fill_guardcell_prims.F90 line.639

The difference between the new code and the old one is that somebody in our group added:

work=0.d0 in fill_guardcell_prims.F90

where work is an array: work(i,j,k,lb,1:nvars).

If I comment out work=0.d0, then I get the proper performance (no __intel_memset issue), whereas if I uncomment it, I get the 3-fold performance decrease. How can a simple initialization of an array have such a detrimental effect on performance?

Thanks,

Anthony

IanH · ‎02-20-2018

If 'work' is a large array, and that definition is happening many times, then you should expect a material performance hit. Is that definition required?
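To make the cost concrete, here is a minimal sketch. It is not the code from fill_guardcell_prims.F90: the extents (nx, ny, nz, nblocks, nvars) are made-up values, and the idea that only the 1:nvars entries of the current cell need clearing is an assumption about the intent. It contrasts an unqualified `work = 0` inside the cell loops, which clears the whole array each time, with clearing only the slice being used.

```fortran
program zero_cost_sketch
  use, intrinsic :: iso_fortran_env, only : rk => real64, int64
  implicit none

  ! Hypothetical extents, chosen only for illustration
  integer, parameter :: nx = 8, ny = 8, nz = 8, nblocks = 8, nvars = 5
  real(rk), allocatable :: work(:,:,:,:,:)
  integer :: i, j, k, lb
  integer(int64) :: t0, t1, t2, rate
  real(rk) :: sum_a, sum_b

  allocate(work(nx, ny, nz, nblocks, nvars))
  sum_a = 0.0_rk
  sum_b = 0.0_rk

  call system_clock(t0, rate)

  ! Variant A: clear the entire array for every cell visited
  do lb = 1, nblocks
    do k = 1, nz
      do j = 1, ny
        do i = 1, nx
          work = 0.0_rk
          work(i, j, k, lb, 1) = 1.0_rk
          sum_a = sum_a + sum(work(i, j, k, lb, 1:nvars))
        end do
      end do
    end do
  end do

  call system_clock(t1)

  ! Variant B: clear only the nvars entries of the current cell
  do lb = 1, nblocks
    do k = 1, nz
      do j = 1, ny
        do i = 1, nx
          work(i, j, k, lb, 1:nvars) = 0.0_rk
          work(i, j, k, lb, 1) = 1.0_rk
          sum_b = sum_b + sum(work(i, j, k, lb, 1:nvars))
        end do
      end do
    end do
  end do

  call system_clock(t2)

  write(*,'("whole-array zeroing: ",f12.6," s, checksum ",f10.1)') &
    real(t1 - t0, rk) / real(rate, rk), sum_a
  write(*,'("slice zeroing:       ",f12.6," s, checksum ",f10.1)') &
    real(t2 - t1, rk) / real(rate, rk), sum_b
end program zero_cost_sketch
```

Both variants should print the same checksum. Variant A writes the full array once per cell, so its cost grows with the square of the number of cells, while variant B grows linearly.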

jimdempseyatthecove · ‎02-21-2018

Johan,

The inclusion of the debug information has no impact on the performance of the production code. The image file is larger to include the "text" of the debug database. The executable code is the same.

Anthony,

Read IanH's response. Initializing the array may be a requirement to obtain correct results (i.e. you assume the initial values are 0.0d0). Only you can tell if this is a requirement. The initialization (work=0.0d0) will call __intel_memset. This zeroing may be a once-only requirement, or it may be needed on each iteration (you must determine this).

Additional information:

If your MPI application is also parallelized using OpenMP, you will want to affinitize your threads .AND. initialize the work array using the same OpenMP loop structure as your computation is ordered. This is so memory placement occurs preponderantly on the node of the thread using the array (via "first touch").
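A minimal sketch of that first-touch pattern follows. The array shape and the arithmetic are hypothetical placeholders, and it assumes an OS that places a page on the NUMA node of the thread that first writes it (the usual Linux default). The point is only that the initialization loop and the compute loop share the same loop structure and the same static schedule.

```fortran
program first_touch_sketch
  use, intrinsic :: iso_fortran_env, only : rk => real64
  implicit none

  ! Hypothetical extents, chosen only for illustration
  integer, parameter :: nx = 64, ny = 64, nz = 64
  real(rk), allocatable :: work(:,:,:)
  integer :: i, j, k

  allocate(work(nx, ny, nz))

  ! Initialization: each thread first touches the k-planes it will later use
  !$omp parallel do schedule(static) private(i, j)
  do k = 1, nz
    do j = 1, ny
      do i = 1, nx
        work(i, j, k) = 0.0_rk
      end do
    end do
  end do
  !$omp end parallel do

  ! Computation: same loop order and same static schedule, so each thread
  ! works on the k-planes it initialized above (placeholder arithmetic)
  !$omp parallel do schedule(static) private(i, j)
  do k = 1, nz
    do j = 1, ny
      do i = 1, nx
        work(i, j, k) = work(i, j, k) + real(i + j + k, rk)
      end do
    end do
  end do
  !$omp end parallel do

  write(*,*) sum(work)
end program first_touch_sketch
```

Without an OpenMP flag (-qopenmp for Intel Fortran, -fopenmp for gfortran) the directives are treated as comments and the program runs serially with the same result. Thread affinity itself is set outside the code, for example through the OMP_PROC_BIND and OMP_PLACES environment variables.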

I also suggest you run the application as 1 rank in a VTune session (let it run for a few minutes). Then in VTune view the hot spots (disassembly) to ensure the array references are preponderantly vector as opposed to scalar. If they are not, you may want to consider restructuring your array indexes.
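As a rough illustration of why index order matters, the sketch below fills the same array twice, with made-up extents. Fortran stores arrays in column-major order, so the leftmost index is contiguous in memory. The first loop nest runs that index innermost (unit stride, which compilers can usually vectorize); the second runs it outermost (large stride). This is a toy, not a model of the actual CFD kernels.

```fortran
program index_order_sketch
  use, intrinsic :: iso_fortran_env, only : rk => real64, int64
  implicit none

  ! Hypothetical extents, chosen only for illustration
  integer, parameter :: nx = 256, ny = 256, nz = 64
  real(rk), allocatable :: a(:,:,:)
  integer :: i, j, k
  integer(int64) :: t0, t1, t2, rate
  real(rk) :: s1, s2

  allocate(a(nx, ny, nz))

  call system_clock(t0, rate)

  ! Unit stride: leftmost index i varies fastest in the innermost loop
  do k = 1, nz
    do j = 1, ny
      do i = 1, nx
        a(i, j, k) = real(i, rk) + real(j, rk) + real(k, rk)
      end do
    end do
  end do
  s1 = sum(a)

  call system_clock(t1)

  ! Large stride: rightmost index k innermost, consecutive writes are far apart
  do i = 1, nx
    do j = 1, ny
      do k = 1, nz
        a(i, j, k) = real(i, rk) + real(j, rk) + real(k, rk)
      end do
    end do
  end do
  s2 = sum(a)

  call system_clock(t2)

  write(*,'("unit stride:  ",f12.6," s, sum ",es16.8)') &
    real(t1 - t0, rk) / real(rate, rk), s1
  write(*,'("large stride: ",f12.6," s, sum ",es16.8)') &
    real(t2 - t1, rk) / real(rate, rk), s2
end program index_order_sketch
```

Both sums should agree. An optimizing compiler may interchange the second loop nest on its own, so the timing gap can shrink at -O3; the optimization report or the VTune disassembly view shows what was actually generated.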

Jim Dempsey

Steve_Lionel · ‎02-21-2018

My recollection is that on Linux, if you ask for debugging, that disables optimization. It would be useful asking for a listing file (-list) and looking at the summary of options it presents.

jimdempseyatthecove · ‎02-22-2018

>>My recollection is that on Linux, if you ask for debugging, that disables optimization.

This depends on how you ask.

If you request Debug build (say from Eclipse) then this also defaults to no optimization.

If you request Release build, you can add the compiler option to emit debug information. This does not affect optimization.
Note, you may also be required to instruct the Linker to not strip debug information.

Jim Dempsey

Steve_Lionel · ‎02-22-2018

I was referring to command line. The documentation for -g says:

This option turns off option -O2 and makes option -O0 the default unless option -O2 (or higher) is explicitly specified in the same command line.

Command arguments might be processed left to right, so try putting -O3 after -g and see if that helps. (Lorri will probably chime in now to say that it doesn't matter in this case.)

In any event I would suggest trying without -g and see where that gets you. Sometimes debug information does subtly change generated code, filling in descriptors that are otherwise unneeded, etc.

Johannes_Rieke · ‎02-23-2018

I wrote a little example to test the influence of -O3 and -g and of their order. The order seems to have no impact (-g -O3 or -O3 -g). I interpret the documentation snippet from Steve's post to mean that the presence of -O3 overrides the -O0 set by -g.

Further, the presence of -g has no impact on run time, as Jim said before. At least for the dull test below, which might not be the best for checking this. I was wrong in my assumption about the performance influence of -g. Sorry for the confusion.

!
! test code for evaluating the impact of -g option
!
program debug_opt_test
  use, intrinsic :: iso_fortran_env, only : rk =>real64, int64
  implicit none
  
  integer               :: i, j, k
  integer, parameter    :: i_len = 100000, j_len = 100, k_len = 10
  integer(int64)        :: i_tic, i_toc, c_rate
  real(rk), allocatable :: dummy_array(:,:,:)
  real(rk)              :: result   ! plain scalar, does not need to be allocatable
  
  ! time
  CALL SYSTEM_CLOCK (i_tic, c_rate)

  
  allocate(dummy_array(i_len,j_len,k_len))
  
  dummy_array(1,1,1) = 1.0_rk
  do i = 1, i_len
    do j = 1, j_len
      do k = 1, k_len
        dummy_array(i,j,k) = dummy_array(1,1,1) + real(i,rk) + real(j,rk) + real(k,rk)
      end do
    end do
  end do
  
  result = sum(dummy_array(1:i_len,1,1) )
  write(*,*) result
  
  CALL SYSTEM_CLOCK (i_toc, c_rate)
  write(*,'("time in seconds:",1f16.8)') real((i_toc-i_tic),rk)/real(c_rate,rk)


end program debug_opt_test

haas__anthony · ‎02-23-2018

Dear All,

Thanks for all the comments. I also ran the code with the -g flag completely removed and did not see any performance improvement. I am now looking into VTune.

Thanks again,

Anthony