Hello,
We are developing a large-scale parallel (MPI) computational fluid dynamics (CFD) code. In August 2017, for a particular test case, the code ran about 12,000 iterations in 30 minutes on a specific supercomputer with a specific number of procs. Currently, for the same test case (same supercomputer, same number of procs), the code only runs about 4,000 iterations in 30 minutes. I still have the old version of the code, so I was able to reproduce this difference. I generally compile the code with the flags below. Since the code has not changed dramatically since last August, I am wondering whether for some reason the optimization of the code is being dropped (maybe some subroutines are too long or, ...). I am now compiling the code with the -opt-report flag, but I wanted to know if some of you had advice on how to proceed with this issue.
Thanks,
Anthony
-i4 -r8 -132 -O3 -g -cpp -I$(paramesh_dir)/headers
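Generating the optimization report with these flags might look like the following (a sketch: the source file name is a placeholder, $(paramesh_dir) is the Makefile variable from the flags above, and newer ifort releases spell the flag -qopt-report while older ones accept -opt-report):

```shell
# Placeholder source name; $(paramesh_dir) is the Makefile variable above.
# Newer ifort releases use -qopt-report[=N]; older ones accept -opt-report.
ifort -i4 -r8 -132 -O3 -g -cpp -I$(paramesh_dir)/headers \
      -qopt-report=5 -qopt-report-phase=loop,vec \
      -c fill_guardcell_prims.F90
```

The per-file .optrpt output then shows, loop by loop, what the compiler vectorized or skipped, which can be diffed between the old and new builds.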
Hi Anthony,
you should provide more information: at a minimum, which compiler versions you used in August 2017 and recently (e.g. PSXE 2018 Update 1). Kernel and Intel runtime library versions, etc. might also help (are the recent Spectre/Meltdown mitigations active?).
Why do you compile with full debug information (-g)? Also, -r8 implies that you might mix single and double precision within the code; that might prevent the optimizer from getting the most out of the code.
If you have access to VTune, you could monitor the hot spots. Scrutinizing the optimization reports might also help identify differences.
Without more information, no detailed answer can be provided.
Johannes,
Full debug information is required if you intend to profile the optimized code with VTune.
Anthony,
You might want to ask the supercomputer center if anything has changed since your last run. There are many variables:
Number of CPUs
Number of cores
Number of threads/core
Cache size
Clock Rate
How mpirun is distributing the workload
What else may be running on the same cluster.
...
Jim Dempsey
@Jim, yes, for VTune full debug information certainly makes sense. But for production code it adds overhead and would make no sense when trying to squeeze out the last quantum of speed, am I right?
I would run VTune on a single node (ideally on a workstation, where you have full control over compiler and workload) with the old code base and the new code base, to narrow down the reason for the performance difference, if the code changes are the reason at all.
I totally agree, Jim. There are tons of variables that can influence the performance of a run on a supercomputer and, even more important, that you normally cannot influence.
Best regards, Johannes
Hello,
I just ran the code with perftools-lite on the Cray. I am attaching the log file for the new version of the code (detailed_report_newcode.log) and for the old version of the code (detailed_report.log). You will see that in detailed_report_newcode.log there is an __intel_memset that does not show up in detailed_report.log. Now look at lines 355-356 of detailed_report_newcode.log; they give the following info:
bitcart/src/navier_stokes_newest/fill_guardcell_prims.F90 line.639
The difference between the new code and the old one is that somebody in our group added:
work=0.d0 in fill_guardcell_prims.F90
where work is an array: work(i,j,k,lb,1:nvars).
If I comment out work=0.d0, I get the proper performance back (no __intel_memset issue), whereas if I uncomment it, I get the three-fold performance decrease. How can a simple initialization of an array have such a detrimental effect on performance?
Thanks,
Anthony
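A minimal sketch of what is happening (the bounds below are invented for illustration; the real array is work(i,j,k,lb,1:nvars)):

```fortran
! Sketch with made-up sizes: a whole-array assignment touches every
! element of the 5-D array, not just the block currently being filled.
program memset_sketch
  implicit none
  integer, parameter :: nxb = 16, nyb = 16, nzb = 16
  integer, parameter :: maxblocks = 512, nvars = 5
  real(8), allocatable :: work(:,:,:,:,:)
  integer :: lb

  allocate(work(nxb, nyb, nzb, maxblocks, nvars))
  lb = 1

  ! The compiler lowers this to a memset (__intel_memset) over ALL
  ! nxb*nyb*nzb*maxblocks*nvars elements -- expensive if it executes
  ! on every call of the guard-cell routine.
  work = 0.d0

  ! If only the current block actually needs clearing, zeroing just
  ! that slice touches a tiny fraction of the memory:
  work(:,:,:,lb,:) = 0.d0
end program memset_sketch
```

Whether the narrow form is legal depends on which parts of work the routine really reads before writing, which only the code's authors can decide.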
If 'work' is a large array, and that definition is happening many times, then you should expect a material performance hit. Is that definition required?
Johannes,
The inclusion of debug information has no impact on the performance of the production code. The image file is larger because it includes the "text" of the debug database, but the executable code is the same.
Anthony,
Read IanH's response. Initializing the array may be a requirement for obtaining correct results (i.e. you assume initial values are 0.0d0); only you can tell if this is a requirement. The initialization (work=0.0d0) will call __intel_memset. This zeroing may be a once-only requirement, or it may be needed on each iteration (you must determine which).
Additional information:
If your MPI application is also parallelized using OpenMP, you will want to affinitize your threads .AND. initialize the work array using the same OpenMP loop structure as your computation is ordered. This is so memory placement occurs preponderantly on the node of the thread using the array (via "first touch").
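The first-touch pattern described above can be sketched as follows (a minimal sketch assuming a 1-D work array and OpenMP; the array shape and loop bounds are illustrative, not taken from the actual code):

```fortran
program first_touch_sketch
  implicit none
  integer, parameter :: n = 1000000
  real(8), allocatable :: work(:)
  integer :: i

  allocate(work(n))   ! allocation alone maps no physical pages yet

  ! Initialize with the SAME parallel decomposition as the compute
  ! loop, so each page is first touched -- and therefore physically
  ! placed -- on the NUMA node of the thread that will later use it.
  !$omp parallel do schedule(static)
  do i = 1, n
     work(i) = 0.d0
  end do
  !$omp end parallel do

  ! Compute loop with the identical decomposition: each thread mostly
  ! accesses memory local to its own node.
  !$omp parallel do schedule(static)
  do i = 1, n
     work(i) = work(i) + real(i, 8)
  end do
  !$omp end parallel do

  write(*,*) work(n)
end program first_touch_sketch
```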
I also suggest you run the application as 1 rank in a VTune session (let it run for a few minutes). Then, in VTune, view the hot spots (disassembly) to verify that the array references are predominantly vector rather than scalar. If they are not, you may want to consider restructuring your array indexes.
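A single-rank VTune run might look like this (the binary name is hypothetical, and older VTune releases use the amplxe-cl command instead of vtune):

```shell
# Profile one MPI rank for hotspots; ./cfd_solver is a placeholder name.
vtune -collect hotspots -result-dir vtune_1rank -- mpirun -n 1 ./cfd_solver

# Then inspect the top functions from the command line:
vtune -report hotspots -result-dir vtune_1rank
```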
Jim Dempsey
My recollection is that on Linux, if you ask for debugging, that disables optimization. It would be useful to ask for a listing file (-list) and look at the summary of options it presents.
>>My recollection is that on Linux, if you ask for debugging, that disables optimization.
This depends on how you ask.
If you request Debug build (say from Eclipse) then this also defaults to no optimization.
If you request Release build, you can add the compiler option to emit debug information. This does not affect optimization.
Note, you may also be required to instruct the Linker to not strip debug information.
Jim Dempsey
I was referring to command line. The documentation for -g says:
This option turns off option -O2 and makes option -O0 the default unless option -O2 (or higher) is explicitly specified in the same command line.
Command arguments might be processed left to right, so try putting -O3 after -g and see if that helps. (Lorri will probably chime in now to say that it doesn't matter in this case.)
In any event, I would suggest trying without -g and seeing where that gets you. Sometimes debug information does subtly change the generated code, filling in descriptors that are otherwise unneeded, etc.
I wrote a little example to test the flag sequence and the influence of -O3 and -g. The sequence seems to have no impact (-g -O3 or -O3 -g). I interpret the documentation snippet from Steve's post to mean that the presence of -O3 overrides the -O0 set by -g.
Further, the presence of -g has no impact on run time, as Jim said before, at least for the dull test below, which might not be the best way to check this. I was wrong in my assumption about the performance influence of -g. Sorry for the confusion.
!
! test code for evaluating the impact of -g option
!
program debug_opt_test
   use, intrinsic :: iso_fortran_env, only : rk => real64, int64
   implicit none
   integer :: i, j, k
   integer, parameter :: i_len = 100000, j_len = 100, k_len = 10
   integer(int64) :: i_tic, i_toc, c_rate
   real(rk), allocatable :: dummy_array(:,:,:), result

   ! time
   call system_clock(i_tic, c_rate)
   allocate(dummy_array(i_len, j_len, k_len))
   dummy_array(1,1,1) = 1.0_rk
   do i = 1, i_len
      do j = 1, j_len
         do k = 1, k_len
            dummy_array(i,j,k) = dummy_array(1,1,1) + real(i,rk) + real(j,rk) + real(k,rk)
         end do
      end do
   end do
   result = sum(dummy_array(1:i_len,1,1))
   write(*,*) result
   call system_clock(i_toc, c_rate)
   write(*,'("time in seconds:",1f16.8)') real((i_toc-i_tic),rk)/real(c_rate,rk)
end program debug_opt_test
Dear All,
Thanks for all the comments. I also ran the code with the -g flag completely removed and did not see any performance improvement. I am now looking into VTune.
Thanks again,
Anthony