Hi,
I have two OpenMP loops that ifort (Version 16.0.1.150 Build 20151021) refuses to parallelize. I've added "-qopenmp -opt-report-file=stdout -opt-report-phase:openmp " to my compile/link commands and see this in the output:
Begin optimization report for: MODULE_VARSOLVER_QG::VAR_SOLVER
Report from: OpenMP optimizations [openmp]
Optimization 'PRE' reduced: function size or variable count limit exceeded: use -override_limits / -Qoverride_limits to override
===========================================================================
I am not sure what that is telling me. Is it saying that the function/subroutine calls inside the loop are too big for the compiler to generate threaded code? If I add the "-override_limits" option, which does not appear to be documented in the man page or anywhere online, the error goes away, but I also do not get an OpenMP optimization report for those loops. I have never previously encountered a limit on the amount of code one can put inside a threaded region, and I have seen functioning OpenMP loops containing huge amounts of code.
I have included one of the loops for reference, even though it's probably too complicated, because I haven't yet reduced it to a small reproducer.
!$OMP PARALLEL DO SHARED (bkg_len,bht_ino,bkg_cov,brh_cov,bkg_vec,anl_vec,tlm_vec,new_vec,tim_len,alph,mthd,B,Q,bkg_config) DEFAULT(PRIVATE)
do t=tim_len,1,-1
   tim_bkc(:,:)=bkg_cov(t,:,:)
   tim_hrh(:,:)=brh_cov(t,:,:)
   tim_htr(:)=bht_ino(t,:,1)
   tim_bkv(:)=bkg_vec(t,:)
   if(t.eq.tim_len) then
      mdl_vec(:)=tlm_vec(t,:)
      if(mthd > 3) mdl_vec(:)=0.5*(tlm_vec(t,:)+anl_vec(t,:))
   else
      if (mthd.eq.3) mdl_vec(:)=tlm_vec(t+1,:)
      if (mthd > 3) mdl_vec(:)=anl_vec(t+1,:)
      model = qg_model_type(bkg_config, state_vector=mdl_vec(:), step=t)
      call model%adv_nsteps(1)
      model_ADJ = qg_adj_type(bkg_config, state_vector=model%get_state_vector(), &
                              trajectory_vector=model%get_state_vector() - mdl_vec(:), step = t + 1)
      call model_ADJ%adv_nsteps(4)
      mdl_vec(:) = model_ADJ%get_state_vector()
   end if
   if(mthd < 4) ges_vec(:)=mdl_vec(:)
   if(mthd > 3) ges_vec(:)=0.5*(tlm_vec(t,:)+mdl_vec(:))
   tmp_vec(:)=(ges_vec(:)-tim_bkv(:))*(B+Q)
   dif_vec(:)=(ges_vec(:)-tim_bkv(:))*(B+Q)
   call pre_con_dif(tim_bkc,tmp_vec)
   pre_dif(:)=tmp_vec(:)
   call dgemv("N", bkg_len, bkg_len, 1.d0, tim_hrh, bkg_len, dif_vec, 1, 0.d0, tmp_vec, 1)
   tmp_vec(:)=pre_dif(:)+tmp_vec(:)-tim_htr(:)
   call dgemv("N", bkg_len, bkg_len, 1.d0, tim_bkc, bkg_len, tmp_vec, 1, 0.d0, grd_jvc, 1)
   new_vec(t,:)=ges_vec(:)-grd_jvc(:)*alph
   if(mthd < 4) tlm_vec(t,:)=new_vec(t,:)
end do
!$OMP END PARALLEL DO
You should be able to do something like this:
subroutine yourSubroutine(yourArgs)
   ...
   !$OMP PARALLEL DO SHARED
   do t=tim_len,1,-1
      call containedSubroutine(t)
   end do
   !$OMP END PARALLEL DO
   ...
contains
   subroutine containedSubroutine(t)
      implicit none
      integer :: t
      ! declare all private entities here
      real, allocatable :: tim_bkc(:,:), tim_hrh(:,:), tim_htr(:), tim_bkv(:)
      ...
      ! leave shared variables undefined here
      ...
      tim_bkc(:,:)=bkg_cov(t,:,:)
      tim_hrh(:,:)=brh_cov(t,:,:)
      tim_htr(:)=bht_ino(t,:,1)
      tim_bkv(:)=bkg_vec(t,:)
      if(t.eq.tim_len) then
         mdl_vec(:)=tlm_vec(t,:)
         if(mthd > 3) mdl_vec(:)=0.5*(tlm_vec(t,:)+anl_vec(t,:))
      else
         if (mthd.eq.3) mdl_vec(:)=tlm_vec(t+1,:)
         if (mthd > 3) mdl_vec(:)=anl_vec(t+1,:)
         model = qg_model_type(bkg_config, state_vector=mdl_vec(:), step=t)
         call model%adv_nsteps(1)
         model_ADJ = qg_adj_type(bkg_config, state_vector=model%get_state_vector(), &
                                 trajectory_vector=model%get_state_vector() - mdl_vec(:), step = t + 1)
         call model_ADJ%adv_nsteps(4)
         mdl_vec(:) = model_ADJ%get_state_vector()
      end if
      if(mthd < 4) ges_vec(:)=mdl_vec(:)
      if(mthd > 3) ges_vec(:)=0.5*(tlm_vec(t,:)+mdl_vec(:))
      tmp_vec(:)=(ges_vec(:)-tim_bkv(:))*(B+Q)
      dif_vec(:)=(ges_vec(:)-tim_bkv(:))*(B+Q)
      call pre_con_dif(tim_bkc,tmp_vec)
      pre_dif(:)=tmp_vec(:)
      call dgemv("N", bkg_len, bkg_len, 1.d0, tim_hrh, bkg_len, dif_vec, 1, 0.d0, tmp_vec, 1)
      tmp_vec(:)=pre_dif(:)+tmp_vec(:)-tim_htr(:)
      call dgemv("N", bkg_len, bkg_len, 1.d0, tim_bkc, bkg_len, tmp_vec, 1, 0.d0, grd_jvc, 1)
      new_vec(t,:)=ges_vec(:)-grd_jvc(:)*alph
      if(mthd < 4) tlm_vec(t,:)=new_vec(t,:)
   end subroutine containedSubroutine
end subroutine yourSubroutine
*** The above matches how you currently have your arrays laid out.
I strongly suggest that you consider rearranging your arrays so that the partitioning index t becomes the last index of every array partitioned by t.
By doing so, in many places you can avoid copying in from the t'th slice of an array on entry (an expensive gather operation)
and avoid copying back out to the t'th slice before exit (an expensive scatter operation).
That data copying is likely avoidable entirely.
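As a minimal sketch of the layout change (array names borrowed from the thread, sizes invented for illustration): Fortran is column-major, so with t as the last index each t-slice is one contiguous block, and the copy-in becomes a unit-stride block copy instead of a gather.

```fortran
program layout_sketch
  implicit none
  integer, parameter :: bkg_len = 8, tim_len = 4
  ! t-first layout: bkg_cov_old(t,:,:) is a strided slice (gather on copy-in)
  real :: bkg_cov_old(tim_len, bkg_len, bkg_len)
  ! t-last layout: bkg_cov_new(:,:,t) is one contiguous block of memory
  real :: bkg_cov_new(bkg_len, bkg_len, tim_len)
  real :: tim_bkc(bkg_len, bkg_len)
  integer :: t

  bkg_cov_old = 1.0
  bkg_cov_new = 1.0
  do t = tim_len, 1, -1
     tim_bkc(:,:) = bkg_cov_old(t,:,:)   ! elements are tim_len apart in memory
     tim_bkc(:,:) = bkg_cov_new(:,:,t)   ! unit stride: a simple block copy
  end do
  print *, sum(tim_bkc)
end program layout_sketch
```

With the t-last layout, the private copy can often be dropped altogether and the contiguous slice passed directly to the worker routine.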
Jim Dempsey
I've had cases in the past where Intel OpenMP apparently failed due to too many private variables. I suspect the limit may depend on the target ISA.
If you're curious, you might check whether the suggestion to try -override_limits is associated with register spills. The "cure" may not improve things.
Interesting. I think it must have something to do with qg_model_type and qg_adj_type. Below is a loop that is identical to the one in my original post except that l96_model_type and l96_adj_type are used instead of qg_model_type and qg_adj_type. The qg types contain much more data than the l96 types, and their subroutines are also much bigger. But the compiler has no problem with the l96 version shown below, and for that one I get:
Begin optimization report for: MODULE_VARSOLVER_L96::VAR_SOLVER
Report from: OpenMP optimizations [openmp]
module_varsolver_l96.f90(597:7-597:7):OMP:module_varsolver_l96_mp_var_solver_: OpenMP DEFINED LOOP WAS PARALLELIZED
!$OMP PARALLEL DO SHARED (bkg_len,bht_ino,bkg_cov,brh_cov,bkg_vec,anl_vec,tlm_vec,new_vec,tim_len,alph,mthd,B,Q,bkg_config) DEFAULT(PRIVATE)
do t=tim_len,1,-1
   tim_bkc(:,:)=bkg_cov(t,:,:)
   tim_hrh(:,:)=brh_cov(t,:,:)
   tim_htr(:)=bht_ino(t,:,1)
   tim_bkv(:)=bkg_vec(t,:)
   if(t.eq.tim_len) then
      mdl_vec(:)=tlm_vec(t,:)
      if(mthd.eq.4) mdl_vec(:)=0.5*(tlm_vec(t,:)+anl_vec(t,:))
   else
      if (mthd.eq.3) mdl_vec(:)=tlm_vec(t+1,:)
      if (mthd.eq.4) mdl_vec(:)=anl_vec(t+1,:)
      model = l96_model_type(bkg_config, state=mdl_vec(:), step=t)
      call model%adv_nsteps(1)
      model_ADJ = l96_adj_type(bkg_config, state=model%get_state(), &
                               trajectory=model%get_state() - mdl_vec(:), step = t + 1)
      call model_ADJ%adv_nsteps(11)
      mdl_vec(:) = model_ADJ%get_state()
   end if
   if(mthd < 4) ges_vec(:)=mdl_vec(:)
   if(mthd > 3) ges_vec(:)=0.5*(tlm_vec(t,:)+mdl_vec(:))
   tmp_vec(:)=(ges_vec(:)-tim_bkv(:))*(B+Q)
   dif_vec(:)=(ges_vec(:)-tim_bkv(:))*(B+Q)
   call pre_con_dif(tim_bkc,tmp_vec)
   pre_dif(:)=tmp_vec(:)
   call dgemv("N", bkg_len, bkg_len, 1.d0, tim_hrh, bkg_len, dif_vec, 1, 0.d0, tmp_vec, 1)
   tmp_vec(:)=pre_dif(:)+tmp_vec(:)-tim_htr(:)
   call dgemv("N", bkg_len, bkg_len, 1.d0, tim_bkc, bkg_len, tmp_vec, 1, 0.d0, grd_jvc, 1)
   new_vec(t,:)=ges_vec(:)-grd_jvc(:)*alph
   if(mthd > 4) tlm_vec(t,:)=new_vec(t,:)
end do
!$OMP END PARALLEL DO
Virtually all CPUs you will run this code on have SIMD instructions. Write your code to take advantage of them.
if(mthd > 3) mdl_vec(:)=0.5*(tlm_vec(t,:)+anl_vec(t,:))
will not vectorize (or only by way of an expensive gather).
if(mthd > 3) mdl_vec(:)=0.5*(tlm_vec(:,t)+anl_vec(:,t)) ! requires swapping array indexes
will vectorize.
The same applies to all the other array slices you have.
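A hypothetical self-contained illustration of the two forms (array names and sizes invented for the sketch): the first statement accesses a strided row, the second a contiguous column that maps directly onto packed SIMD loads and stores.

```fortran
program simd_sketch
  implicit none
  integer, parameter :: n = 1024, tim_len = 8
  real :: tlm_vec_tfirst(tim_len, n), anl_vec_tfirst(tim_len, n)
  real :: tlm_vec_tlast(n, tim_len), anl_vec_tlast(n, tim_len)
  real :: mdl_vec(n)
  integer :: t

  tlm_vec_tfirst = 1.0; anl_vec_tfirst = 2.0
  tlm_vec_tlast  = 1.0; anl_vec_tlast  = 2.0
  t = 1

  ! Strided: row elements are tim_len apart, so SIMD needs a gather.
  mdl_vec(:) = 0.5*(tlm_vec_tfirst(t,:) + anl_vec_tfirst(t,:))

  ! Unit stride: column t is contiguous, so this vectorizes cleanly.
  mdl_vec(:) = 0.5*(tlm_vec_tlast(:,t) + anl_vec_tlast(:,t))

  print *, mdl_vec(1)
end program simd_sketch
```

Compiling with the vectorization opt-report enabled should confirm that only the second form vectorizes with unit stride.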
Jim Dempsey
Thank you for pointing out the bad indexing performance issue. A colleague pointed that out to me a while ago, and I got busy with other things and forgot to fix it. How embarrassing that I've put it online for everyone to see! I'm working on reordering things and using your suggestion of a subroutine for the contents of the loops. Thank you for such a quick response and suggestion.
It is easy for a new Fortran programmer to be caught out by the indexing, because C/C++/C# order their indexes the other way around. C is that way because it has no array descriptors and no true multi-dimensional arrays; it uses an array of pointers to data, or an array of pointers to the next index level, and so on.
In your example, changing the indexing could potentially improve the efficiency with which your code uses the CPU's cache system by 8x to 16x. Any "pain" in coding effort has a relatively high payback in gains.
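The general rule can be sketched with a generic loop nest (not the poster's code): in Fortran the leftmost index should vary fastest, i.e. be the innermost loop.

```fortran
program cache_sketch
  implicit none
  integer, parameter :: n = 512
  real :: a(n, n)
  integer :: i, j

  ! Fortran is column-major: a(i,j) and a(i+1,j) are adjacent in memory.
  ! With j outer and i inner, the code walks memory sequentially, so
  ! each cache line fetched is fully used before it is evicted.
  do j = 1, n
     do i = 1, n
        a(i, j) = real(i + j)
     end do
  end do
  print *, a(n, n)
end program cache_sketch
```

Swapping the two loops (i outer, j inner) touches a different cache line on every iteration, which is where the large efficiency difference comes from.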
Jim Dempsey
I want to follow up and thank you for your prompt assistance. I was successful in creating subroutines, as you directed, to contain the contents of each of my parallel loops. The compiler was then able to parallelize those loops and timings verify that it is running well in parallel. The indexing was also corrected. I have no further questions on this issue. Thanks again for your help.