Solved: Refuses to thread OpenMP loop

Christopher_H_1 · ‎11-03-2017

Hi,

I have two OpenMP loops that ifort (Version 16.0.1.150 Build 20151021) refuses to parallelize. I've added "-qopenmp -opt-report-file=stdout -opt-report-phase:openmp " to my compile/link commands and see this in the output:

Begin optimization report for: MODULE_VARSOLVER_QG::VAR_SOLVER

Report from: OpenMP optimizations [openmp]

Optimization 'PRE' reduced: function size or variable count limit exceeded: use -override_limits / -Qoverride_limits to override
===========================================================================

I am not sure what that is telling me. Is it saying that the function/subroutine calls inside the loop are too big for the compiler to generate threaded code? If I add the " -override_limits" option, which does not appear to be documented in the man page or anywhere online, the error goes away, but I also do not get an OpenMP optimization report for those loops. I have previously never encountered any limits to the "stuff" one can put inside a threaded region before that would cause this, and have seen functioning OpenMP loops with huge amounts of stuff in them.

I have included one of the the loops for reference even though it's probably too complicated because I haven't yet reduced it to a small "reproducer".

!$OMP PARALLEL DO SHARED (bkg_len,bht_ino,bkg_cov,brh_cov,bkg_vec,anl_vec,tlm_vec,new_vec,tim_len,alph,mthd,B,Q,bkg_config) DEFAULT(PRIVATE)
       do t=tim_len,1,-1
          tim_bkc(:,:)=bkg_cov(t,:,:)
          tim_hrh(:,:)=brh_cov(t,:,:)
          tim_htr(:)=bht_ino(t,:,1)
          tim_bkv(:)=bkg_vec(t,:)
          if(t.eq.tim_len) then
             mdl_vec(:)=tlm_vec(t,:)
             if(mthd > 3) mdl_vec(:)=0.5*(tlm_vec(t,:)+anl_vec(t,:))
          else
             if (mthd.eq.3) mdl_vec(:)=tlm_vec(t+1,:)
             if (mthd > 3) mdl_vec(:)=anl_vec(t+1,:)
             model = qg_model_type(bkg_config, state_vector=mdl_vec(:), step=t)
             call model%adv_nsteps(1)
             model_ADJ = qg_adj_type(bkg_config, state_vector=model%get_state_vector(), trajectory_vector=model%get_state_vector() - mdl_vec(:), step = t + 1)
             call model_ADJ%adv_nsteps(4)
             mdl_vec(:) = model_ADJ%get_state_vector()
          end if
          if(mthd < 4) ges_vec(:)=mdl_vec(:)
          if(mthd > 3) ges_vec(:)=0.5*(tlm_vec(t,:)+mdl_vec(:))
          tmp_vec(:)=(ges_vec(:)-tim_bkv(:))*(B+Q)
          dif_vec(:)=(ges_vec(:)-tim_bkv(:))*(B+Q)

          call pre_con_dif(tim_bkc,tmp_vec)
          pre_dif(:)=tmp_vec(:)

          call dgemv("N", bkg_len, bkg_len, 1.d0, tim_hrh, bkg_len, dif_vec, 1, 0.d0, tmp_vec, 1)

          tmp_vec(:)=pre_dif(:)+tmp_vec(:)-tim_htr(:)

          call dgemv("N", bkg_len, bkg_len, 1.d0, tim_bkc, bkg_len, tmp_vec, 1, 0.d0, grd_jvc, 1)
          new_vec(t,:)=ges_vec(:)-grd_jvc(:)*alph

          if(mthd < 4) tlm_vec(t,:)=new_vec(t,:)
       end do
!$OMP END PARALLEL DO

jimdempseyatthecove · ‎11-03-2017

You should be able to do something like this:

subroutine yourSubroutine(yourArgs)
  ...
!$OMP PARALLEL DO SHARED
  do t=tim_len,1,-1
    call containedSubroutine(t)
  end do
  ...
!$OMP END PARALLEL DO                                                                                          
  contains
  subroutine containedSubroutine(t)
    implicit none
    integer :: t
    ! declare all private entities here
    real, allocatable :: tim_bkc(:,:), tim_hrh(:,:), tim_htr(:), tim_bkv(:)
    ...
    ! leave shared variables undefined here
    ...

    tim_bkc(:,:)=bkg_cov(t,:,:)
    tim_hrh(:,:)=brh_cov(t,:,:)
    tim_htr(:)=bht_ino(t,:,1)
    tim_bkv(:)=bkg_vec(t,:)
    if(t.eq.tim_len) then
      mdl_vec(:)=tlm_vec(t,:)
      if(mthd > 3) mdl_vec(:)=0.5*(tlm_vec(t,:)+anl_vec(t,:))
    else
      if (mthd.eq.3) mdl_vec(:)=tlm_vec(t+1,:)
      if (mthd > 3) mdl_vec(:)=anl_vec(t+1,:)
      model = qg_model_type(bkg_config, state_vector=mdl_vec(:), step=t)
      call model%adv_nsteps(1)
      model_ADJ = qg_adj_type(bkg_config, state_vector=model%get_state_vector(), trajectory_vector=model%get_state_vector() - mdl_vec(:), step = t + 1)
      call model_ADJ%adv_nsteps(4)
      mdl_vec(:) = model_ADJ%get_state_vector()
    end if
    if(mthd < 4) ges_vec(:)=mdl_vec(:)
    if(mthd > 3) ges_vec(:)=0.5*(tlm_vec(t,:)+mdl_vec(:))
    tmp_vec(:)=(ges_vec(:)-tim_bkv(:))*(B+Q)
    dif_vec(:)=(ges_vec(:)-tim_bkv(:))*(B+Q)

    call pre_con_dif(tim_bkc,tmp_vec)
    pre_dif(:)=tmp_vec(:)

    call dgemv("N", bkg_len, bkg_len, 1.d0, tim_hrh, bkg_len, dif_vec, 1, 0.d0, tmp_vec, 1)

    tmp_vec(:)=pre_dif(:)+tmp_vec(:)-tim_htr(:)

    call dgemv("N", bkg_len, bkg_len, 1.d0, tim_bkc, bkg_len, tmp_vec, 1, 0.d0, grd_jvc, 1)
    new_vec(t,:)=ges_vec(:)-grd_jvc(:)*alph

    if(mthd < 4) tlm_vec(t,:)=new_vec(t,:)
   end subroutine containedSubroutine
end subroutine yourSubroutine

*** the above is in line with how you have your arrays layed out.

I strongly suggest that you consider rearranging the partitioning index t, to be the last index of your arrays that are partitioned by t.

By doing so, in many places you can avoid upon entry the copying from the t'th slice of an array through use of an expensive gather operation
and then prior to exit, avoid copying to the t'th slice of an array through use of an expensive scatter operation.
The copying of the data is likely avoidable.

Jim Dempsey

View solution in original post

TimP · ‎11-03-2017

I've had cases in the past where Intel OpenMP failed due apparently to too many private variables. I might suspect the limit may depend on target mode ISA.

If you're curious, you might check whether the suggestion to try override_limits is associated with spills. The "cure" may not improve things.

Christopher_H_1 · ‎11-03-2017

Interesting. I think it must have something to do with qg_model_type and qg_adj_type. Below is a loop that is identical to the one in my original post except that the l96_model_type and l96_adj_type is being used instead of qg_model_type and qg_adj_type. The latter types contain much more data than the former ones. And the subroutines for the latter are also much bigger than the former ones. But, the compiler has no problem with the l96 versions as shown below. And with that one, I get:

Begin optimization report for: MODULE_VARSOLVER_L96::VAR_SOLVER

Report from: OpenMP optimizations [openmp]

module_varsolver_l96.f90(597:7-597:7):OMP:module_varsolver_l96_mp_var_solver_: OpenMP DEFINED LOOP WAS PARALLELIZED

!$OMP PARALLEL DO SHARED (bkg_len,bht_ino,bkg_cov,brh_cov,bkg_vec,anl_vec,tlm_vec,new_vec,tim_len,alph,mthd,B,Q,bkg_config) DEFAULT(PRIVATE)
       do t=tim_len,1,-1
          tim_bkc(:,:)=bkg_cov(t,:,:)
          tim_hrh(:,:)=brh_cov(t,:,:)
          tim_htr(:)=bht_ino(t,:,1)
          tim_bkv(:)=bkg_vec(t,:)         
          if(t.eq.tim_len) then 
             mdl_vec(:)=tlm_vec(t,:)
             if(mthd.eq.4) mdl_vec(:)=0.5*(tlm_vec(t,:)+anl_vec(t,:))
          else 			
             if (mthd.eq.3) mdl_vec(:)=tlm_vec(t+1,:)
             if (mthd.eq.4) mdl_vec(:)=anl_vec(t+1,:)
             model = l96_model_type(bkg_config, state=mdl_vec(:), step=t)
             call model%adv_nsteps(1)
             model_ADJ = l96_adj_type(bkg_config, state=model%get_state(), trajectory=model%get_state() - mdl_vec(:), step = t + 1)
             call model_ADJ%adv_nsteps(11)
             mdl_vec(:) = model_ADJ%get_state()
          end if
          if(mthd < 4) ges_vec(:)=mdl_vec(:)
          if(mthd > 3) ges_vec(:)=0.5*(tlm_vec(t,:)+mdl_vec(:))
          tmp_vec(:)=(ges_vec(:)-tim_bkv(:))*(B+Q)
          dif_vec(:)=(ges_vec(:)-tim_bkv(:))*(B+Q)

          call pre_con_dif(tim_bkc,tmp_vec)
          pre_dif(:)=tmp_vec(:)

          call dgemv("N", bkg_len, bkg_len, 1.d0, tim_hrh, bkg_len, dif_vec, 1, 0.d0, tmp_vec, 1)

          tmp_vec(:)=pre_dif(:)+tmp_vec(:)-tim_htr(:)

          call dgemv("N", bkg_len, bkg_len, 1.d0, tim_bkc, bkg_len, tmp_vec, 1, 0.d0, grd_jvc, 1)
          new_vec(t,:)=ges_vec(:)-grd_jvc(:)*alph

          if(mthd > 4) tlm_vec(t,:)=new_vec(t,:)
       end do
!$OMP END PARALLEL DO

jimdempseyatthecove · ‎11-03-2017

You should be able to do something like this:

subroutine yourSubroutine(yourArgs)
  ...
!$OMP PARALLEL DO SHARED
  do t=tim_len,1,-1
    call containedSubroutine(t)
  end do
  ...
!$OMP END PARALLEL DO                                                                                          
  contains
  subroutine containedSubroutine(t)
    implicit none
    integer :: t
    ! declare all private entities here
    real, allocatable :: tim_bkc(:,:), tim_hrh(:,:), tim_htr(:), tim_bkv(:)
    ...
    ! leave shared variables undefined here
    ...

    tim_bkc(:,:)=bkg_cov(t,:,:)
    tim_hrh(:,:)=brh_cov(t,:,:)
    tim_htr(:)=bht_ino(t,:,1)
    tim_bkv(:)=bkg_vec(t,:)
    if(t.eq.tim_len) then
      mdl_vec(:)=tlm_vec(t,:)
      if(mthd > 3) mdl_vec(:)=0.5*(tlm_vec(t,:)+anl_vec(t,:))
    else
      if (mthd.eq.3) mdl_vec(:)=tlm_vec(t+1,:)
      if (mthd > 3) mdl_vec(:)=anl_vec(t+1,:)
      model = qg_model_type(bkg_config, state_vector=mdl_vec(:), step=t)
      call model%adv_nsteps(1)
      model_ADJ = qg_adj_type(bkg_config, state_vector=model%get_state_vector(), trajectory_vector=model%get_state_vector() - mdl_vec(:), step = t + 1)
      call model_ADJ%adv_nsteps(4)
      mdl_vec(:) = model_ADJ%get_state_vector()
    end if
    if(mthd < 4) ges_vec(:)=mdl_vec(:)
    if(mthd > 3) ges_vec(:)=0.5*(tlm_vec(t,:)+mdl_vec(:))
    tmp_vec(:)=(ges_vec(:)-tim_bkv(:))*(B+Q)
    dif_vec(:)=(ges_vec(:)-tim_bkv(:))*(B+Q)

    call pre_con_dif(tim_bkc,tmp_vec)
    pre_dif(:)=tmp_vec(:)

    call dgemv("N", bkg_len, bkg_len, 1.d0, tim_hrh, bkg_len, dif_vec, 1, 0.d0, tmp_vec, 1)

    tmp_vec(:)=pre_dif(:)+tmp_vec(:)-tim_htr(:)

    call dgemv("N", bkg_len, bkg_len, 1.d0, tim_bkc, bkg_len, tmp_vec, 1, 0.d0, grd_jvc, 1)
    new_vec(t,:)=ges_vec(:)-grd_jvc(:)*alph

    if(mthd < 4) tlm_vec(t,:)=new_vec(t,:)
   end subroutine containedSubroutine
end subroutine yourSubroutine

*** the above is in line with how you have your arrays layed out.

I strongly suggest that you consider rearranging the partitioning index t, to be the last index of your arrays that are partitioned by t.

By doing so, in many places you can avoid upon entry the copying from the t'th slice of an array through use of an expensive gather operation
and then prior to exit, avoid copying to the t'th slice of an array through use of an expensive scatter operation.
The copying of the data is likely avoidable.

Jim Dempsey

jimdempseyatthecove · ‎11-03-2017

Virtually all CPUs you will run your code on will have SIMD instructions. Write your code to take advantage of this.

if(mthd > 3) mdl_vec(:)=0.5*(tlm_vec(t,:)+anl_vec(t,:))

will not vectorize (or not without gather).

if(mthd > 3) mdl_vec(:)=0.5*(tlm_vec(:,t)+anl_vec(:,t)) ! requires swapping array indexes

will vectorize

A similar situation for all the other array slices you have.

Jim Dempsey

Christopher_H_1 · ‎11-03-2017

Thank you for pointing out the bad indexing performance issue. A colleague pointed that out to me a while ago, and I got busy with other things and forgot to fix it. How embarrassing that I've put it online for everyone to see! I'm working on reordering things and using your suggestion of a subroutine for the contents of the loops. Thank you for such a quick response and suggestion.

jimdempseyatthecove · ‎11-04-2017

It is easy for a new Fortran programmer to not be used to the indexing due to C/C++/C# being the other way around. C is that way due to it not having an array descriptor and not having multi-dimension arrays. It uses an array of pointers to data or array of pointers for the next index, ...

In your example, changing the indexing could potentially gain from 8x to 16x the efficiency of use of the cache system of your CPU. Any 'pain" in coding effort has a relatively high payback in gains.

Jim Dempsey

Christopher_H_1 · ‎11-06-2017

I want to follow up and thank you for your prompt assistance. I was successful in creating subroutines, as you directed, to contain the contents of each of my parallel loops. The compiler was then able to parallelize those loops and timings verify that it is running well in parallel. The indexing was also corrected. I have no further questions on this issue. Thanks again for your help.