Intel® Fortran Compiler

!dir$ loop count vs !dir$ parallel

Frankcombe__Kim
Beginner

I'm working through my old code to improve its performance and running GAP (Guided Auto Parallelization). The reports often suggest I add !dir$ loop count min(XXX) to the code. Although I reckon that more often than not the loop count will exceed XXX, I can't guarantee it. I have therefore added IF blocks to the code along the lines of the following example.

Assume A and B are vectors of the same size

  IF (SIZE(A) > 255) THEN

!DIR$ LOOP COUNT MIN(256)

    B=A

  ELSE

    B=A

  END IF

This strikes me as pretty ugly, and there will be a small performance penalty as the IF condition is tested.

Alternately I could write

!DIR$ PARALLEL ALWAYS

  B=A

Is there a better way, and which approach to coding is considered the best? Will the computation give the wrong answer if I set loop count min and at run time there are fewer iterations to do, or will it just be inefficient?

Cheers

Kim

TimP
Honored Contributor III

My guess would be to set realistic min and avg values. If you are willing to dig deeper, you would look to see whether you are getting an effective combination of vectorization and parallelization.
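
For example, something along these lines (a sketch only; the 64 and 8192 figures are placeholder estimates to be replaced with values realistic for your data):

!DIR$ LOOP COUNT MIN(64), AVG(8192)
      DO i = 1, SIZE(A)
        B(i) = A(i)
      END DO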

OpenMP offers more explicit control with its SIMD and IF clauses, but that would require DO loops unless the recent improvements in WORKSHARE take hold.
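
For example (a sketch only, assuming a compiler with OpenMP 4.0 support; the 255 threshold is illustrative), the IF clause keeps small arrays serial while the combined construct requests both threading and SIMD on an explicit DO loop:

!$OMP PARALLEL DO SIMD IF(SIZE(A) > 255)
      DO i = 1, SIZE(A)
        B(i) = A(i)
      END DO
!$OMP END PARALLEL DO SIMD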

Frankcombe__Kim
Beginner

Thanks Tim.

For some code I can have some feel for realistic minima, but in other cases the routines are called by many different routines doing all sorts of different tasks, so the array size will be quite variable and the range will be several orders of magnitude either side of the average, e.g. sorting arrays. I work with data sets ranging in size from a few KB to more than 10 GB. I inferred from the messages in the GAP reports that if I set the loop count to a smaller number than the one recommended, the auto-parallelization would not work.

I guess the critical question is: if I set loop count min(256) at compile time and only have 20 iterations at run time, will it fail or just not be parallel? Happy if it's the latter, not so happy if it's the former.

Cheers

Kim
Steven_L_Intel1
Employee

If the count is too low, your program will still run but parallelization/vectorization may be inefficient for that loop.

Frankcombe__Kim
Beginner

Thanks Steve

That's what I wanted to hear. I'll go back through and remove all the IF blocks I added today. At fewer than 256 iterations I can handle inefficient if I get efficiency at 1.0E6 iterations.

Cheers

Kim
jimdempseyatthecove
Honored Contributor III

Kim,

Have you considered replacing

A = B

with

CALL YourFastArrayCopy(A,B)

where all the "ugliness" of the IF test can be hidden inside the subroutine. Change "YourFastArrayCopy" to a name of your liking and arrange the in array and out array in the order you want.

Forcing a parallel copy on small arrays is horrible, forcing perhaps a large number of threads on medium arrays isn't efficient, and forcing more threads than the memory subsystem can handle is detrimental to the threading resources available for other areas of your program as well as other applications on your system.
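
A minimal sketch of that kind of wrapper (the REAL rank-1 dummy arguments and the 256-element threshold are illustrative assumptions, not a recommendation):

      SUBROUTINE YourFastArrayCopy(A, B)
        REAL, INTENT(OUT) :: A(:)   ! destination
        REAL, INTENT(IN)  :: B(:)   ! source
        INTEGER :: i
        IF (SIZE(B) >= 256) THEN
          ! Large array: give the optimizer a realistic trip count to work with.
!DIR$ LOOP COUNT MIN(256)
          DO i = 1, SIZE(B)
            A(i) = B(i)
          END DO
        ELSE
          ! Small array: a plain serial copy avoids any threading overhead.
          A = B
        END IF
      END SUBROUTINE YourFastArrayCopy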

Jim Dempsey

Frankcombe__Kim
Beginner

Jim

I hadn't considered your approach. In the case of this code I'm not sure it would be practical, as many of the subroutines are 4-5 liners just doing an assignment or a vector swap/copy. They in turn are called by many other routines.

I'm experimenting to some extent to see if I can improve performance, and at the moment I'm working through the Numerical Recipes code, as that is used extensively by my other routines. You may think that spending time on the NR code is a triumph of stubbornness over common sense, but compared to the MKL routines it is well documented, and I have the source code, so I am not locked to a platform or compiler and can pull it apart to understand what is happening. I also don't have to worry about the stack issues with MKL, as I push any non-allocated arrays onto the heap. I'm also using it as a learning exercise, albeit perhaps not a particularly efficient one.

I've found that the GAP reports are not as useful as they might be. I'll get a message to insert a loop count min() directive, then when I re-run GAP it tells me I can't put loop count min() there because it is an array rather than a vector, or it is inside a WHERE construct, or some other issue, so I'm learning to recognise when to insert the directives and when to ignore the guide. The report doesn't actually tell me I can't do it on arrays, but after working through the "manual" I've come to that conclusion; likewise with intrinsic functions, hopefully because they are automatically optimised, although the GAP report is not aware of that.

Only another 1000 lines to go before I finish the NR code I use, so I'll recompile at that point and compare run times with the existing version for both small and large arrays. If it doesn't help I can always just overwrite it all with the version I had a few days ago and get back to doing something more productive which pays the bills. :-) I'd certainly hope to see an improvement, and if I pay a small price for small arrays then it isn't such a worry; 256 iterations is hardly worth getting up for, so I'm expecting it won't happen often.

If it does help, then I'll also do the handful of NSWL and IMSL routines I use, as those are F77 and currently very poorly optimised. Again I hear groaning about unproductive work, but consider it holiday relaxation. :-}

Cheers

Kim
John_Campbell
New Contributor II

Kim,

256 iterations appears to be a very small loop for any parallel computation saving. It is worth estimating the overhead of initiating a !dir$ parallel region (effectively an !$OMP PARALLEL DO, I assume), which I estimate to be of the order of 2-10 microseconds. This is equivalent to about 20,000 processor cycles, so the saving within the loop needs to be substantial. I think that openmp.org does provide a benchmark suite to estimate these delays for your environment, as well as a useful demonstration of a number of parallel coding structures.
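
If you want to measure the overhead in your own environment, here is a rough sketch along the lines of those benchmarks (assumes an OpenMP-enabled build; the trip count is arbitrary):

      PROGRAM region_overhead
        USE omp_lib
        IMPLICIT NONE
        INTEGER, PARAMETER :: ntrips = 100000
        INTEGER :: k
        DOUBLE PRECISION :: t0, t1
        t0 = omp_get_wtime()
        DO k = 1, ntrips
!$OMP PARALLEL
          CONTINUE        ! empty region: measures fork/join cost only
!$OMP END PARALLEL
        END DO
        t1 = omp_get_wtime()
        PRINT '(A,F8.3,A)', 'Average region entry: ', (t1 - t0) / ntrips * 1.0D6, ' microseconds'
      END PROGRAM region_overhead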

It is also worth considering how many times this parallel region is initiated: if it is only a few times, there is not much performance change either way, but if it is, say, 10^8 times, that would involve an overhead of around 500 seconds. It is recommended that !$OMP be applied to the outer loop rather than the inner loop to limit this effect, while vectorization of inner loops is the more effective approach.
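
For example (illustrative only; the array names and bounds are assumptions), the parallel region goes on the outer loop once, and the inner loop is left to the compiler's vectorizer:

!$OMP PARALLEL DO PRIVATE(i, j, s)
      DO i = 1, n
        s = 0.0D0
        DO j = 1, m                ! inner loop: candidate for vectorization
          s = s + a(i,j) * b(j)
        END DO
        c(i) = s
      END DO
!$OMP END PARALLEL DO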

It always helps to do a rough estimate of the likely overhead and benefit when applying any form of parallel approach. The good thing is that once you have installed the code changes, it can be as simple as changing a compiler switch to contrast the effect of these changes and learn more about their suitability for a few real examples.

John

Frankcombe__Kim
Beginner

Thanks John

256 was the minimum number suggested by the guided auto parallel report. The number at run time could vary from as few as 20 to several million with a median up in the several thousands.

At this stage I was trying to keep things simple and just use the Intel auto parallel options giving it some help with compiler directives. If that does not make a lot of difference or causes problems I'll go to OpenMP but I wanted to walk before running.

Cheers

Kim
TimP
Honored Contributor III

Writing separate parallel and serial versions, Cilk(tm) Plus style, hardly looks like keeping it simple, e.g. from netlib vectors benchmark s176:

        if(m > 201)cilk_for (int i__ = 1; i__ <= m; ++i__)
            a[i__] += __sec_reduce_add(b[i__:m]*c__);
        else for (int i__ = 1; i__ <= m; ++i__)
            a[1:m] += b[1+m-i__:m]*c__[i__];

vs. ifort with OpenMP 2:

!$omp parallel do if(m>201)

        do i= 1,m
          a(i)= a(i)+dot_product(b(i:i+m-1),c(m:1:-1))
        end do

The point GAP makes is well taken: a large number of outer-loop iterations may be needed before threading should be considered. If the compiler's default guesses imply a number which is too small, the compiler may be justified in deciding not to parallelize. When you cover a range of cases where you need both multi- and single-threaded code, OpenMP fits well. You would need to examine whether auto-parallel with loop count min() and avg() has encouraged the compiler to generate both single- and multi-threaded versions, if that is your goal, but that may be more effort than using OpenMP.

If parallelization has to be done at the expense of vectorization (as it is in this case when compiled with gfortran/gcc/MSVC), it may never pay off, although some people still brag about speedup over non-vector serial code. In the case above, the parallel-vector reduction version gives marginally better accuracy; that is in fact connected with the difficulty many compilers have with its optimization.

Frankcombe__Kim
Beginner

For the benefit of anyone following this thread I thought I'd post my results.

After spending about a week (not solid) working through the ~100 source code files and adding compiler directives where appropriate, I recompiled the program and took a 3-million-point database made up of 285 lines (groups) of data ranging in length from 400 to 35,000 points, with an average around 9,000 points. I compared code compiled with /O3, /O3 /Qpar, and /O3 /Qpar with the !dir$ compiler directives recommended by the auto-guide.

I ran an FFT using the IMSL F77 routine dfftci and a Butterworth filter using some internal code. In adding the compiler directives to the old IMSL code I also re-ordered some of the nested loops to improve vectorisation and changed the way that some of the array indices were updated to remove interdependencies, but I suspect there is a lot more that could be done, and if I used more of their routines I would just buy the F95 version and stop messing about.

There was a small speed-up in each case, with the majority of the time spent in disk I/O. The CPU utilisation as shown by Task Manager was greater in each case going from non-parallel to parallel with compiler directives (not surprising). Obviously the non-parallel version was single core, but interestingly the auto-parallel version without compiler directives only used 3 cores at 40% each doing a Butterworth filter on the lines, while the version with compiler directives used all 8 cores (hyperthreaded) at 100% for the same operation. Most importantly, however, the answers for the three runs were the same. The only negative was that the executable grew from 8.4 MB under /O3 and /O3 /Qpar to 9 MB under /O3 /Qpar with compiler directives; compilation time also increased noticeably when adding /Qpar to the compiler string.

Thanks for the advice.

Cheers

Kim
John_Campbell
New Contributor II

Kim,

Your reported 40% CPU may not be due to using only 3 cores; it could be from using 8 cores for 30% of the time and 1 core for 70% of the time. You may have a mix of parallel and single-thread calculations taking place during the reporting interval of Task Manager.

My experience with OpenMP has been with a few localised calculations. I record the number of !$OMP region entries and try to relate that to the run-time saving and to the estimated overhead of initiating each parallel region entry. I also record stats based on int(log(op_count)) to profile the calculations. This did help me better understand what was happening and improve the approach.

I also tried to introduce code that selected serial code if the loop run time was estimated to be too small, but I found that, for my case, this covered only a small proportion of the total calculations, so the added mess of duplicated code didn't achieve a noticeable run-time improvement. I scrapped this approach.

My solution was to identify a good way to group the calculations that minimised the !$OMP entries (overhead). I also found that SCHEDULE(DYNAMIC) did provide some improvement for variable thread loads, although these potential improvements can be difficult to measure. My final approach is listed below, where I explicitly defined the use of all variables in the parallel region. This approach helped me to identify what is and is not happening.

!$OMP  PARALLEL DO                              &
!$OMP& PRIVATE (Ieq,Leq,JB,I0,op_n)             &
!$OMP& SHARED  (A,NA_dia,NA_leq,IA,JBOT,JTOP,no_calc) &
!$OMP& REDUCTION(+ : op_s)                      &
!$OMP& SCHEDULE(DYNAMIC)

You will see that I have limited myself to fairly basic !$OMP instructions. I have left the use of more complex structures to a time when I understand more of how !$OMP works, but I suspect it is probably better to think about the algorithm changes, rather than complex locking structures in !$OMP. I had to change the algorithm approach and find a different way of grouping the calculations that better suited parallel threads.

John
