Steve, Colleagues,
I am working with a relatively large program (~9000 lines of code) with deeply nested loops. I have never been able to get the compiler to act on the !$omp parallel do directive -- even for single, tight loops at the beginning of the code. No messages, no warnings -- just no parallelization. Otherwise, the generated code is fine. The optimization report records no loop parallelization, and running with VTune verifies that. I observe that the compiler DOES perform vectorization in this large program.
OpenMP parallel do directives (and vectorization) ARE implemented in some of the smaller subroutines this program calls.
It occurred to me to try the undocumented compiler option /Qoverride-limits. This allows (apparently) the compiler to implement the parallel do optimization in the large program and it now parallelizes the loops I indicate with !$omp parallel do. Can you provide more information about what is happening? This does not seem to be an available memory limit, but rather something internal to the compiler. Are there other ways around these limits? I realize I'm on my own when using undocumented compiler options, but more information would be helpful.
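For context, the loops in question are of roughly this shape (a minimal sketch with hypothetical names, not the actual code) -- it is directives like this one that the compiler silently ignored without /Qoverride-limits:

```fortran
! Sketch (hypothetical names) of a single, tight loop with the
! !$omp parallel do directive the compiler was not acting on.
subroutine scale_array(a, b, n)
  implicit none
  integer, intent(in) :: n
  real, intent(in)    :: b(n)
  real, intent(out)   :: a(n)
  integer :: i
  !$omp parallel do private(i) shared(a, b, n)
  do i = 1, n
     a(i) = 2.0 * b(i)
  end do
  !$omp end parallel do
end subroutine scale_array
```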
David
I haven't heard of the symptom you mention... other than from omitting to specify that the program is to be compiled with OpenMP enabled, .OR. from linking with the OpenMP stubs library. The stubs library is used to compile as OpenMP but run single-threaded. This is primarily used as a diagnostic to assure the parallel code conforms to the serial code (and has slightly different behavior than running the parallel version and library with 1 thread).
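One quick way to check which situation you are in (a sketch; assumes the Intel-supplied omp_lib module) is to print the team size from inside a parallel region. Built with OpenMP enabled, this normally reports more than one thread; linked against the stubs library, it reports 1:

```fortran
! Diagnostic sketch: with OpenMP enabled this typically prints the
! number of threads in the team; with the stubs library it prints 1.
program check_openmp
  use omp_lib
  implicit none
  !$omp parallel
  !$omp single
  print *, 'threads in region: ', omp_get_num_threads()
  !$omp end single
  !$omp end parallel
end program check_openmp
```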
Jim Dempsey
Speaking of no-longer-documented options, if you have multiple procedures in the big file, I would suggest trying -Qinline-functions-, or (better) splitting procedures into separately compiled source files.
Intel OpenMP has difficulty where there are too many private variables. Possibly, "too many" might be a higher limit in 64-bit compilation.
Sorry if my previous responses show up later.
What level of optimization report did you enable? I'd expect the higher levels to give you more information. I'd ask that you provide a test case to us at Intel Premier Support so that it can be investigated.
Taking Tim's lead, on the !$OMP PARALLEL... lines add DEFAULT(NONE)
Your initial compile will generate lots of errors. Then, on a case-by-case basis, add variables to the PRIVATE(...) and SHARED(...) clauses. Use OpenMP continuation lines if necessary. This can greatly reduce the number of shared variables.
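As a sketch of that workflow (hypothetical variable names), using !$omp& continuation lines to keep the clause lists readable:

```fortran
! Sketch of DEFAULT(NONE): every variable referenced inside the region
! must now appear in an explicit clause, so the compiler flags any
! variable whose sharing you have not yet decided.
!$omp parallel do default(none)  &
!$omp&   private(i, tmp)         &
!$omp&   shared(a, b, n)
do i = 1, n
   tmp  = b(i) * b(i)
   a(i) = tmp + 1.0
end do
!$omp end parallel do
```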
Jim Dempsey
Jim,
Yup, I've NOT got the project parameters set for OpenMP stubs or disabled -- I've got it set for generating OpenMP code.
I learned a few months ago (from one of your previous posts on this subject) that whenever I build an OpenMP parallel do loop, it is necessary to start with what I think is the correct list of shared variables, set DEFAULT to NONE, and look at the resulting compiler error list of unspecified variables, placing them one by one in private or shared as appropriate. Even then, it can (and usually does) take a while to get the right variables placed in firstprivate, private, or shared -- especially if the loop spans a lot of code and one or two hundred variables are involved. And I've learned to be careful with firstprivate, since that can trigger an awful lot of memory copying at thread-initiation time. All this is my standard rubric now.
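To illustrate the firstprivate cost mentioned above (a sketch with hypothetical names and sizes): FIRSTPRIVATE gives every thread its own initialized copy of the listed data at region entry, so a large array there is copied once per thread.

```fortran
! Sketch (hypothetical names/sizes).  firstprivate(work) would make
! every thread copy the entire work array at thread start-up; when the
! loop only reads work, shared(work) avoids all of that copying.
real    :: a(1000000), work(1000001)
integer :: i

!$omp parallel do default(none) shared(a, work) private(i)
do i = 1, 1000000
   a(i) = work(i) + work(i+1)
end do
!$omp end parallel do
```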
BUT, in the case of the large program I'm working on now, without the /Qoverride-limits I don't even get that far.
David
Steve,
re: /override-limits
Is there an undocumented diagnostic option that reports the high-water mark? It might be useful to you if David could use that and report back the limit required.
My approach, if I were David and the process were non-disruptive, would be to identify variables that are suitable for placement in modules, create and USE the module, and verify that nothing broke. Then examine the code to locate excision points that are least disruptive yet remove large sections of code, place that code into a subroutine (one that USEs the module), and insert the call. Not having seen the code, I cannot say whether this would be wise. David is competent enough to make these decisions.
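That refactoring might look roughly like this (a sketch with hypothetical names): hoist the shared state into a module, then carve a large section of the big routine out into a subroutine that USEs it.

```fortran
! Sketch (hypothetical names): state moved to a module, then a large
! excised section of the big routine becomes its own subroutine.
module big_state
  implicit none
  real, allocatable :: field(:), flux(:)
  integer :: npts
end module big_state

subroutine excised_section()
  use big_state
  implicit none
  integer :: i
  !$omp parallel do default(none) private(i) shared(field, flux, npts)
  do i = 1, npts
     flux(i) = 0.5 * field(i)
  end do
  !$omp end parallel do
end subroutine excised_section

! In the large program, the removed block is replaced by:
!   call excised_section()
```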
Not knowing the root cause of the problem, it is difficult for me to ascertain what might be the best suggestion for David.
Before he whacks up the program, I might suggest he experiment with using full optimizations... but disabling IPO. IPO, though beneficial, is also a resource hog.
Jim Dempsey
Jim, I am not aware of a specific option or message like that, but I have seen the optimization report give a message that could be helpful. I'll have to check with the optimizer developers to see if they have suggestions that might help.
