Intel Fortran OpenMP compiler options

Peter_F_5 · ‎12-21-2016

I am running into an issue: the same code yields different results depending on whether I compile it in Visual Studio or directly from the command line.

If I compile through Visual Studio, these are the default options, to which I add /Qopenmp and a stack size of 9999999 in the project properties:

/nologo /debug:full /Od /Qopenmp /module:"Debug\\" /object:"Debug\\"
/Fd"Debug\vc120.pdb" /traceback /check:bounds /check:stack /libs:dll
/threads /dbglibs /c

If I compile directly through command line I use:

ifort /openmp /F999999999 Main.f90

The first option parallelizes but takes 57 hours to run. The second option takes 6 hours. The issue is that the results differ between the two and I have no clue why.

mecej4 · ‎12-22-2016

The difference in times to complete execution is caused by your using /check options in one run and not due to using Visual Studio.

The differences in the results indicate a bug in the program or, less likely, a bug in the compiler, or, even less likely, an unstable numerical algorithm. Even compiling with /check does not ensure that all bounds errors will be caught. There are many other types of errors that could be responsible, as well.

Hunting for errors will be more successful if you can run the program with smaller data sets, if that is possible without making the errors stop occurring.

andrew_4619 · ‎12-22-2016

You also have /debug:full and no optimisation (/Od), which is always going to be significantly slower.

TimP · ‎12-22-2016

If you have minor differences in numerical results due to auto-vectorization, those optimizations should be suppressed by /fp:source or possibly /Qip- . Major differences due to inconsistent parallelization are the subject of Intel Parallel Inspector.

If your application depends on adherence to Fortran rules on expression evaluation, you should be setting /assume:protect_parens or /standard-semantics.
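To make the point about expression evaluation concrete, here is a minimal sketch of compensated (Kahan) summation, a common pattern that only works if the compiler honors the parentheses. It is not taken from the original poster's code; the program and variable names are hypothetical. If the compiler is allowed to reassociate, `(t - s_kahan) - y` can be simplified algebraically to zero and the correction term is lost.

```fortran
program kahan_demo
  implicit none
  integer, parameter :: n = 1000000      ! hypothetical problem size
  real    :: s_naive, s_kahan, c, y, t, x
  integer :: i

  s_naive = 0.0
  s_kahan = 0.0
  c = 0.0          ! running compensation for lost low-order bits
  x = 0.1          ! not exactly representable in binary floating point

  do i = 1, n
     ! Plain accumulation: rounding error grows with n.
     s_naive = s_naive + x

     ! Kahan step: the parentheses below must be evaluated as written.
     y = x - c
     t = s_kahan + y
     c = (t - s_kahan) - y
     s_kahan = t
  end do

  print *, 'naive sum :', s_naive
  print *, 'kahan sum :', s_kahan
  print *, 'reference :', 0.1d0 * n      ! double-precision reference value
end program kahan_demo
```

Building this once with default options and once with /fp:source or /assume:protect_parens, then comparing the printed sums, is one way to check whether a given build is reassociating expressions. Whether a difference actually appears depends on the compiler version and the optimizations it applies.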

Peter_F_5 · ‎12-22-2016

Hi again,

Following your answers, I ran the code in two ways:

1) I started by running the code from the command line:
"ifort /openmp /F999999999 Main.f90 "
The results differed from the original results, though not by a large amount. This took around 6 hours.

2) Then I ran it in the Visual Studio interface using:
"/nologo /O2 /Qopenmp /module:"Release\\" /object:"Release\\" /Fd"Release\vc120.pdb" /libs:dll /threads /c"
With this, the results are identical to the original debug version, but it takes more than 10 hours.

So it seems that with the second version I get the results I want, but at the cost of a much slower run. Any advice?

Steven_L_Intel1 · ‎12-22-2016

On the command line build you changed the stack reserve size to one that is closing in on 1GB. This dramatically changes the in-memory layout of the executable and thus may cause uninitialized variables to get different values. It would be interesting to see what happens if you bring that value down or even omit it. You don't show the link options, though - maybe you set the stack reserve there in VS.

My guess is that you have an uninitialized variable that is changing how quickly your algorithm converges (and also changing results.)
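As an illustration of the kind of bug being described (a hypothetical sketch, not the poster's code), the routine below uses a convergence tolerance that is never assigned. The number of iterations then depends on whatever happens to be in that stack location, which can change with memory layout, optimization level, or stack size.

```fortran
program uninit_demo
  implicit none
  call iterate()
contains
  subroutine iterate()
    real    :: tol      ! BUG (deliberate): never assigned before use
    real    :: x
    integer :: iter

    x = 1.0
    iter = 0
    ! The loop count depends on whatever value tol happens to hold.
    ! The iteration cap keeps the demo from running forever.
    do while (abs(x) > tol .and. iter < 1000)
       x = 0.5 * x
       iter = iter + 1
    end do
    print *, 'iterations:', iter, '  final x:', x
  end subroutine iterate
end program uninit_demo
```

The compiler's /check:uninit option can flag some cases like this at run time, although it does not cover every kind of variable, so a clean run with it is not proof that everything is initialized.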

Peter_F_5 · ‎12-22-2016

Hi Steve. Thanks for the answer.

Below are the linker options from VS (I do set the stack reserve there). If I set the stack reserve to anything lower, I get a stack overflow error.

/OUT:"Release\Benchmark.exe" /INCREMENTAL:NO 
/NOLOGO /MANIFEST /MANIFESTFILE:"Release\Benchmark.exe.intermediate.manifest" 
/MANIFESTUAC:"level='asInvoker' uiAccess='false'" /DEBUG 
/PDB:"D:\Benchmark.pdb" /SUBSYSTEM:CONSOLE /STACK:999999999 /IMPLIB:"D:\Benchmark\Release\Benchmark.lib"

I have checked the code many, many times, and I doubt there is an uninitialized variable, but it is possible. In any case, given that I set the stack size in both ways of compiling, there should be no evident reason for the slowdown. Or am I missing something?

mecej4 · ‎12-22-2016

A minor point: the stack has to be aligned and sized to suit the processor word size. For IA-32 that means that the stack base and the stack size should be a multiple of 4 bytes. For x64, a multiple of 8 bytes. You specified an extension of the Herman Cain mantra (9-9-9), which is not suitable.

The start-up code in the language runtime probably fixes these things up. However, it is good to be aware of this issue.

Steven_L_Intel1 · ‎12-22-2016

At this point I suggest instrumenting your code to see where it starts to diverge.
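One possible way to instrument the code, sketched under the assumption that the key state lives in double-precision arrays: print a labeled checksum at a few checkpoints, run both builds, and compare the logs to find the first checkpoint where they differ. The module, routine, and label names here are hypothetical.

```fortran
module trace_mod
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Print a labeled summary of an array with enough digits that
  ! small differences between builds show up in a text comparison.
  subroutine checkpoint(label, a)
    character(len=*), intent(in) :: label
    real(dp),         intent(in) :: a(:)
    write (*, '(a, ": sum=", es24.16, " maxabs=", es24.16)') &
         label, sum(a), maxval(abs(a))
  end subroutine checkpoint
end module trace_mod

program trace_demo
  use trace_mod
  implicit none
  real(dp) :: v(5)
  integer  :: i

  v = [(real(i, dp), i = 1, 5)]
  call checkpoint('after init', v)

  v = sqrt(v)
  call checkpoint('after sqrt', v)
end program trace_demo
```

Redirecting the output of each build to a file and comparing the two files (for example with the Windows `fc` command) narrows the search to the code between the last matching checkpoint and the first differing one. Calls like this are best placed outside parallel regions so the log order stays deterministic.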

IanH · ‎12-22-2016

mecej4 wrote:

A minor point: the stack has to be aligned and sized to suit the processor word size. For IA-32 that means that the stack base and the stack size should be a multiple of 4 bytes. For x64, a multiple of 8 bytes. You specified an extension of the Herman Cain mantra (9-9-9), which is not suitable.

The start-up code in the language runtime probably fixes these things up. However, it is good to be aware of this issue.

The stack size specified in the executable header is rounded up to the system's allocation granularity by the operating system when the exe for the process is loaded. The allocation granularity is quite a bit larger than the processor word size.
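The arithmetic behind both remarks can be sketched as follows. The 64 KiB granularity is an assumption (it is the value commonly reported on Windows); the actual value should be queried on the target system.

```fortran
program stack_rounding
  use iso_fortran_env, only: int64
  implicit none
  integer(int64), parameter :: requested   = 999999999_int64
  integer(int64), parameter :: granularity = 65536_int64    ! ASSUMPTION: 64 KiB
  integer(int64) :: rounded

  ! Round up to the next multiple of the granularity.
  rounded = ((requested + granularity - 1) / granularity) * granularity

  print '(a, i0)', 'requested bytes        : ', requested
  print '(a, i0)', 'remainder modulo 8     : ', mod(requested, 8_int64)
  print '(a, i0)', 'rounded to granularity : ', rounded
end program stack_rounding
```

Under that assumption, the requested 999999999 bytes (which leaves a remainder of 7 when divided by 8) would be rounded up to 1000013824 bytes, a multiple of both 4 and 8.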

Martyn_C_Intel · ‎12-22-2016

Have you tried watching your executables in Task Manager? Check number of active threads and cores, memory footprint, cpu usage, etc. Does the application built in Visual Studio behave the same way if run from the command line instead?

If your application has data that are private to individual threads, you may need to worry about the thread stack sizes (OMP_STACKSIZE) as well as the main program stack.
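A minimal sketch of why thread stack size matters, assuming the default behavior where automatic arrays are placed on the stack: each thread that calls the function below gets its own copy of `scratch` on its own stack, so the space needed grows with the array size regardless of how large the main program stack is. The names and the array size are hypothetical.

```fortran
program thread_stack_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 100000     ! hypothetical size: 100000 * 8 bytes per thread
  real(dp) :: total

  total = 0.0_dp
  ! Every thread in the team calls work_sum once.
  !$omp parallel reduction(+:total)
  total = total + work_sum(n)
  !$omp end parallel
  print *, 'sum over all threads:', total

contains

  function work_sum(m) result(s)
    integer, intent(in) :: m
    real(dp) :: s
    real(dp) :: scratch(m)     ! automatic array: lives on the calling thread's stack
    integer  :: i

    do i = 1, m
       scratch(i) = real(i, dp)
    end do
    s = sum(scratch)
  end function work_sum
end program thread_stack_demo
```

If a program like this crashes only when the per-thread arrays get large, raising the thread stack size before running (for example `set OMP_STACKSIZE=64M` at the command prompt, where 64M is just an illustrative value) is a reasonable thing to try.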

When comparing results at different optimization levels, it's good to use /fp:precise (under floating-point in project properties), as Tim indicated. This can help avoid small differences in results due to differences in optimization.

If your OpenMP code contains threaded reductions, these too can result in small variations in results. If your code has race conditions, that can lead to large variations in results, and sometimes a large impact on performance. Do you still see differences in performance or results between Visual Studio and the command line if you run either without threading (/Qopenmp-stubs) or on a single thread (/Qopenmp with the environment variable OMP_NUM_THREADS=1)?
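The effect of reduction order can be seen in a small sketch like the one below (hypothetical names and data). Floating-point addition is not associative, so a serial sum and an OpenMP reduction over the same values can differ in the last digits, and the reduction result may also change with the number of threads.

```fortran
program reduction_order
  implicit none
  integer, parameter :: n = 1000000     ! hypothetical problem size
  real    :: a(n), serial_sum, omp_sum
  integer :: i

  do i = 1, n
     a(i) = 1.0 / real(i)
  end do

  ! Serial sum: values are added in index order.
  serial_sum = 0.0
  do i = 1, n
     serial_sum = serial_sum + a(i)
  end do

  ! Threaded reduction: each thread sums its own chunk, then the
  ! partial sums are combined, so the order of additions differs.
  omp_sum = 0.0
  !$omp parallel do reduction(+:omp_sum)
  do i = 1, n
     omp_sum = omp_sum + a(i)
  end do
  !$omp end parallel do

  print *, 'serial    :', serial_sum
  print *, 'reduction :', omp_sum
  print *, 'difference:', serial_sum - omp_sum
end program reduction_order
```

Differences of this kind are normally small. If the two builds disagree by much more than this sort of rounding noise, a race condition or an uninitialized variable is the more likely explanation.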

Intel Inspector is a powerful tool for detecting race conditions, but you'd need to run it on a smaller workload. Likewise for the single thread experiment.