I've added a load of switches through the Configuration Properties dialog.
The Intel compiler doesn't appear to be responding to them.
The command line is:
/c /O3 /Ob2 /Oi /Ot /Qipo /D "WIN32" /D "NDEBUG" /D "_UNICODE" /D "UNICODE" /EHsc /MD /GS /fp:fast /Yu"StdAfx.h" /Fp"Release/Bayes.pch" /Fo"Release/" /W3 /nologo /Zi /Qopenmp /QxHost /Qparallel /Qopt-report:3 /Qpar-report3 /Qopenmp-report2 /Qvec-report5 /Qpar-threshold
I'm not seeing vectorisation reports, OpenMP reports, parallelisation reports, etc. And I'm not seeing any performance improvement, which makes me suspect that the optimisation switches aren't being used either.
Have you tried just /O3 /Qparallel /Qopt-report (using the specific properties, rather than explicit switches, where applicable)? I think /Qipo may suppress the contents of the opt-report. Also, there may be a limit on how many options are observed when they are entered in fields which appear to accept multiple command-line switches.
If you have OpenMP directives, of course, use /Qopenmp rather than /Qparallel.
Remember that the default options are already aggressive, so improved performance would depend mainly on effective parallelization.
You are using /Qopenmp and /Qparallel together. If I remember correctly, you either use OpenMP with /Qopenmp and manually annotate the code with OpenMP #pragma directives, or you let the compiler work automatically using /Qparallel.
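To illustrate the distinction, here is a minimal sketch (the function name and loop are hypothetical, not from your project): with /Qopenmp the compiler honours the pragma you wrote; with /Qparallel alone it instead tries to find such loops automatically, and mixing the two muddies which mechanism is responsible for what.

```cpp
// Hypothetical example of an explicitly parallelised outer loop.
// Under /Qopenmp the pragma is honoured; compilers without OpenMP
// enabled simply warn and ignore it, and the loop runs serially.
void scale_rows(float *m, int rows, int cols, float k)
{
    #pragma omp parallel for  // each thread takes a share of the rows
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j)
            m[i * cols + j] *= k;
}
```

Either way the result is the same; only who decides to parallelise (you or the compiler) changes.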
Try using the compiler from the command line to see if you get similar problems.
I now have the auto-parallelisation working with some simple samples provided by Intel. Today's big adventure is trying to get that working with my (more complicated) code.
I just followed the syntax (with/without colons) as given in the compiler documentation. I thought it was a bit odd that the options differed in their syntax.
My code base is more complicated than these simple examples. So the challenge today is to identify what's different about my codebase/configuration compared to the simple examples.
Given Intel's interest in the supercomputer market, it seems a reasonable assumption that the tools can handle big codebases, more complicated loop instances, etc.
I'm maybe half-way sure that it is related to "/Qvec-reportN doesn't work with /Qipo in the IDE, especially in 11.1", and it would probably be fixed in the 12.0 release.
You may visit this KB article, which may be related to your problem:
There you will also find multiple workarounds to get the vectorization report, if that is indeed your problem. I thought the other reports should have worked with your code base, though; you may try starting with basic options and finding out incrementally which option suppresses the messages.
Most likely this problem is related to the compiler options rather than the complexity of the code.
With large applications, it's likely to be necessary to set standards compatibility options, and test (e.g. by profiling) to find out which parts of the application should be built at lower optimization levels (-O1).
Complicated loops still require correct use of restrict keywords, OpenMP (under which the programmer takes more control), and perhaps use of directives to control fusion etc., for optimum results. This has been true of supercomputing even since before the advent of OpenMP.
I'm going to go back through the Intel compiler documentation to see what there is in the way of compiler directives regarding loops. I remember noticing something about unrolling and I think blocking.
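For what it's worth, here is a sketch of the sort of loop directive I mean (the pragma name is from my reading of Intel's documentation, so verify it against your compiler version; the function itself is made up):

```cpp
// Sketch: Intel's #pragma unroll(n) asks the compiler to unroll the
// following loop n times. Other compilers typically warn and ignore
// the pragma, so correctness does not depend on it.
float sum_unrolled(const float *a, int n)
{
    float s = 0.0f;
    #pragma unroll(4)  // Intel-specific performance hint only
    for (int i = 0; i < n; ++i)
        s += a[i];
    return s;
}
```

Blocking (tiling) has similar directive support in some versions, but the documentation is the authority on which pragmas your release accepts.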
What did you have in mind when you said "restrict keywords"?
Using the following command line I now have more reporting output than you could poke a big stick at.
/c /O3 /Ob2 /Oi /Ot /D "WIN32" /D "NDEBUG" /D "_UNICODE" /D "UNICODE" /EHsc /MT /GS /fp:fast /Yu"StdAfx.h" /Fp"Release/Bayes.pch" /Fo"Release/" /W3 /nologo /Zi /QxHost /Qparallel /Qopt-report:3 /Qpar-report3 /Qopenmp-report2 /Qvec-report:3 /Qpar-threshold
The /Qpar-threshold is still not recognised. I wonder if this switch has been dropped?
I now have 450k lines of interesting-looking output. And the compile took at least 6 hours on a twin quad-core CPU box with 8 GB RAM!
What I'm hoping for from the compiler is that it will deliver some useful performance gains from aggressive optimisation, vectorise the dot-product routines that Parallel Amplifier indicates are taking up most of the runtime (otherwise I'll have to hand-code some vectorisation using intrinsics or similar), and provide some guidance either way regarding my initial assessment that the loops involved generally won't parallelise well without some aggressive measures like tiling.
I love doing this stuff!
The primary optimization you would want to see for dot products with floating-point arguments is vectorization. Unless your CPU is a recent Intel one, such that /QxHost translates to SSE4, vectorization would require unity stride (the inner loop moves by 1 on the last subscript, or by pointer incrementing). OpenMP can show big improvements on a loop which contains a dot product as its inner loop, but I wouldn't bet on /Qparallel.
Re /Qpar-threshold: the documentation at:
says the default is /Qpar-threshold100, and that "Loops get auto-parallelized only if profitable parallel execution is almost certain. This is also the default if you do not specify n.", where n is the number following the switch.
I've just tried building with:
All of them have the compiler report:
1>icl: command line remark #10148: option '-Qpar-threshold' not supported.
If I do an icl /help at the command line I can't see the option mentioned anywhere. And it's not in the deprecated options either.
I dunno :)
If you use the non-Parallel-Studio C++ compiler, the option will work.
Much of the vectorization guidance and tips/hints can be found in the Compiler User Documentation & Reference guide that comes with the compiler install, which (with the help of the messages about your code) is sufficient to foresee vectorization opportunities in your code. Search for those messages in the doc; see the section on "Vectorization".
Since there are plenty of dot products in your code, the restrict keyword and the /Qstd=c99 option will help a lot to disambiguate pointers in functions and extract good performance. Search for the keyword in the doc.
I hope most of the dot-product loops in your code look similar to:

float xyz(float *a, float *b, int size)
{
    float fvar = 0.0f;
    for (int i = 0; i < size; i++)
        fvar += a[i] * b[i];
    return fvar;
}
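A loop like that becomes much easier for the compiler to vectorise once the pointers are disambiguated. Here is a sketch using the __restrict extension (in a .c file compiled with /Qstd=c99 you would use the C99 restrict keyword instead):

```cpp
// Sketch: the __restrict qualifiers promise that a and b never alias,
// so the compiler can vectorise the loop without emitting a runtime
// overlap check. The promise is the programmer's responsibility.
float dot(const float *__restrict a, const float *__restrict b, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
        s += a[i] * b[i];
    return s;
}
```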
Usually, you should use OpenMP pragmas for the outer loops (I think you already have that in your code), and the inner loops are best suited for the vectorization benefit.
There are many cases where the compiler will not vectorize: function calls inside the loop, dependences, non-unit strides, and many others, which can be found in the documentation. You can use vector pragmas, keywords, and compiler options to work around them.
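One such workaround can be sketched as follows (#pragma ivdep appears in Intel's documentation; the function itself is my invention): the compiler must assume idx may contain duplicate indices, which creates a potential dependence between iterations and blocks vectorization. ivdep tells it to ignore *assumed* dependences (proven ones are still honoured).

```cpp
// Sketch: without the pragma, the compiler assumes a[idx[i]] stores
// may collide across iterations and refuses to vectorise. With it,
// the programmer promises idx holds no duplicates. Non-Intel
// compilers typically warn and ignore the pragma.
void gather_add(float *a, const float *b, const int *idx, int n)
{
    #pragma ivdep  // promise: no two iterations write the same a[idx[i]]
    for (int i = 0; i < n; ++i)
        a[idx[i]] += b[i];
}
```

If the promise is false the vectorised loop produces wrong answers, so such pragmas deserve a comment stating the invariant they rely on.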