I've added a load of switches through the Configuration Properties dialog.
The Intel compiler doesn't appear to be responding to them.
The command line is:
/c /O3 /Ob2 /Oi /Ot /Qipo /D "WIN32" /D "NDEBUG" /D "_UNICODE" /D "UNICODE" /EHsc /MD /GS /fp:fast /Yu"StdAfx.h" /Fp"Release/Bayes.pch" /Fo"Release/" /W3 /nologo /Zi /Qopenmp /QxHost /Qparallel /Qopt-report:3 /Qpar-report3 /Qopenmp-report2 /Qvec-report5 /Qpar-threshold
I'm not seeing vectorisation reports, OpenMP reports, parallelisation reports, etc. And I'm not seeing any performance improvement, which makes me suspect that the optimisation switches aren't being used either.
Have you tried just /O3 /Qparallel /Qopt-report (using the specific properties, rather than explicit switches, where applicable)? I think /Qipo may suppress the contents of the opt-report. Also, there may be a limit on how many options are observed when they are entered in fields which appear to accept multiple command-line switches.
If you have OpenMP directives, of course, use /Qopenmp rather than /Qparallel.
Remember that the default options are already aggressive, so improved performance would depend mainly on effective parallelization.
You are using /Qopenmp and /Qparallel together. If I remember correctly, you either use OpenMP with /Qopenmp and manually annotate the code with OpenMP #pragma directives, or you let the compiler work automatically using /Qparallel.
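To illustrate the distinction, here is a minimal sketch (the function name and loop are hypothetical, not from your project): with /Qopenmp the compiler honours the pragma you wrote; with /Qparallel alone it instead tries to find such loops automatically, and mixing the two muddies which mechanism is responsible for what.

```cpp
// Hypothetical example of an explicitly parallelised outer loop.
// Under /Qopenmp the pragma is honoured; compilers without OpenMP
// enabled simply warn and ignore it, and the loop runs serially.
void scale_rows(float *m, int rows, int cols, float k)
{
    #pragma omp parallel for  // each thread takes a share of the rows
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j)
            m[i * cols + j] *= k;
}
```

Either way the result is the same; only who decides to parallelise (you or the compiler) changes.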
Try using the compiler from the command line to see if you get similar problems.
I now have the auto-parallelisation working with some simple samples provided by Intel. Today's big adventure is trying to get that working with my (more complicated) code.
I just followed the syntax (with/without colons) as given in the compiler documentation. I thought it was a bit odd that the options differed in their syntax.
My code base is more complicated than these simple examples. So the challenge today is to identify what's different about my codebase/configuration compared to the simple examples.
Given Intel's interest in the supercomputer market, it seems a reasonable assumption that the tools can handle big codebases, more complicated loop instances, etc.
I'm maybe half-way sure that it is related to "/Qvec-reportN doesn't work with /Qipo in the IDE, especially in 11.1", and it would probably be fixed in the 12.0 release.
You may visit this KB article, which may be related to your problem:
There you will also find multiple workarounds to get the vectorization report, if that is indeed your problem. I thought the other reports should have worked with your code base, though; you may try starting with basic options and finding out incrementally which option suppresses the messages.
Most likely this problem is related to the compiler options rather than the complexity of the code.
With large applications, it's likely to be necessary to set standards compatibility options, and test (e.g. by profiling) to find out which parts of the application should be built at lower optimization levels (-O1).
Complicated loops still require correct use of restrict keywords, OpenMP (under which the programmer takes more control), and perhaps use of directives to control fusion etc., for optimum results. This has been true of supercomputing even since before the advent of OpenMP.
I'm going to go back through the Intel compiler documentation to see what there is in the way of compiler directives regarding loops. I remember noticing something about unrolling and I think blocking.
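For what it's worth, here is a sketch of the sort of loop directive I mean (the pragma name is from my reading of Intel's documentation, so verify it against your compiler version; the function itself is made up):

```cpp
// Sketch: Intel's #pragma unroll(n) asks the compiler to unroll the
// following loop n times. Other compilers typically warn and ignore
// the pragma, so correctness does not depend on it.
float sum_unrolled(const float *a, int n)
{
    float s = 0.0f;
    #pragma unroll(4)  // Intel-specific performance hint only
    for (int i = 0; i < n; ++i)
        s += a[i];
    return s;
}
```

Blocking (tiling) has similar directive support in some versions, but the documentation is the authority on which pragmas your release accepts.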
What did you have in mind when you said "restrict keywords"?
Using the following command line I now have more reporting output than you could poke a big stick at.
/c /O3 /Ob2 /Oi /Ot /D "WIN32" /D "NDEBUG" /D "_UNICODE" /D "UNICODE" /EHsc /MT /GS /fp:fast /Yu"StdAfx.h" /Fp"Release/Bayes.pch" /Fo"Release/" /W3 /nologo /Zi /QxHost /Qparallel /Qopt-report:3 /Qpar-report3 /Qopenmp-report2 /Qvec-report:3 /Qpar-threshold
The /Qpar-threshold is still not recognised. I wonder if this switch has been dropped?
I now have 450k lines of interesting-looking output. And the compile took at least 6 hours on a twin quad-core CPU box with 8 GB RAM!
What I'm hoping for from the compiler is that it will deliver some useful performance gains from aggressive optimisation, vectorise the dot-product routines that Parallel Amplifier indicates are taking up most of the runtime (otherwise I'll have to hand-code some vectorisation using intrinsics or similar), and provide some guidance either way regarding my initial assessment that the loops involved generally won't parallelise well without some aggressive measures like tiling.
I love doing this stuff!
The primary optimization you would want to see for dot products with floating-point arguments is vectorization. Unless your CPU is a recent Intel one, such that /QxHost translates to SSE4, vectorization would require unity stride (the inner loop moves by 1 on the last subscript, or by pointer incrementing). OpenMP can show big improvements on a loop which contains a dot product as its inner loop, but I wouldn't bet on /Qparallel.
Re /Qpar-threshold: the documentation at:
says the default is /Qpar-threshold100, and that "Loops get auto-parallelized only if profitable parallel execution is almost certain. This is also the default if you do not specify n.", where n is the number following the switch.
I've just tried building with:
All of them have the compiler report:
1>icl: command line remark #10148: option '-Qpar-threshold' not supported.
If I do an icl /help at the command line I can't see the option mentioned anywhere. And it's not in the deprecated options either.
I dunno :)
If you use the non-Parallel-Studio C++ compiler, the option will work.
Much of the vectorization guidance and tips/hints can be found in the Compiler User Documentation & Reference guide that comes with the compiler install, which (with the help of the messages about your code) is sufficient to foresee vectorization opportunities in your code. Search for those messages in the doc; see the section on "Vectorization".
Since there are plenty of dot products in your code, the restrict keyword and the /Qstd=c99 option will help a lot to disambiguate pointers in functions and extract good performance. Search for the keyword in the doc.
I hope most of the dot-product loops in your code look similar to:

float xyz(float *a, float *b, int size)
{
    float fvar = 0.0f;
    for (int i = 0; i < size; i++)
        fvar += a[i] * b[i];
    return fvar;
}
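A loop like that becomes much easier for the compiler to vectorise once the pointers are disambiguated. Here is a sketch using the __restrict extension (in a .c file compiled with /Qstd=c99 you would use the C99 restrict keyword instead):

```cpp
// Sketch: the __restrict qualifiers promise that a and b never alias,
// so the compiler can vectorise the loop without emitting a runtime
// overlap check. The promise is the programmer's responsibility.
float dot(const float *__restrict a, const float *__restrict b, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
        s += a[i] * b[i];
    return s;
}
```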
Usually, you should use OpenMP pragmas for the outer loops (I think you already have that in your code), and the inner loops are best suited for the vectorization benefit.
There are many cases where the compiler will not vectorize: function calls inside the loop, dependences, non-unit strides, and many others, which can be found in the documentation. You can use vector pragmas, keywords, and compiler options to work around them.
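One such workaround can be sketched as follows (#pragma ivdep appears in Intel's documentation; the function itself is my invention): the compiler must assume idx may contain duplicate indices, which creates a potential dependence between iterations and blocks vectorization. ivdep tells it to ignore *assumed* dependences (proven ones are still honoured).

```cpp
// Sketch: without the pragma, the compiler assumes a[idx[i]] stores
// may collide across iterations and refuses to vectorise. With it,
// the programmer promises idx holds no duplicates. Non-Intel
// compilers typically warn and ignore the pragma.
void gather_add(float *a, const float *b, const int *idx, int n)
{
    #pragma ivdep  // promise: no two iterations write the same a[idx[i]]
    for (int i = 0; i < n; ++i)
        a[idx[i]] += b[i];
}
```

If the promise is false the vectorised loop produces wrong answers, so such pragmas deserve a comment stating the invariant they rely on.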