Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

What is wrong with Intel Compiler 11.0

levicki
Valued Contributor I
1,280 Views

This is not a question but a list of issues.

1. Download is huge (707 MB). Downloading new builds means downloading redundant copies of IPP, TBB and MKL whose versions do not necessarily change with each compiler release. Can you spell wasted bandwidth?

2. The folder hierarchy is poorly thought out and it keeps changing -- with every new release I have to edit all include/executable/library paths for both the 32-bit and 64-bit versions of all tools (ICC, IPP, MKL, TBB) in Visual Studio. That is really a major inconvenience considering that the latest compiler builds are unstable and/or produce slower code, so I often have to revert to the previous version.

3. Serious regressions aren't addressed quickly enough (#524001 comes to mind).

4. Important behavior changes aren't properly documented in the Release Notes (applies to all compiler releases so far, not just 11.0).

5. Both 10.1 and 11.0 produce slightly slower code than 9.1, at least for me. That isn't exactly progress in my book.

Those are my top five. There is more but I won't bother you any further. Thanks for your attention.

0 Kudos
25 Replies
Eric_P_Intel
Employee
1,033 Views

Igor,

I had to submit a couple of issues against 11.0.066 today, and I agree with at least your #1 and #2 (the others I haven't tested yet, to be fair). Previously, we asked the compiler team to split the IA-32, x64, and IA-64 versions into three smaller packages, and they did. I expect this 700 MB package to be completely unacceptable to some industry-leading ISVs. (As a manager, would you allow your team of hundreds of developers to each spend the time to download and install a 700 MB package every time a needed update is released? It may be a difficult value proposition even at <100 MB.)

Now I wish I had found some free time to participate in the 11.0 beta program. FWIW, I plan to start knocking hard on the AVX-enabled 11.x compiler ASAP, hopefully shaking out the most severe issues.

- Eric

0 Kudos
levicki
Valued Contributor I
1,033 Views

Quoting - Eric_P_Intel

Igor,

I had to submit a couple of issues against 11.0.066 today, and I agree with at least your #1 and #2 (the others I haven't tested yet, to be fair). Previously, we asked the compiler team to split the IA-32, x64, and IA-64 versions into three smaller packages, and they did. I expect this 700 MB package to be completely unacceptable to some industry-leading ISVs. (As a manager, would you allow your team of hundreds of developers to each spend the time to download and install a 700 MB package every time a needed update is released? It may be a difficult value proposition even at <100 MB.)

Now I wish I had found some free time to participate in the 11.0 beta program. FWIW, I plan to start knocking hard on the AVX-enabled 11.x compiler ASAP, hopefully shaking out the most severe issues before they are found outside of Intel.

- Eric

Hello Eric. I am glad that someone else agrees at least partially because it proves I am not insane.

I must admit I do not like the direction in which Intel developer tools are heading. It is all starting to look like a big mess.

Download size is important not just for the customers; it should be important to Intel, because you pay for hosting bandwidth, and making downloads ridiculously big and redundant is a waste of corporate money which could be put to better use.

My idea would be to keep a package like this one, but also to make all the individual packages available for download. Another way to fix it is to use better compression -- for example, I have just packed 1,818 MB of ICC, IPP and TBB bin, include and library files into a 148.6 MB 7-Zip file. Since 7-Zip is cross-platform and open-source, it could be used to package Intel developer tools, because the setup you use is already somewhat proprietary in nature.

As for the issue #2, folder hierarchy really needs fixing ASAP. It is too much hassle as it stands now. I am suggesting something like this (I hope the formatting will not be changed by the forum software):

[cpp]C:\Program Files\Intel\CPP o
|
+----------o bin
| |
| +-o x64
|
+----------o include
| |
| +-o ipp
| |
| +-o mkl
| |
| +-o tbb
|
+----------o lib
|
+-o x64
[/cpp]

That means no version numbers in the path, so we don't have to edit all environment variables and Visual Studio paths each time one of the components gets updated. For more details regarding issues #1 and #2 as well as some rationale, please take a look at my feature request here -- https://premier.intel.com/premier/IssueDetail.aspx?IssueID=526761.

As for issue #4, take a look at the discussion in this thread -- http://software.intel.com/en-us/forums/showthread.php?t=61803. Intel could have handled this in a more sensible manner than Microsoft, and above all the change should have been documented in big red capital letters in the compiler Release Notes, instead of letting people discover it by building and shipping executables with unresolved dependencies.

Regarding #5, I will just say that both the 10.1 and 11.0 compilers produce faster code than 9.1 for some functions, and slower code than 9.1 for others, so on average the total execution time is the same. Someone might say that is not a valid reason to complain, since the overall speed is not affected or the difference is minimal, but I still consider those code slowdowns a regression for an optimizing compiler.

0 Kudos
TimP
Honored Contributor III
1,033 Views

The most common case of slowdown I have observed, introduced in 10.0 and not fixed in 11.0, is with C++ transform(), which can be optimized only with #pragma ivdep, whereas in 9.1 it was sufficient (as with g++) to use the restrict extension. Without #pragma ivdep, both MSVC9 and gcc out-perform icpc in those cases. I didn't feel I had a strong argument there, as I don't advocate transform(), which can't be optimized anyway without non-standard extensions. There are still a few cases where transform() along with compiler-dependent additions can produce the best code generation.

Another performance issue, which is not an icpc version issue, but a g++ version issue, is that builtins which were introduced in g++ 4.3 aren't optimized by icpc unless #pragma ivdep is set.

For example, the STL copy() function has become unsatisfactory for icpc. Among the arguments I've heard here is that a literal reading of the C++ standard indicates that copy() shouldn't be implemented as memmove(), as MSVC and g++ do. The conclusion drawn from that is that one shouldn't use copy(), but instead make the appropriate choice between memmove() and memcpy(), or a for() loop. Complicating the situation are the abysmal versions of memcpy() and memmove() provided by default in glibc for x86_64 prior to version 2.8, and the difficulty of supporting any brand of CPU adequately in that situation. Also, gcc 32-bit by default optimizes memcpy() as an inline instruction, which is good for a few cases but very bad for others, while gcc 64-bit doesn't do that. Both CPU manufacturers have introduced recent optimizations for the built-in string instructions, so they aren't so bad on the latest CPUs.

Current SuSE versions of glibc have back-ported the good functions, but icc doesn't specifically support many common SuSE versions. icc still replaces memcpy() with its own version, unless you take specific precautions to stop that.

So, there are several specific points on which you might wish to submit support issues on premier.intel.com, if you wish to influence the decisions. In the absence of informed customer input, some decisions may seem sub-optimum.

0 Kudos
TimP
Honored Contributor III
1,033 Views

If you don't want libiomp, don't ask for it. Don't set -parallel or -openmp, don't use the threaded versions of performance libraries. It does seem unlikely that you want these for a 4x4 matrix multiplication. You give the impression of using a lot of options without studying them, or even remembering to mention it.

Both libguide and libiomp have been available as alternates since 10.1. libiomp is the default now, in preparation for the future removal of libguide. libiomp supports both the libguide and (in the Linux version) the libgomp function calls.

Where I have seen a growth in code size with the latest compilers, it is due to more automatic "distribution" (splitting) of large loops which are at least partly vectorizable. I can only speculate; this may sometimes improve hyperthreading performance. Also, 10.1 sometimes failed to split loops automatically where it was needed; unnecessary splitting is less damaging. Directives may be used to prevent individual loops from being distributed. The latest compilers are cutting back on automatic unrolling, possibly to contain this growth trend. I have seen as much as a 3x increase in run time from the removal of unrolling; I think that is enough to submit specific problem reports. If none of this is relevant to you, at least I am trying to make the point that comments are useless without specifics.

0 Kudos
srimks
New Contributor II
1,033 Views
Quoting - tim18

If you don't want libiomp, don't ask for it. Don't set -parallel or -openmp, don't use the threaded versions of performance libraries. It does seem unlikely that you want these for a 4x4 matrix multiplication. You give the impression of using a lot of options without studying them, or even remembering to mention it.

Both libguide and libiomp have been available as alternates since 10.1. libiomp is the default now, in preparation for the future removal of libguide. libiomp supports both the libguide and (in the Linux version) the libgomp function calls.

Where I have seen a growth in code size with the latest compilers, it is due to more automatic "distribution" (splitting) of large loops which are at least partly vectorizable. I can only speculate; this may sometimes improve hyperthreading performance. Also, 10.1 sometimes failed to split loops automatically where it was needed; unnecessary splitting is less damaging. Directives may be used to prevent individual loops from being distributed. The latest compilers are cutting back on automatic unrolling, possibly to contain this growth trend. I have seen as much as a 3x increase in run time from the removal of unrolling; I think that is enough to submit specific problem reports. If none of this is relevant to you, at least I am trying to make the point that comments are useless without specifics.


Tim.

Probably, you can look the code as enclosed in earlier post http://software.intel.com/en-us/forums/showthread.php?t=62202

The code is designed in such a way that in the first phase it does simple matrix multiplication and in another it uses SSE intrinsic functions. My objective is to learn both auto-parallelization and vectorization (using compiler directives and SIMD SSE intrinsics) using this code, and to play with it. I am new to these areas, so sometimes my reasoning may not be relevant; I am learning it all on this forum.

I did use the command "icc -parallel -par-report3 matrix.c", and I am still working to understand what is right and wrong through the documents and the Intel forum.

E.g: In the same code where no "pragma" has been added, I am getting "LOOPS not parallelized" for below sequential code to start with -

for( i = 0; i < 4; i++ )
{
    a = (double) (rand() - RAND_MAX/2) / (double) (RAND_MAX);
    b = (double) (rand() - RAND_MAX/2) / (double) (RAND_MAX);
    c1 = c2 = 0.0;
}

My belief here would probably be:

The compiler does not know what rand() returns, so it has to assume that iterations may depend on each other. Also, aliasing can happen with "c1 = c2 = 0.0;", so I am getting "LOOPS not parallelized". I am thinking of trying "#pragma parallel" before the for loop.

Suggest me if you have any insights.

0 Kudos
TimP
Honored Contributor III
1,033 Views

Parallel random number generation is far too complicated to handle here. The simplest answer is to check whether one of the performance libraries (e.g. MKL) has a function such as you want. If you were to have each thread use its own copy of the same generator, using at least private seed and result values, the series used by each thread would differ only to the extent that you seeded them differently. What you have written isn't parallelizable, as it implies that a single random number generator is used. With a fairly simple loop of length 4, the compiler knows anyway that parallelization will slow it down.

0 Kudos
Eric_P_Intel
Employee
1,033 Views
This is a bit off the thread topic, but calls to rand() do depend on previous calls to rand(), and as Tim said, you need a seed for each thread. You could implement this with OpenMP, but I doubt auto-parallelization is possible.
See this article for an example SIMD RNG: http://software.intel.com/en-us/articles/fast-random-number-generator-on-the-intel-pentiumr-4-processor. I've used this in image-processing algorithms, and it's a huge performance gain.
- Eric
0 Kudos
jimdempseyatthecove
Honored Contributor III
1,033 Views
Quoting - srimks


Tim.

Probably, you can look the code as enclosed in earlier post http://software.intel.com/en-us/forums/showthread.php?t=62202

The code is designed in such a way that in the first phase it does simple matrix multiplication and in another it uses SSE intrinsic functions. My objective is to learn both auto-parallelization and vectorization (using compiler directives and SIMD SSE intrinsics) using this code, and to play with it. I am new to these areas, so sometimes my reasoning may not be relevant; I am learning it all on this forum.

I did use the command "icc -parallel -par-report3 matrix.c", and I am still working to understand what is right and wrong through the documents and the Intel forum.

E.g: In the same code where no "pragma" has been added, I am getting "LOOPS not parallelized" for below sequential code to start with -

for( i = 0; i < 4; i++ )
{
    a = (double) (rand() - RAND_MAX/2) / (double) (RAND_MAX);
    b = (double) (rand() - RAND_MAX/2) / (double) (RAND_MAX);
    c1 = c2 = 0.0;
}

My belief here would probably be:

The compiler does not know what rand() returns, so it has to assume that iterations may depend on each other. Also, aliasing can happen with "c1 = c2 = 0.0;", so I am getting "LOOPS not parallelized". I am thinking of trying "#pragma parallel" before the for loop.

Suggest me if you have any insights.

I know you are programming in C++, but maybe you can follow the lead from Fortran and use a function that generates a list (array) of random numbers (or in this case two lists). Then perform the scaling on the returned random numbers. This would give you repeatability and provide you with a means to vectorize the loops for your learning experience.

Jim Dempsey

0 Kudos
levicki
Valued Contributor I
1,033 Views

I would like to kindly ask the moderators/admins to remove all off-topic comments from this thread, starting from the one I reported as abusive and including this one once they clean it up -- or, if the board software allows it, to split the content into another thread.

This thread is about "What is wrong with Intel Compiler 11.0" in general (as a product), not about someone's specific code issues. That is what the Start New Discussion button is for.

0 Kudos
hylton__ron
Beginner
1,033 Views

4. Important behavior changes aren't properly documented in Release Notes (applies to all compiler releases so far, not just 11.0).

I think this is a huge issue. A while ago I switched from VS2003 to VS2008 and a bunch of programs that had worked for a long time no longer compiled. After submitting a bug report, it turned out that Intel had deliberately introduced major bugs which get activated when you use VS2008 (I don't know about VS2005) in the name of "Microsoft compatibility". There is no way to turn them off, so I asked Intel for documentation on them -- it's pretty hard to write code when you don't know the rules of the language (which is no longer C++, since Intel is deliberately violating the C++ Standard). That was 7 months ago; still no answer.


- Ron
0 Kudos
JenniferJ
Moderator
1,033 Views

With 14.0 we have a new online installer, so it should solve the huge-download issue. Well, it is just taking longer than expected to implement....

Jennifer

0 Kudos
SergeyKostrov
Valued Contributor II
1,033 Views
>>...it is just taking longer than expected to implement...

Thanks for the update, Jennifer. So, it is better late than never...
0 Kudos
TimP
Honored Contributor III
1,033 Views

And there is an option to fall back to the old installer, in case you have only a normal-quality online connection.

0 Kudos
Armando_Lazaro_Alami
1,033 Views

I agree with Igor regarding:

3. Serious regressions aren't addressed quickly enough (#524001 comes to mind).

4. Important behavior changes aren't properly documented in Release Notes (applies to all compiler releases so far,not just 11.0).

In the case of point 3: out of 4 updates, I found errors with 2 of them, so we had to uninstall and return to the previous version. Today we are afraid of any update; after a huge download and a long installation process, finding that the update breaks your projects is not fair.

In the case of point 4, I am very happy to know that other people share my impression. I came from other compilers and found the information regarding changes in each Intel compiler version very confusing and poor.

Unfortunately, I do not have version 9.1 to compare against, but now I am curious about point 5 in Igor's comment.

Thanks to Igor for pointing these problems.

0 Kudos
TimP
Honored Contributor III
1,033 Views

As to points 4 and 5, if you tuned your source code for 9.1 and didn't adopt any new CPUs or compiler options, it's likely that you won't see much performance gain. It's no secret that a major driver for the introduction of new compiler versions is to deliver value for new architectures.

With the increasing reliance on directives and occasional major changes between compiler versions, it's necessary to re-learn their use; in my view, this detracts from their advertised advantage.

Current compiler versions have more need of #pragma nofusion to stop fusion of relatively misaligned loops, or of vectorizable with non-vectorizable ones; prior to XE 2011, #pragma distribute point was used in that sense as well as others. This change was slipped in silently; one might think that release notes explaining such changes would help, or that issue submissions relative to the changes could have drawn out this information.

Also in current compiler versions, OpenMP 4 style directives are gradually replacing earlier ones, but the small individual steps haven't been documented. If the needed documentation is too difficult or controversial a task, this has diminished value until the transition is complete, which apparently won't happen this year.

0 Kudos
SergeyKostrov
Valued Contributor II
1,033 Views
>>...we are afraid of any update, after a huge download and long instalation process finding that the update break your projects is not fair...

You could try any update for any software on a dedicated test computer or in a VMware environment.

>>Unfortunately, I do not have a version 9.1 to compare...

It is still available for download, and I still use Intel C++ compiler version 8.1 ( Update 038 / Released in 2006 ).
0 Kudos
Armando_Lazaro_Alami
1,033 Views

Where can I find version 9.1 of the Intel compiler? Thanks.

0 Kudos
TimP
Honored Contributor III
1,033 Views

The procedure for downloading older compilers is the same for Windows, Linux, Fortran, or C++:

http://software.intel.com/en-us/articles/older-version-product

0 Kudos
SergeyKostrov
Valued Contributor II
921 Views
>>With 14.0, we have a new online-installer. So it should solve the huge download issue. Well, it is just taking longer than expected to implement ....
>>
>>Jennifer

If that new online installer will look and work like the Android SDK Manager ( which I have used for a long time with no issues or problems ), I'll be very impressed. This is how the Android SDK Manager looks:

androidsdkmanager.jpg

I wonder if any beta testing with real customers was done?
0 Kudos