Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

64bit code works significantly slower than 32bit code

romant73
Beginner
1,015 Views
Hi All,

Here are some benchmarks (the time the test takes to complete) for code compiled and tested on x86 and x64 platforms with MSVC and the Intel C++ Compiler.

MSVC 32 bit: 195.9 seconds
Intel 32 bit: 178.1 seconds (17.8 seconds faster, very good)

MSVC 64 bit: 194.9 seconds (execution time is almost the same as the MSVC 32 bit code)
Intel 64 bit: 187.3 seconds (about nine seconds slower than the Intel 32 bit code)

MSVC code does not degrade when compiled for x64, while Intel code becomes slower on x64. Are there any tricks to make 64-bit Intel code as fast as 32-bit Intel code?

Thanks in advance.
0 Kudos
12 Replies
Lingfeng_C_Intel
Employee
Thanks Romant,

Could you let us know which optimization options you have set in your project? For example, which optimization level do you use? Do you use SIMD optimization? Which target machine is the project configured for? All of these options affect your program's performance.

Hope this helps!

Thanks,
Wise

romant73
Beginner
Lingfeng_C_Intel
Employee
Thanks Romant,

Unfortunately, I can't download your png files from your URL. Could you send them to me through email? My email address is wise.chen@intel.com

Thanks,
Wise

Lingfeng_C_Intel
Employee
Thanks Romant,

I got your configuration screenshots from a team member and didn't see any issues in them. Two more questions:
1. Could you tell me the configuration of the machine you ran the test on?
2. Could you provide the 'Linker' options set in your project property pages?

Thanks,
Wise

romant73
Beginner
Wise,

Machine: Intel Core i7 920 CPU (overclocked to 3.33 GHz), XP64 English OS.

Linker command line (32-bit), with the list of input libraries omitted:
/INCREMENTAL:NO /nologo /NODEFAULTLIB:"libcmt.lib" /NODEFAULTLIB:"libcmtd.lib" /NODEFAULTLIB:"libcpmt.lib" /NODEFAULTLIB:"libcpmtd.lib" /TLBID:1 /DEBUG /SUBSYSTEM:WINDOWS /LARGEADDRESSAWARE /OPT:REF /OPT:ICF /ENTRY:"wWinMainCRTStartup" /MACHINE:X86 /FIXED:NO

Linker command line (64-bit), with the list of input libraries omitted:
/INCREMENTAL:NO /nologo /NODEFAULTLIB:"libcmt.lib" /NODEFAULTLIB:"libcmtd.lib" /NODEFAULTLIB:"libcpmt.lib" /NODEFAULTLIB:"libcpmtd.lib" /TLBID:1 /DEBUG /SUBSYSTEM:WINDOWS /LARGEADDRESSAWARE /OPT:REF /OPT:ICF /ENTRY:"wWinMainCRTStartup" /MACHINE:X64 /FIXED:NO

Interprocedural optimization is enabled in both configurations. Please, let me know if this info is not enough.

mecej4
Honored Contributor III
The timings that you display do not strike me as justifying the attribute "significantly slower". There are many compiler options that can be used to tune your application. A cursory look at your screenshots showed me that you had not selected the option to generate code specific to your CPU.

romant73
Beginner
Quoting mecej4
The timings that you display do not strike me as justifying the attribute "significantly slower". There are many compiler options that can be used to tune your application. A cursory look at your screenshots showed me that you had not selected the option to generate code specific to your CPU.

My question is: what must I do to make the 64-bit code as fast as the 32-bit code? I'm open to any suggestions and experiments, as the Intel C++ Compiler is a new product for me.

Dale_S_Intel
Employee
That is a pretty difficult question to answer without some representative example to analyze. Is there a particular part of the code that is noticeably slower? Could you post a small kernel that shows a similar problem?
Thanks!
Dale

romant73
Beginner
Unfortunately, I can't extract a piece of code to analyze ... I can only describe what the code does. It is a number of sets of nested loops over a number of arrays, each containing about 500 thousand floats. In other words, pure number crunching that involves simple arithmetic operations at most.
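
For illustration only, a stand-in kernel of roughly that shape might look like the sketch below. It is not the real code: the function names `crunch` and `time_crunch_ms`, the arithmetic, and the `window` parameter are all assumptions, chosen only to give nested loops of simple float operations over large arrays that can be timed in both builds.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical stand-in kernel (not the original code): for each position,
// combine a window of two input arrays using only simple float arithmetic.
// Returns a checksum so the compiler cannot discard the work.
float crunch(const std::vector<float>& a, const std::vector<float>& b,
             std::vector<float>& out, std::size_t window) {
    assert(a.size() == out.size() && b.size() == out.size());
    const std::size_t n = out.size();
    float checksum = 0.0f;
    for (std::size_t i = 0; i + window <= n; ++i) {      // outer loop over the arrays
        float acc = 0.0f;
        for (std::size_t k = 0; k < window; ++k) {       // inner loop: simple arithmetic
            acc += a[i + k] * b[i + k] - 0.5f * a[i + k];
        }
        out[i] = acc;
        checksum += acc;
    }
    return checksum;
}

// Times one call of crunch() in milliseconds with a monotonic clock.
double time_crunch_ms(const std::vector<float>& a, const std::vector<float>& b,
                      std::vector<float>& out, std::size_t window,
                      float& checksum) {
    const auto t0 = std::chrono::steady_clock::now();
    checksum = crunch(a, b, out, window);
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Calling `time_crunch_ms` on three vectors of 500000 floats (the array size mentioned above) with, say, a window of 64 gives a small test case that can be compiled for both targets and compared.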

Lingfeng_C_Intel
Employee
Thanks Romant,

It is hard for us to tune your code without your source code.
I looked at your C/C++ and Linker settings. From an optimization standpoint, please enable 'Interprocedural Optimization' under the Linker's Optimization settings and try again.

Hope this helps.

Thanks,
Wise

jimdempseyatthecove
Honored Contributor III
Quoting romant73
Unfortunately, I can't extract a piece of code to analyze ... I can only describe what the code does. It is a number of sets of nested loops over a number of arrays, each containing about 500 thousand floats. In other words, pure number crunching that involves simple arithmetic operations at most.

Try this:

Run a profiler to find the hot spots of your application. Select a few of the hottest for further consideration. Select the same functions for both the 32-bit and x64 builds. Set a breakpoint in the hot spot(s) in each configuration. Note that doing this in a fully optimized program may be difficult. Break on entry into the function, not on the hottest statement. If necessary, to avoid some inlining, you may need to add a dummy function call that is not inlined, break in that function, and then step out after the break.
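
A minimal sketch of such a dummy call follows. The names `NOINLINE`, `breakpoint_marker`, `g_marker_hits`, and `scale_and_sum` are made up for illustration; `__declspec(noinline)` is the MSVC spelling (the Intel compiler on Windows should accept it as well), and the attribute form is the GCC/Clang equivalent.

```cpp
#include <cassert>
#include <cstddef>

// Portable "do not inline" marker.
#if defined(_MSC_VER)
#  define NOINLINE __declspec(noinline)
#else
#  define NOINLINE __attribute__((noinline))
#endif

// Hypothetical marker function: set the breakpoint here, then step out
// to land in the caller you actually want to inspect.
volatile int g_marker_hits = 0;

NOINLINE void breakpoint_marker() {
    // The volatile write is a visible side effect, so the call is not dropped.
    g_marker_hits = g_marker_hits + 1;
}

// Hypothetical hot function standing in for one of the profiled hot spots.
float scale_and_sum(const float* data, std::size_t n, float factor) {
    assert(data != nullptr || n == 0);
    breakpoint_marker();                 // break inside this call, then step out
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        sum += data[i] * factor;
    }
    return sum;
}
```

Remove the marker call again before taking timings, since the extra call and the volatile write are themselves a small cost.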

When in the function containing the hot spot, open a disassembly window and copy its contents to an editor (or take a screenshot of the disassembled code). Do the same for the other configuration. In other words, get the disassembly for both the 32-bit and x64 builds. You can also compile with the option to produce an ASM listing (with code bytes).

By comparing the two you can see which is performing more work than the other and/or has more bytes per instruction.

Often you may find that the x64 build converts a 32-bit int to 64 bits while computing array indexes. In these situations, consider promoting the loop index to intptr_t, as this is an integer the size of an addressing register.
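
As a rough sketch of that promotion (the function names are invented for the example), the two loops below compute the same result; only the index type differs. Whether the generated code actually changes depends on the compiler and options, so compare the disassembly rather than assuming a gain.

```cpp
#include <cassert>
#include <cstdint>

// 32-bit index: on x64 the compiler may have to widen i to 64 bits
// before it can be used in address arithmetic.
float sum_strided_int(const float* a, int n, int stride) {
    assert(stride > 0);
    float s = 0.0f;
    for (int i = 0; i < n; i += stride) {
        s += a[i];
    }
    return s;
}

// Pointer-sized index: i already has the width of an addressing register,
// so no widening is needed on either 32-bit or 64-bit targets.
float sum_strided_intptr(const float* a, std::intptr_t n, std::intptr_t stride) {
    assert(stride > 0);
    float s = 0.0f;
    for (std::intptr_t i = 0; i < n; i += stride) {
        s += a[i];
    }
    return s;
}
```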

Another difference is that x64 almost always uses the XMM registers, whereas the 32-bit build may use them selectively. These are the registers used by the SSE instructions. While SSE is generally faster than the FPU, there are some cases where it is not. Keep this in mind when you compare the code.

Also, if you see 0x66 or 0x67 bytes in an instruction (usually at the front), these are the operand-size and address-size override prefixes. An excess of these increases the number of code bytes in your loops. At some point this may cause some loops to no longer fit within the L1 instruction cache.

Don't forget that the often aggressive inlining can also cause some loops to spill out of the L1 instruction cache.

Jim Dempsey

romant73
Beginner
Jim, thank you very much for describing the strategy. I will definitely try the low-level analysis.