Here is a question received by Intel Software Network Support, followed by the response provided by our Application Engineers:
Q. I want to know how to optimize compiled/assembled programs. When I compile my program with Microsoft Visual C++ 7.0, I get some strange code, and I wonder whether that code is the fastest possible. Which of these is faster:
This
add ecx,ecx - 2 bytes
add ecx,ecx - 2 bytes
add ecx,ecx - 2 bytes
or this
shl ecx,3 - 3 bytes
This
xor eax,eax - 2 bytes
xor edx,edx - 2 bytes
or this
cdq - 1 byte
xchg eax,edx - 1 byte
cdq - 1 byte
When I change mov eax,0 to xor eax,eax, I have to pad the 3 bytes left over after the xor with NOPs (90 90 90). Is this actually faster than mov eax,0? If it is not, would it be faster if I somehow closed up the code (removed those 3 bytes)?
Here is some "unusable code" that I have also found in some programs:
mov ecx,1
mov ??????,ecx
mov ecx,1
mov ??????,ecx
---------------
mov eax,ebx
mov ebx,eax - compiled by Delphi 7.0
-----------------
One more question.
If the code is smaller (measured in bytes) would it be faster too?
A. The efficiency of each assembly code sequence depends on the processor it runs on. Please refer to the IA-32 Intel Architecture Optimization Reference Manual for details. In summary, on the Pentium III or Pentium M processor, the shl instruction is faster than three add instructions, because add and shl each take one clock cycle to execute on those processors. On the Pentium 4 processor, the three add instructions are faster than the shl instruction, because the Pentium 4 is optimized for simple add instructions but not for shift instructions. Instruction size is not a factor here, because all of these processors can load 4 bytes at a time. Inefficiencies come into play when instructions exceed 4 bytes in length.
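As a minimal sketch of why the two sequences in the question are interchangeable, the C functions below mirror them on a 32-bit unsigned value. The function names are hypothetical and chosen only for illustration. The sketch shows that the results match, not which one is faster; timing depends on the processor, as described above.

```c
#include <stdint.h>

/* Mirrors three "add ecx,ecx" instructions: each add doubles the value. */
uint32_t times8_with_adds(uint32_t ecx) {
    ecx += ecx;  /* x2 */
    ecx += ecx;  /* x4 */
    ecx += ecx;  /* x8 */
    return ecx;
}

/* Mirrors "shl ecx,3": a single shift left by three bit positions. */
uint32_t times8_with_shift(uint32_t ecx) {
    return ecx << 3;
}
```

Both functions wrap around modulo 2^32 in the same way, so they agree even when the multiplication overflows. An optimizing compiler is free to pick either form for the target processor.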
With respect to the second sequence of instructions, the Optimization Reference Manual recommends (on page 2-69) that the CDQ instruction be replaced with the XOR instruction if the operands are known to be positive. In other words, the only time you should use CDQ is when you want to extend a negative number from 32 to 64 bits. Otherwise, the XOR instruction is more efficient. Comparing execution times from Appendix C (again, instruction size is not a factor because they are all less than 4 bytes), we find that the XCHG instruction takes 1.5 clock cycles to execute, while XOR takes either 1 clock cycle (Pentium III) or 0.5 clock cycles (Pentium 4) to execute. Thus, it is always faster to use two XOR instructions instead of the CDQ, XCHG, CDQ sequence.
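To illustrate why the "known to be positive" condition matters, here is a small C model of the two sequences. The `Regs` struct and the function names are hypothetical; the model tracks only EAX and EDX and ignores flags and timing.

```c
#include <stdint.h>

/* Hypothetical two-register model; only EAX and EDX matter here. */
typedef struct {
    uint32_t eax;
    uint32_t edx;
} Regs;

/* CDQ: fill EDX with copies of the sign bit of EAX. */
static void cdq(Regs *r) {
    r->edx = (r->eax & 0x80000000u) ? 0xFFFFFFFFu : 0u;
}

/* XCHG EAX,EDX: swap the two registers. */
static void xchg_eax_edx(Regs *r) {
    uint32_t tmp = r->eax;
    r->eax = r->edx;
    r->edx = tmp;
}

/* The CDQ / XCHG / CDQ sequence from the question. */
Regs cdq_xchg_cdq(uint32_t initial_eax) {
    Regs r = { initial_eax, 0u };
    cdq(&r);
    xchg_eax_edx(&r);
    cdq(&r);
    return r;
}

/* The XOR EAX,EAX / XOR EDX,EDX pair: always zero, whatever EAX held. */
Regs xor_pair(uint32_t initial_eax) {
    Regs r = { initial_eax, 0u };
    r.eax ^= r.eax;
    r.edx ^= r.edx;
    return r;
}
```

In this model, `cdq_xchg_cdq` leaves both registers at zero only when the sign bit of the initial EAX is clear. If the sign bit is set, both end up as 0xFFFFFFFF, so the two sequences are equivalent only for non-negative inputs.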
In general, XOR EAX, EAX is faster than MOV EAX, 0, and it is also shorter.
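The size difference also explains the three 90 bytes mentioned in the question. The sketch below assumes the common 32-bit encodings, B8 followed by a 4-byte immediate for MOV EAX, 0 and 33 C0 for XOR EAX, EAX. The helper name `patch_mov_eax_0` is hypothetical; it models an in-place patch that must keep the code the same length.

```c
#include <string.h>

/* Replaces a 5-byte "mov eax,0" at `code` with a 2-byte "xor eax,eax"
 * and pads the rest with NOP (0x90) so the length stays unchanged.
 * Returns the number of NOP bytes written, or -1 if `code` does not
 * start with the assumed "mov eax,0" encoding. */
int patch_mov_eax_0(unsigned char *code) {
    static const unsigned char mov_eax_0[5]   = { 0xB8, 0x00, 0x00, 0x00, 0x00 };
    static const unsigned char xor_eax_eax[2] = { 0x33, 0xC0 };

    if (memcmp(code, mov_eax_0, sizeof mov_eax_0) != 0) {
        return -1;
    }
    memcpy(code, xor_eax_eax, sizeof xor_eax_eax);
    memset(code + sizeof xor_eax_eax, 0x90,
           sizeof mov_eax_0 - sizeof xor_eax_eax);
    return (int)(sizeof mov_eax_0 - sizeof xor_eax_eax);
}
```

The padding is needed only when patching an existing binary, where later code must stay at the same addresses. A compiler or assembler that emits the XOR form directly does not need it.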
Finally, the unusable code sequence that you found is due to the compiler which you used. You should file a bug with the compiler vendor to see if they can correct this. I would recommend that you try the Intel Compiler and see what kind of code it generates. We are confident that the Intel Compiler will generate the most efficient code for your x86 system. You can download a free evaluation copy to try out. Good luck!
==
Message Edited by intel.software.network.support on 12-07-2005 04:49 PM