Q&A: Processor stall and optimizing assembled programs

Intel_Software_Netw1 · ‎03-29-2005

Here is a question received by Intel Software Network Support, followed by the response provided by our Application Engineers:

Q. I want to know how to optimize compiled/assembled programs. When I compile a program with Microsoft Visual C++ 7.0 I get some strange code, and I wonder whether that code is really the fastest. So which is faster:

This
add ecx,ecx - 2 bytes
add ecx,ecx - 2 bytes
add ecx,ecx - 2 bytes

or this
shl ecx,3 - 3 bytes

This
xor eax,eax - 2 bytes
xor edx,edx - 2 bytes

or this
cdq - 1 byte
xchg eax,edx - 1 byte
cdq - 1 byte

When I replace mov eax,0 with xor eax,eax, I have to fill the 3 bytes left over after the xor with NOPs (90 90 90). So is this actually faster than mov eax,0? If it is not faster, would it be faster if I somehow closed up the code (removed those 3 bytes)?

Here is some "unusable code" that I have also found in some programs:

mov ecx,1
mov ??????,ecx
mov ecx,1
mov ??????,ecx
---------------
mov eax,ebx
mov ebx,eax - compiled by Delphi 7.0
-----------------
One more question.
If the code is smaller (measured in bytes) would it be faster too?

A. The efficiency of the different assembly code segments depends on the processor they execute on. Please see the IA-32 Intel Architecture Optimization Reference Manual for details. In summary, on the Pentium III or the Pentium M processor, the shl instruction will be faster than three add instructions. On the Pentium 4 processor, the three add instructions will be faster than the shl instruction, because the Pentium 4 is optimized for simple add instructions but not for shift instructions. On the Pentium III and the Pentium M processors, add and shl each take one clock cycle to execute, so a single shl beats three adds. Instruction size is not a factor here, because all of these processors can load 4 bytes at a time. Inefficiencies come into play when instructions exceed 4 bytes in length.
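As a sanity check that the two sequences from the first question compute the same value, here is a minimal C sketch that models ecx as a 32-bit unsigned integer. The function names are hypothetical, and the model says nothing about timing or processor flags; it only shows that three doublings and a shift by 3 agree, including when the result wraps around.

```c
#include <assert.h>
#include <stdint.h>

/* Models `add ecx,ecx` executed three times: each add doubles the register. */
uint32_t times8_by_adds(uint32_t ecx) {
    ecx += ecx;   /* ecx * 2 */
    ecx += ecx;   /* ecx * 4 */
    ecx += ecx;   /* ecx * 8 */
    return ecx;
}

/* Models `shl ecx,3`: shift left by three bits, i.e. multiply by 8. */
uint32_t times8_by_shift(uint32_t ecx) {
    return ecx << 3;
}

/* Compares both models on a few values, including ones that overflow 32 bits. */
void check_times8(void) {
    const uint32_t samples[] = {0u, 1u, 5u, 0x20000000u, 0x7FFFFFFFu, 0xFFFFFFFFu};
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; ++i) {
        assert(times8_by_adds(samples[i]) == times8_by_shift(samples[i]));
    }
}
```

Because the results are identical, the choice between the two forms comes down to which one the target processor executes faster.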

With respect to the second sequence of instructions, the Optimization Reference Manual recommends (on page 2-69) that the CDQ instruction be replaced with the XOR instruction if the operands are known to be positive. CDQ copies the sign bit of EAX into every bit of EDX, so it clears EDX only when EAX is not negative. In other words, the only time you should use CDQ is when you want to sign-extend a possibly negative number from 32 to 64 bits. Otherwise, the XOR instruction is more efficient. Comparing execution times from Appendix C (again, instruction size is not a factor because all of these instructions are shorter than 4 bytes), we find that the XCHG instruction takes 1.5 clock cycles to execute, while XOR takes either 1 clock cycle (Pentium III) or 0.5 clock cycles (Pentium 4). Thus, it is always faster to use two XOR instructions instead of the CDQ, XCHG, CDQ sequence.
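To make the "known to be positive" condition concrete, here is a small C sketch that models EAX and EDX as a struct and steps through both sequences. The Regs type and the function names are hypothetical and exist only for illustration. It shows that the CDQ, XCHG, CDQ sequence clears both registers only when the sign bit of EAX is clear.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the two registers involved. */
typedef struct {
    uint32_t eax;
    uint32_t edx;
} Regs;

/* cdq: fill EDX with copies of the sign bit of EAX. */
static Regs cdq(Regs r) {
    r.edx = (r.eax & 0x80000000u) ? 0xFFFFFFFFu : 0u;
    return r;
}

/* xchg eax,edx: swap the two registers. */
static Regs xchg_eax_edx(Regs r) {
    uint32_t tmp = r.eax;
    r.eax = r.edx;
    r.edx = tmp;
    return r;
}

/* The 3-byte sequence from the question: cdq / xchg eax,edx / cdq. */
Regs cdq_xchg_cdq(Regs r) {
    return cdq(xchg_eax_edx(cdq(r)));
}

/* The 4-byte sequence from the question: xor eax,eax / xor edx,edx. */
Regs xor_xor(Regs r) {
    r.eax = 0u;
    r.edx = 0u;
    return r;
}
```

With a non-negative EAX both functions return zero in both registers. With the sign bit set, cdq_xchg_cdq leaves 0xFFFFFFFF in both registers, so the short sequence is not a general-purpose way to clear them.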

In general, XOR EAX, EAX will be faster than MOV EAX, 0, and it is also shorter.
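The 3 padding bytes mentioned in the question follow directly from the instruction encodings. The sketch below lists the standard 32-bit encodings as byte arrays; the array and function names are hypothetical, and 33 C0 is one of two valid encodings of xor eax,eax (31 C0 is the other, equally 2 bytes long).

```c
#include <assert.h>
#include <stddef.h>

/* mov eax,0: opcode B8 followed by a 32-bit immediate, 5 bytes in total. */
static const unsigned char MOV_EAX_0[] = {0xB8, 0x00, 0x00, 0x00, 0x00};

/* xor eax,eax: opcode 33 plus a ModRM byte, 2 bytes in total. */
static const unsigned char XOR_EAX_EAX[] = {0x33, 0xC0};

/* Number of single-byte NOPs (opcode 90) needed to fill the gap when
   patching the xor over the mov in place. */
size_t nop_padding_needed(void) {
    return sizeof MOV_EAX_0 - sizeof XOR_EAX_EAX;
}
```

Keep in mind that xor eax,eax also writes the flags, which mov does not, so an in-place swap is only safe where no later instruction depends on the earlier flag values.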

Finally, the unusable code sequences that you found come from the compiler you used. You should file a bug with the compiler vendor to see if they can correct this. I would recommend that you try the Intel Compiler and see what kind of code it generates. We are confident that the Intel Compiler will generate the most efficient code for your x86 system. You can download a free evaluation copy to try out. Good luck!

==

Lexi S.

Intel Software Network Support

http://www.intel.com/software

Message Edited by intel.software.network.support on 12-07-2005 04:49 PM