Instruction execution on the Core processor

david_livshin1 · ‎06-26-2007

Hi,

"Intel 64 and IA-32 Architectures Optimization Reference Manual" lists issue ports and execution units for Core Microarchitecture ( see table 2-2 in 2.1.3.1 ) but there is no data about which execution units execute specific instructions ( even for 1 µop instructions ) - this is essential for efficient scheduling of the code ( note that for the NetBurst Microarchitecture this information is provided ). Where can I find this information?

Thank you.

David Livshin

http://www.dalsoft.com

Intel_Software_Netw1 · ‎06-26-2007

Our engineering contacts responded:

To be able to schedule ports/execution units exactly as the programmer intended at coding time is a noble paradigm. It is a great challenge that requires efforts not just in coding, but verification of what is accomplished vs. what was intended.

The challenge is especially steep with out-of-order microarchitecture. I think a general statement can be made that in order to succeed in coding assembly sequence for (nearly) perfect scheduling of ports/execution units at a source code level, you need to know all the initial and boundary conditions, plus constraints that can influence the operation of issue ports/execution units.

Does all of this information exist, and is all of this information useful in practice? A few things that might not be obvious to programmers are:

1. Detailed hardware design information is not necessarily meaningful at the software level. For example, there exists hardware design information regarding how many cycles certain operations will take at each pipe stage in a given microarchitecture, which the optimization manual does not cover. This information is not relevant to software performance nor instruction latency, although one may find it interesting in terms of how many cycles an instruction actually takes to execute from fetch to retirement.

2. Microarchitectures are designed with the well-known philosophy "Make the common case(s) fast", not make every case fast or the same. If one were indeed equipped with every piece of the multi-variable design parameters covering all the initial conditions, boundary conditions, constraints, and hazard penalty data of every sub-system in a microarchitecture, I would also venture to say that one of two things would happen: (i) using a small subset of such data to schedule instructions might lead to the observation that what was intended at coding time does not match what happened at runtime; (ii) an attempt to take enough of the hardware design parameters to form a more complete model of scheduling the ports/execution units of one specific microarchitecture would evolve into a life of its own and detract from the original objective of application development. Considering future changes to microarchitecture every two years, I don't think this type of effort would be practical.

So, we believe that to extract more performance from a given microarchitecture, it is more practical to take a feedback-loop-based approach to verify what was accomplished in execution/ports at runtime, vs. focusing on code-sequencing to achieve optimal scheduling of ports/execution.

In the latest revision of the Intel 64 and IA-32 Architectures Software Developer's Manual, we included a few more events for processors based on Intel Core microarchitecture that can measure the micro-ops that were issued in a given port. See event code A1H.
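As a rough illustration of the feedback-loop approach, the sketch below formats event code A1H as raw event strings of the kind accepted by the Linux perf tool. Only the event code comes from this thread. The unit-mask layout (one bit per issue port, bit n for port n) and the function names are assumptions made for illustration, and should be checked against the performance-monitoring event tables in the Software Developer's Manual before use.

```python
# Event code A1H is the one named in this thread.
EVENT_UOPS_PER_PORT = 0xA1


def raw_event(event: int, umask: int) -> str:
    """Format an event select and unit mask as a perf raw event (rUUEE).

    The unit mask occupies bits 8-15 and the event select bits 0-7 of the
    event-select register, so the hex string is umask followed by event.
    """
    return f"r{umask:02x}{event:02x}"


def per_port_events(ports=range(6)):
    """Return {port: raw event string} for issue ports 0..5.

    ASSUMPTION: unit-mask bit n selects port n. This mapping is not stated
    in the thread; verify it in the SDM event tables.
    """
    return {p: raw_event(EVENT_UOPS_PER_PORT, 1 << p) for p in ports}


if __name__ == "__main__":
    for port, code in per_port_events().items():
        print(f"port {port}: {code}")
```

The resulting strings could then be passed to a profiler, for example `perf stat -e r01a1,r02a1 ./app` (where `./app` is a placeholder for the program being measured), and the per-port counts compared against what the code sequence was intended to achieve.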

==

Lexi S.

Intel Software Network Support

http://www.intel.com/software

david_livshin1 · ‎06-27-2007

Will try to keep my response to the above short and focused, saying only that information I am looking for will ( no doubt ) prove to be essential in implementing x86 Core optimizer I am working on - this was the case with the x86 Pentium4 optimizer ( see this for more ) and, due to shorter latencies, is even more important for Core.

Please provide me with the data I need to do my work:

by what execution units of a Core processor the specific x86 instructions are executed.

Thank you,

David Livshin

http://www.dalsoft.com

Intel_Software_Netw1 · ‎06-27-2007

David,

We will contact you by email to help determine your precise needs, as we've been advised that blanket disclosure of execution unit assignment for all instructions in the public forums raises some confidentiality issues.

==

Lexi S.

Intel Software Network Support

http://www.intel.com/software

Intel_Software_Netw1 · ‎07-16-2007

Here is some information we are able to share after a conversation with our Application Engineers.

Some of the following can be inferred from Table 2-2 of the Intel 64 and IA-32 Architectures Optimization Reference Manual:

add, sub, and, xor, cmp, test: (register flavor) these are simple integer ALU operations, so Table 2-2 provides the details. They can be issued to either port 0, 1, or 5. When you use these instructions with load intent, the load operation will be issued from port 2. If you use them with store intent, the instruction decodes into 3 ops: the store address operation goes through port 3, the store data operation goes through port 4.

The branch execution unit is shown in Figure 2-1 but not marked; it is associated with port 5. However, the request for more detail for jmp and call is not very feasible. Jmp and call support many varieties of syntax, and the microarchitecture handles different flavors differently. Suffice it to say that short conditional jumps like jcc may be interpreted as a relatively simple operation that involves the JEU in port 5. The CISC-like forms of JMP and CALL are more complex; the tuning implication is that whenever you attempt to use far jumps or calls, you will likely create bubbles and/or encounter resource implications. A CISC-like instruction gets its long op flow from a ROM inside the CPU, and getting ops from that ROM is slower than having all 4 decoders decoding simple non-CISC instructions.

addsd, addpd, subsd, subpd are common SIMD FP add operations; Table 2-2 shows them in port 0.

mulsd, mulpd are in port 1.

andpd, xorpd are SIMD ALU operations, so three issue ports are available.
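For readers who want to use the port assignments above in a tool, one possible encoding is a simple lookup table. This is only a sketch of the statements made in this post for register forms. The table layout and the helper name `candidate_ports` are illustrative. The three ports for andpd and xorpd are assumed to be 0, 1, and 5, which the post does not spell out.

```python
# Issue ports per mnemonic, as stated in this post (register forms only).
ALU_PORTS = frozenset({0, 1, 5})

PORTS = {
    # Simple integer ALU operations: port 0, 1, or 5.
    "add": ALU_PORTS, "sub": ALU_PORTS, "and": ALU_PORTS,
    "xor": ALU_PORTS, "cmp": ALU_PORTS, "test": ALU_PORTS,
    # SIMD FP add operations: port 0, per the post.
    "addsd": frozenset({0}), "addpd": frozenset({0}),
    "subsd": frozenset({0}), "subpd": frozenset({0}),
    # SIMD FP multiply: port 1, per the post.
    "mulsd": frozenset({1}), "mulpd": frozenset({1}),
    # SIMD ALU operations: "three issue ports".
    # ASSUMPTION: these are ports 0, 1, and 5.
    "andpd": ALU_PORTS, "xorpd": ALU_PORTS,
}


def candidate_ports(mnemonic: str):
    """Return the set of issue ports for a mnemonic, or None if not listed."""
    return PORTS.get(mnemonic.lower())


if __name__ == "__main__":
    for name in ("add", "ADDSD", "mulpd", "jmp"):
        print(name, candidate_ports(name))
```

Returning `None` for unlisted mnemonics (such as jmp and call, which the post declines to detail) keeps the table from implying more than was disclosed.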

Here's some more information that may be useful for understanding the execution of certain instructions in the OOO engine. This information is specific to 65nm generation processors based on Intel Core microarchitecture.

1. lea, movhlps, movddup, unpckhpd are always executed through port 0

2. movapd, movsd can be dispatched through either port 0, 1, or 5

3. pshufd, punpckhbw contain ops that must execute through port 0 and port 5

4. movhpd contains ops that must execute through port 1

5. One of the ops in shufpd must execute through port 0
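A rough way to reason about the five items above is to count how many ops a code sequence forces onto each port. The sketch below is a simplified model, not a description of the actual out-of-order scheduler: it ignores dependencies, latencies, loads, and stores, and it models only the ops whose ports are stated in the list. Treating pshufd and punpckhbw as exactly one port-0 op plus one port-5 op is an assumption, as are all names in the code.

```python
from collections import Counter

ANY = (0, 1, 5)

# Each instruction is a list of ops; each op is a tuple of allowed ports.
# Only the ops whose ports are named in the list above are modelled.
OPS = {
    "lea": [(0,)], "movhlps": [(0,)], "movddup": [(0,)], "unpckhpd": [(0,)],
    "movapd": [ANY], "movsd": [ANY],
    # ASSUMPTION: exactly one op on port 0 and one on port 5.
    "pshufd": [(0,), (5,)], "punpckhbw": [(0,), (5,)],
    "movhpd": [(1,)],
    "shufpd": [(0,)],
}


def port_pressure(mnemonics):
    """Estimate ops per port with a greedy, most-constrained-first pass."""
    ops = [allowed for m in mnemonics for allowed in OPS[m]]
    ops.sort(key=len)  # place single-port ops before flexible ones
    load = Counter()
    for allowed in ops:
        # Pick the least-loaded allowed port; break ties by port number.
        port = min(allowed, key=lambda p: (load[p], p))
        load[port] += 1
    return dict(load)


if __name__ == "__main__":
    print(port_pressure(["lea", "movapd", "pshufd", "movapd"]))
```

In this model, a sequence heavy in lea, pshufd, and shufpd piles up on port 0, which suggests where to look first when the measured per-port counts are unbalanced.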

Please note that the binding of issue ports to execution units can vary between some steppings in 65nm products. There are even more significant changes inside the OOO engine (where the binding of execution units/issue ports is part of the OOO engine) in 45nm products. One must keep these facts in mind to balance the trade-off of potential performance gain vs. continuous maintenance over different processor generations when attempting to write asm code to explicitly control the scheduling of instructions.

In 65nm processors based on Intel Core microarchitecture, some of the SIMD data movement instructions (dealing with unpack, shuffle operations) employ shuffle units that process 64 bits of data in the OOO engine. Those situations (see item 3 above) are improved significantly in the 45nm processor generation based on Penryn with a 128-bit shuffle unit.

==

Lexi S.

Intel Software Network Support

http://www.intel.com/software
