What port(s) are referred to by the "Cycles of Port X Utilized"?

Curtis_H_ · ‎05-22-2015

The General Exploration analysis in VTune 2015 produces several columns that refer to port utilization, for example "Cycles of 1 Port Utilized". The documentation on these columns is, well, less than helpful. What are the ports? What do they do? If code is heavily using the ports, what can be done about it?

McCalpinJohn · ‎05-25-2015

The execution ports are described in Chapter 2 of the Intel Optimization Reference Manual (document 248966, revision 030, September 2014).

For Haswell family cores, Figure 2-1 on page 2-2 shows the 8 execution ports, which provide the interfaces between instruction issue and the various functional units. Several of the execution ports connect to a single functional unit (e.g., the load and store units), while Ports 0, 1, 5, and 6 each connect to many functional units.

Similar information is provided for the Sandy Bridge (and Ivy Bridge) processors in Figure 2-4 on page 2-9.

If a port is heavily used, it is most often because your code is doing operations that map to one or more of the functional units that are reached through that port. For example, a code that is dominated by floating point additions will issue most of its uops to Port 1. Other instructions (such as incrementing pointers and the compare and branch instructions for loop control) could also be issued to the "ALU" function behind Port 1, but the hardware will dynamically select an unused port for these instructions if one is available, so they will almost always use Ports 0, 5, and 6. So heavy use in this case is not necessarily something you can "fix", but it can be very useful for telling you that you are already close to the maximum execution rate of the processor for your mix of instructions.
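To make the "close to the maximum execution rate" point concrete, here is a rough sketch of a port-pressure bound. It is only a toy model, not how VTune computes its columns: the function name `min_cycles_per_iteration` and the uop counts are made up for illustration, and the port assignments (FP add on Port 1, ALU uops on Ports 0, 1, 5, 6, loads on Ports 2 and 3) are the Haswell-style ones mentioned above. It assumes each port accepts at most one uop per cycle and ignores dependencies, front-end limits, and cache misses.

```python
from itertools import combinations

def min_cycles_per_iteration(uop_mix):
    """Lower bound on cycles per loop iteration from port pressure alone.

    uop_mix is a list of (count, allowed_ports) pairs (hypothetical input format).
    For every subset of ports, the uops that can ONLY run on ports inside that
    subset must share those ports, so count / len(subset) bounds the cycle count.
    """
    ports = sorted(set().union(*(allowed for _, allowed in uop_mix)))
    bound = 0.0
    for size in range(1, len(ports) + 1):
        for subset in combinations(ports, size):
            allowed_here = set(subset)
            confined = sum(count for count, allowed in uop_mix
                           if set(allowed) <= allowed_here)
            bound = max(bound, confined / size)
    return bound

# Hypothetical loop body: the counts are invented for this example.
fp_add_loop = [
    (4, {1}),            # 4 floating point adds, Port 1 only
    (2, {0, 1, 5, 6}),   # pointer increment + compare/branch style ALU uops
    (2, {2, 3}),         # 2 loads
]

print(min_cycles_per_iteration(fp_add_loop))  # 4.0: Port 1 is the limit
```

In this made-up mix the four adds alone pin Port 1 for four cycles, so no rearrangement of the other uops can beat four cycles per iteration. If a measured loop runs near that figure, heavy use of Port 1 is a sign the code is already near its limit rather than a problem to fix.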

There are some cases for which alternate instruction choices can change the ports used, and potentially allow some increase in performance. The example I have seen discussed in the most detail has to do with rearranging data in the SIMD registers. These "shuffle" instructions can only be issued on Port 5 in recent processors, which limits performance to one operation per cycle. However, the same effect can often be obtained by re-loading the data with a different offset into the SIMD registers. This uses the load capability of Ports 2 and 3, so the loads can execute at the same time as any remaining shuffle instructions on Port 5.
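As a rough illustration of why the reload trick gives the same data, here is a small model using plain Python lists in place of SIMD registers. The helpers `load` and `shift_across` are hypothetical stand-ins, not real intrinsics, and the 4-lane width is an assumption. Real code would use an unaligned vector load in place of the shuffle.

```python
LANES = 4  # assumed register width in elements (e.g. four 32-bit floats)

def load(memory, offset):
    """Model a SIMD load of LANES consecutive elements starting at offset."""
    return memory[offset:offset + LANES]

def shift_across(lo, hi, shift):
    """Model a shuffle that picks LANES elements from the register pair
    (lo, hi), starting `shift` lanes into lo."""
    return (lo + hi)[shift:shift + LANES]

memory = list(range(12))

# Shuffle-based version: two aligned loads plus one shuffle (Port 5).
via_shuffle = shift_across(load(memory, 0), load(memory, 4), 1)

# Reload-based version: one extra load at offset 1 (Ports 2 and 3), no shuffle.
via_reload = load(memory, 1)

print(via_shuffle == via_reload)  # True
```

Both paths produce the elements at offsets 1 through 4. The difference is only which execution ports do the work, which is why the swap can help when Port 5 is the bottleneck and the load ports have spare capacity.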

A comprehensive list of the mapping of instructions to execution ports for nearly every x86 processor ever made is included in the document "instruction_tables.pdf" at www.agner.org/optimize/. You will probably need the document "microarchitecture.pdf" from the same site to make sense of these tables.