Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Double-speed ALUs and dependencies



Some architectures have double-speed ALUs (e.g., on ports 0 and 1). These can often be used for simple arithmetic operations like add and sub.

A shift, on the other hand, may be executed in a normal-speed ALU.

Now, let's abstract.

Looking in Intel's "architecture optimization reference", say that the latency for an add is 1 and the throughput is 0.33 (the exact architecture doesn't matter for the reasoning here).

Does that mean I should divide those numbers by two, since the adds will be handled by the double-speed ALU, and consider that the latency is 0.5 and the throughput is 0.16 for one instruction?

Otherwise I see no point in replacing a shift by 2 with two adds, since the latency is the same and the throughput would be greater (0.33*2 vs 0.5).

The dependency is so small (1 cycle for both) anyway, but it could become decisive if the ADD timings should be divided by two. In that case, ADD would clearly be superior.

Could you confirm or refute my analysis, please?

Black Belt

I don't think that there are any "double-speed" ALUs in recent Intel processors. 

In the example you cited, the throughput of 0.33 simply indicates that there are three ALUs that can all execute an ADD instruction every cycle.  The latency of one cycle simply means that the result of each ADD operation is available for use as an input in the cycle after the ADD executes.  It typically requires carefully constructed tests to verify these behaviors.

There are clearly some cases for which the same effect can be obtained by different instructions.  In some of these cases, the set of useful instructions is supported by different numbers of execution ports.   For extremely tight code it is possible to see a difference in throughput between the various instruction choices, but under more typical circumstances (e.g., code with occasional cache misses), the overall throughput is not high enough for the difference in execution port count between the instructions to be significant in the overall execution time.



I was surprised because here is the quote:

The four ports through which μops are dispatched to execution units and to load and store operations are shown in Figure 2-6. Some ports can dispatch two μops per clock. Those execution units are marked Double

This is from Intel's architecture optimization manual from 2010 (I have to take Core 2 into account in my development as well). So maybe it is a legacy definition that I should not worry about after all.


Black Belt


Sometimes old information related to older CPUs is still retained in those manuals. Maybe that quoted sentence refers to the Pentium 4 ALU.


As far as my understanding goes, in the case of the floating-point adder and multiplier units, I suppose that different units can share the load and the available resources between them. I mean that for some floating-point mul operation, one multiplier can operate on the mantissas and the adder on the exponent part; after the results are available, some control signal is asserted and the results are combined and sent to the physical register.