Solved: Macro-fusion merges two instructions into a single micro-op?

111alan · ‎03-11-2020

The original doc is here, at 3.4.2.2:

https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

Isn't the "micro-op" supposed to be "macro-op"?

On this page (and on some other sites and in some papers), Intel's "macro-op" refers to a pre-decoded x86 instruction, which shouldn't be converted to a micro-op before it reaches the main decoder:

https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client)

Just wondering whether it's simply a typo, or whether I'm misinterpreting something.

Thx.

McCalpinJohn · ‎03-11-2020

The terminology used on the wikichip page is not consistent with how Intel uses the terms. They are trying to emphasize that x86 instructions can correspond to multiple micro-ops, but they should not be calling them "macro-ops" -- they should call them "instructions". Intel only uses the term "macro" in reference to "macro-op fusion".

I think that one reason that the terminology is confusing is that Intel's recent implementations are based on older implementations -- developed in an era with a lot fewer transistors. So some of the approaches look overly complex. For example, section 2.5.2.1 of the optimization reference manual discusses both "micro-fusion" and "macro-fusion" for the Sandy Bridge core. A sequence of transformations might be:

An x86 instruction includes more than one action (e.g., compute and memory reference)
The instruction decoder in *earlier* Intel processors would have produced a compute micro-op and a memory micro-op.
The instruction decoder in Sandy Bridge supports "complex micro-ops", so it can "fuse" what would have been two micro-ops in an earlier architecture to a single micro-op.

The "macro-op fusion" is slightly different from the "micro-fusion" because it is fusing micro-ops that came from different x86 instructions. The fusion of the compare and branch operations into a single micro-op reduces the dispatch bandwidth required, but probably more importantly also eliminates the single-cycle latency that might normally be expected between the execution of the compare instruction and the execution of the dependent conditional branch.
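
To make the counting concrete, here is a toy sketch of macro-op fusion in Python. It is only an illustration, not a description of the real hardware: the function name fuse_macro_ops and the two mnemonic sets are made up for this example, and the actual list of fusible pairs depends on the microarchitecture (see the optimization manual).

```python
# Toy model only: the real fusion rules depend on the microarchitecture.
FUSIBLE_FIRST = {"cmp", "test"}          # assumed set of fusible first instructions
COND_JUMPS = {"je", "jne", "jl", "jge"}  # assumed subset of conditional jumps


def fuse_macro_ops(instructions):
    """Return a list of uops, merging each adjacent fusible pair into one entry."""
    uops = []
    i = 0
    while i < len(instructions):
        cur = instructions[i]
        nxt = instructions[i + 1] if i + 1 < len(instructions) else None
        if cur in FUSIBLE_FIRST and nxt in COND_JUMPS:
            uops.append(cur + "+" + nxt)  # one fused compare-and-branch uop
            i += 2                        # both instructions are consumed
        else:
            uops.append(cur)              # treated as one uop in this toy model
            i += 1
    return uops


loop_body = ["add", "cmp", "jne"]
print(fuse_macro_ops(loop_body))  # ['add', 'cmp+jne']
```

Three instructions become two uops here. The pair is only merged when the conditional jump immediately follows the compare, which matches the "followed immediately by" condition in the description.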

111alan · ‎03-11-2020

Thank you for the reply.

I understand it a bit better now. So macro-op fusion fuses 2 (macro-)instructions into 1 complex micro-op, which can be executed directly?

I'm still wondering when this macro-fusion takes place: does it happen during decoding, or while the instructions are in the instruction queue?

McCalpinJohn · ‎03-12-2020

Yes, the "macro-op fusion" of the common case of a "compare" instruction followed immediately by a "conditional jump" instruction results in a single micro-op that requires a single dispatch to an execution port.

The best explanation that I have seen of the Intel pipelines is in the document "microarchitecture.pdf" available at https://www.agner.org/optimize/

The document describes Intel (and AMD) microarchitectures chronologically. I like this approach for three reasons:

The earlier implementations are less complex, so they are easier to understand.
The later implementations often have features that only make sense if you understand the history of the design.
Having the descriptions written by a single person makes the nomenclature and descriptive style much more uniform.

111alan · ‎03-13-2020

Thanks, these docs are quality work. I'll take a deep look into those for a few days.

111alan · ‎03-19-2020

I actually found an issue in microarchitecture.pdf. According to the "Intel® 64 and IA-32 Architectures Optimization Reference Manual" and "ECE 4750 Computer Architecture Intel Skylake", shouldn't the legacy decoder in Skylake be 5-way (1 complex + 4 simple) when fusion isn't considered (and more if fusion is considered)?

Agner's document says "There are four decoders, which can handle four instructions (five or six with fusion) generating up to four μops per clock cycle", but Intel's manual and every other source say "Legacy Decode Pipeline delivery of 5 uops per cycle to the IDQ compared to 4 uops in previous generations."

Which one is true?

Thank you.

McCalpinJohn · ‎03-19-2020

I suspect that the difference here is just a different way of counting complex uops, but it would take a fair bit of work to be sure....

The description of "micro-fusion" in the Sandy Bridge section of the optimization manual suggests that Sandy Bridge adds support for new "complex uops" that can include (for example) a compute uop and a memory uop from the same x86 instruction. These are still *dispatched* to the execution ports as separate uops, but count as a single uop of decoder output bandwidth. It is possible that these complex uops also save space in the decoded uop cache, but I have not seen any specific references to this.

I only count uops at dispatch, so I will see more uops going to the execution ports than are generated by the decoders (assuming complex uops are being generated, which is usually true for my codes....)
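
As a rough illustration of why the two counts differ, here is a small Python sketch. The table UOP_COUNTS and the function count_uops are hypothetical names invented for this example, and the per-instruction numbers are illustrative assumptions for a Sandy Bridge-style core, not measured values.

```python
# Hypothetical table: (uops at decoder output, uops at dispatch) per instruction form.
# The values are illustrative assumptions, not measurements.
UOP_COUNTS = {
    "add reg, reg":   (1, 1),  # plain ALU op, nothing to fuse
    "add reg, [mem]": (1, 2),  # load + add carried as one micro-fused uop
    "mov [mem], reg": (1, 2),  # store-address + store-data carried as one uop
}


def count_uops(instructions):
    """Return (decoder-output uops, dispatched uops) for a list of instructions."""
    decoded = sum(UOP_COUNTS[ins][0] for ins in instructions)
    dispatched = sum(UOP_COUNTS[ins][1] for ins in instructions)
    return decoded, dispatched


code = ["add reg, reg", "add reg, [mem]", "mov [mem], reg"]
print(count_uops(code))  # (3, 5)
```

In this sketch the decoders emit 3 uops while 5 go to the execution ports, so a width quoted at decoder output and a count taken at dispatch can disagree without either being wrong.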

Hmmm.... I wonder if this is related to the Haswell bug in counting retired instructions (which, if I recall correctly, also applies to uops retired). I suspect that all of the cases I tested would have included complex uops -- most were fine, but I had a few examples with consistently biased counter results....

111alan · ‎03-21-2020

I was referring to the descriptions in the Skylake sections of those documents, from which I quoted the two sentences. Even when fusion and complex instructions (by which I mean those that emit multiple uops when decoded) are left out of consideration, shouldn't the bandwidth of the legacy decoder be at least 5 uops per cycle in Skylake (5-way decode)? Otherwise there would have been no reason to introduce

"Legacy Decode Pipeline delivery of 5 uops per cycle to the IDQ compared to 4 uops in previous generations"

in the Skylake section of Intel's optimization manual, because the previous generation can also deliver more than 4 uops if complex instructions are counted. I also didn't find evidence of any other front-end bottleneck that would limit the legacy decoder to fewer than 5 uops, wherever you measure the width from.

I think Agner's problem is that he assumed the Skylake decoder is the same as Sandy Bridge's, or that he counted the retire bandwidth of the entire pipeline as the decoder bandwidth. But I can't be sure. Alternatively, all the other sources could be wrong, but that's not likely.

The documents I looked into:

https://www.csl.cornell.edu/courses/ece4750/2016f/handouts/ece4750-section-skylake.pdf

https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client)

BTW, is there a detailed description of the instructions-retired counter bug? I would like to look into it.

Thank you.

McCalpinJohn · ‎03-24-2020

I found my write-up on the Haswell instructions retired bug from 2016: https://www.agner.org/optimize/blog/read.php?i=452&v=t

In the case that I found, the loop clearly had 10 instructions which should have mapped to 12 uops, but the instructions retired counter incremented by 12 and the uops retired counter incremented by 14. In this case the counts were consistent, but the description of HSE71 in the Xeon E5 v3 specification update (Intel document number 330785) says that the mis-counting is not always repeatable.
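
The arithmetic for that loop can be written out as a short Python sketch. The numbers are the ones quoted above; the dictionary names are just for this example.

```python
# Per-iteration values for the loop described above.
expected = {"instructions": 10, "uops": 12}  # from inspecting the loop
observed = {"instructions": 12, "uops": 14}  # from the retired counters

# Absolute and relative overcount per loop iteration.
excess = {k: observed[k] - expected[k] for k in expected}
relative = {k: excess[k] / expected[k] for k in expected}

print(excess)  # {'instructions': 2, 'uops': 2}
for name, frac in relative.items():
    print(f"{name}: +{frac:.1%}")
```

Both counters are high by the same 2 events per iteration, which is 20% for instructions and about 16.7% for uops.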

The specification update for the "Desktop 4th Generation Intel Core Processor Family [...]" (document 328899) includes erratum HSD140, which looks the same as HSE71. This document includes 155 errata, vs 109 in the Xeon E5 v3 specification update, and it looks like many of the extras probably apply to the Xeon E5 v3 as well. The Xeon E3 1200 v3 specification update is another relevant reference, with erratum HSW141 mapping to HSE71 on the Xeon E5 v3.

111alan · ‎03-24-2020

Thank you for the info. But I also can't find SKD044 in the files.

Since the counters seem unreliable, I think I'd better stick to the stated values rather than measured values for now.