Hello,
In the Intel 64 and IA-32 Architectures Optimization Reference Manual, section B.3.7.2, "Understanding the Sources of the Micro-op Queue", it is said that UOPs come from the DSB, MITE and MS, and a 'typical distribution' is given. In the app I'm profiling, considerably more UOPs are delivered from the MS than the manual suggests is desirable, and the execution is clearly front-end bound.
The problem is, I don't understand why that happens. The manual reads:
A large portion of micro-ops coming from the microcode sequencer may be benign, such as complex instructions, or string operations, but can also be due to code assists handling undesired situations like Intel SSE to Intel AVX code transitions.
But I am pretty sure no SSE/AVX instructions are used at all, and neither denormals nor string operations could occur often enough to have any notable effect (the code mainly works with integer values).
Is there a complete list of instructions that actually cause the MS to insert UOPs into the queue? Any suggestions as to what I might have missed would also be most welcome.
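For anyone comparing their own numbers against the manual's 'typical distribution', here is a minimal sketch of how the per-source shares could be computed from raw counter values. It assumes you have already collected one uop count per source (on many Intel cores these come from events with names along the lines of IDQ.DSB_UOPS, IDQ.MITE_UOPS and IDQ.MS_UOPS, but the exact event names depend on the microarchitecture and the profiler). The function name and the numbers are hypothetical placeholders, not measurements.

```python
def uop_source_shares(counts):
    """Return each source's fraction of all uops delivered to the queue.

    `counts` maps a source label (e.g. "DSB", "MITE", "MS") to a raw
    uop count read from a performance counter.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no uops counted")
    return {source: n / total for source, n in counts.items()}


# Placeholder counts for illustration only; substitute real counter values.
example = {"DSB": 600, "MITE": 250, "MS": 150}
shares = uop_source_shares(example)

for source, share in shares.items():
    print(f"{source}: {share:.1%}")
```

A high MS share on its own does not say which instructions are responsible; it only tells you whether the distribution is worth investigating further.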
AFAIK, a complex instruction that decodes into more than 4 uops is sent to the microcode sequencer (MS). I cannot remember exactly which CPU architecture, Pentium or a later design, first incorporated this feature. I would recommend searching the Google Patents database with the keyword "Intel CPU MicroSequencer".
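As a rough illustration of that rule of thumb, the sketch below flags instructions whose uop count exceeds 4 as likely MS candidates. The threshold is the one stated above, not a documented guarantee, and the instruction names and uop counts are made-up placeholders; real counts would have to come from a per-microarchitecture instruction table.

```python
# Threshold from the rule of thumb above: more than 4 uops goes to the MS.
MAX_DECODER_UOPS = 4


def likely_from_ms(uop_count):
    """Heuristic only: True if the uop count exceeds the decoder limit."""
    return uop_count > MAX_DECODER_UOPS


def flag_ms_candidates(uops_by_instruction):
    """Return the sorted names of instructions that exceed the limit."""
    return sorted(
        name
        for name, uops in uops_by_instruction.items()
        if likely_from_ms(uops)
    )


# Hypothetical instruction names and uop counts, for illustration only.
sample = {"insn_a": 1, "insn_b": 4, "insn_c": 9}
candidates = flag_ms_candidates(sample)
print(candidates)
```

Running the hot loop's instruction mix through a filter like this gives a short list of instructions to look at first.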
I don't know of any precise list of instructions that come from the MS. However, the latency of an instruction should be a good indicator. You can find instruction latencies in Appendix C of the Intel® 64 and IA-32 Architectures Optimization Reference Manual.
Thomas, thank you for the reply.
In Appendix C, section C.3.2, footnote 1 reads:
Latency information for many instructions that are complex (> 4 μops) are estimates based on
conservative (worst-case) estimates. Actual performance of these instructions by the out-of-order
core execution unit can range from somewhat faster to significantly faster than the latency data
shown in these tables.
Does this mean that a latency of 4+ cycles implies the UOPs are coming from the MS?
Agner Fog's "Instruction Tables" document lists the number of uops associated with each x86 instruction for a wide range of IA32 and x86_64 processors. Some of the uop counts have ranges, but it is certainly possible that the ones with single values might display different uop counts under exceptional conditions.
The "Instruction Tables" document is available in PDF format at: http://www.agner.org/optimize/instruction_tables.pdf
Lots of other detailed x86 resources are available in the parent directory: http://www.agner.org/optimize/
I suppose that complex instructions like vsqrtps or fsin are probably executed by microcode injected by the MicroSequencer.
Hello,
It took me quite some time, but yes: it seems that all UOPs with 4+ cycle latencies are handled by MS.
Thank you for your expertise!
Can you show a portion of assembly code from your profiled application that contains complex uops?
