I was trying to measure the latency/throughput of vpxor and found that if I repeatedly execute:
vpxor %xmm0,%xmm4,%xmm4
vpxor %xmm2,%xmm4,%xmm4
vpxor %xmm0,%xmm4,%xmm4
vpxor %xmm2,%xmm4,%xmm4
I get a throughput of 0.5 instructions per cycle, which implies a latency of 2 cycles. However, if I run the following:
vpxor %xmm0,%xmm4,%xmm4
vpxor %xmm4,%xmm0,%xmm0
vpxor %xmm0,%xmm4,%xmm4
vpxor %xmm4,%xmm0,%xmm0
I measure a throughput of 1 instruction per cycle and a latency of 1 cycle.
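For reference, here is the arithmetic behind the latency figures above, as a minimal sketch. It assumes each sequence forms a single serial dependency chain (every vpxor reads the result of the previous one), so the per-instruction latency is just the reciprocal of the measured instructions per cycle. The function names are my own and purely illustrative.

```python
def ipc_from_counts(instructions: int, cycles: int) -> float:
    """Instructions per cycle from raw counter values (hypothetical helper)."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles


def chain_latency_cycles(ipc: float) -> float:
    """Latency per instruction implied by the IPC of a fully serialized chain.

    Assumption: only one instruction of the chain can be in flight at a time,
    so each instruction takes 1 / IPC cycles on average.
    """
    if ipc <= 0:
        raise ValueError("ipc must be positive")
    return 1.0 / ipc
```

With my numbers, an IPC of 0.5 gives 2 cycles for the first sequence and an IPC of 1.0 gives 1 cycle for the second.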
Why am I observing this difference? Is it the SC scheduling of the operations? I also observe that the operations flow evenly down all 3 pipes that can execute vpxor (pipes 0, 1 and 5).
Lastly, there's no SC stall when measuring RESOURCE STALLS with PMC 0xA2. Your PMC stats for 0xA2 imply that you're stalled by your ROB. If your scheduler is 54 entries as documented, wouldn't you stall on the SC before you stall on the ROB, as described in your optimization manual?
Thanks for any valuable pointers on the insightful questions I raise above.
Perfwise