Solved: Error in pseudo-code for RDPMC in SWDM Volume 2

McCalpinJohn · ‎06-19-2018

I am pretty sure there is a typo (and an inconsistency in notation) in the pseudo-code for the RDPMC instruction in Volume 2 of the SW Developer's Manual. (I am working from document 325383-067, May 2018.)

The first piece of the pseudo-code is for Intel processors that support architectural performance monitoring, and is trying to show how bit 30 of the counter number in %ecx is used to determine whether the remainder of the bits in %ecx are used to select a fixed-function counter number or a programmable counter number.

IF (ECX[30] = 1 and ECX[29:0] in valid fixed-counter range)
        EAX ← IA32_FIXED_CTR(ECX)[30:0];
        EDX ← IA32_FIXED_CTR(ECX)[MSCB:32];
ELSE IF (ECX[30] = 0 and ECX[29:0] in valid general-purpose counter range)
        EAX ← PMC(ECX[30:0])[31:0];
        EDX ← PMC(ECX[30:0])[MSCB:32];

Note the difference in syntax between the pairs of assignments to EAX and EDX -- the second pair looks correct. The first pair uses an ambiguous notation, but in the first statement the "[30:0]" is incorrect whether it is in reference to the bits of ECX (which should be [29:0]) or in reference to the bits of the counter (which should be [31:0]).

Using the second pair as a guide to the syntax, the first pair should probably be:

        EAX ← IA32_FIXED_CTR(ECX[29:0])[31:0];
        EDX ← IA32_FIXED_CTR(ECX[29:0])[MSCB:32];

This confusion is repeated in the first pair of assignments in the second section of pseudo-code ("Intel Core 2 Duo processor family [...]") and in the first pair of assignments in the third section of pseudo-code ("P6 family processors [...]").
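
To make the intended reading concrete, here is a minimal C model of the corrected selection logic. It is only a sketch: the `counter_file` struct, the array sizes, the `width` field, and the name `rdpmc_model` are all made up for illustration, and bit 31 of ECX is simply ignored here. It does not execute RDPMC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical counter file; the sizes are assumptions for illustration. */
enum { NUM_FIXED = 3, NUM_GP = 4 };

typedef struct {
    uint64_t fixed[NUM_FIXED];  /* stands in for IA32_FIXED_CTRn */
    uint64_t gp[NUM_GP];        /* stands in for PMCn */
    unsigned width;             /* counter width in bits (MSCB + 1) */
} counter_file;

/* Returns false where the real instruction would raise #GP. */
static bool rdpmc_model(const counter_file *c, uint32_t ecx,
                        uint32_t *eax, uint32_t *edx)
{
    bool     is_fixed = (ecx >> 30) & 1u;   /* ECX[30] selects the bank  */
    uint32_t index    = ecx & 0x3FFFFFFFu;  /* ECX[29:0] is the index    */
    uint64_t value;

    if (is_fixed && index < NUM_FIXED)
        value = c->fixed[index];
    else if (!is_fixed && index < NUM_GP)
        value = c->gp[index];
    else
        return false;

    assert(c->width > 32 && c->width < 64);
    value &= (UINT64_C(1) << c->width) - 1; /* keep counter bits [MSCB:0] */
    *eax = (uint32_t)value;                 /* counter bits [31:0]        */
    *edx = (uint32_t)(value >> 32);         /* counter bits [MSCB:32]     */
    return true;
}
```

In this reading, the index always comes from ECX[29:0] and the EAX result always comes from counter bits [31:0], for both the fixed-function and the general-purpose case.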

McCalpinJohn · ‎06-19-2018

While I am on the topic of RDPMC....

In Volume 1 of the SWDM (253665-067), section 3.4.1.1 says that in 64-bit mode:

32-bit operands generate a 32-bit result, zero-extended to a 64-bit result in the destination general-purpose register.

But it is not clear what behavior should be expected for instructions that have implicit outputs, such as RDPMC. For both the RDMSR and RDTSC instructions, the low-order 32 bits of the result are put in EAX and the high-order 32 bits of the result are put in EDX, and in those cases the instruction descriptions in Volume 2 of the SWDM say very clearly that the high-order 32 bits of the RAX and RDX registers are cleared when this happens. For the RDPMC instruction, I see no mention of this behavior.

The Intel compiler generates a pair of instructions after each RDPMC:

     MOV %edx, %edx
     MOV %eax, %eax

These are clearly covered by the "zero-extend" rule in Section 3.4.1.1 of Volume 1 of the SWDM, suggesting that the RDPMC instruction is not guaranteed to clear the high-order 32 bits of RAX and RDX on execution.

Request: It would be nice for this to be clarified in the instruction description in Volume 2.

As an aid to code optimization, it would also be nice to know if RDPMC is at least guaranteed not to set any of the high-order bits in RAX and/or RDX --- that would allow me to clear the 64-bit registers above the loop and not repeat that clearing operation inside the loop.

I noticed that gcc/6.3.0 generates 8 instructions instead of 10 instructions (icc 2018 update 2) for a loop that repeatedly executes RDPMC, merges the output to a 64-bit register and then stores it to an array. The differences are:

gcc/6.3.0 recognizes that %ecx (containing the counter number) is not modified in the loop, so it is hoisted. icc/2018.2 copies the counter number from a temporary register back into %ecx for every iteration.
gcc/6.3.0 notices that the "MOV %edx, %edx" is redundant, since %rdx will be shifted left by 32 bits anyway before being OR'd with %rax. That 32-bit left shift is guaranteed to eliminate anything that might have been in the high-order bits. (The "MOV %eax, %eax" is retained in the binary produced by gcc/6.3.0.)

Earlier versions of gcc used to do a different trick -- they deleted the clearing of the high-order bits, the shifting of %edx, and the OR'ing of the results (before executing a 64-bit store) and simply executed a 32-bit store of %eax followed by a 32-bit store of %edx. Very sneaky.
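
The two code-generation strategies described above can be sketched in plain C. This is only an illustration: the function names are hypothetical, the register contents are passed in as ordinary arguments, and the equivalence of the two results assumes a little-endian target such as x86.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Strategy A: shift and OR, then a single 64-bit store. */
static void store_merged(uint64_t *dst, uint64_t rax, uint64_t rdx)
{
    /* The left shift discards RDX[63:32], so only RAX needs a clean
       upper half; that is why "MOV %edx, %edx" is redundant. */
    assert((rax >> 32) == 0);
    *dst = (rdx << 32) | rax;
}

/* Strategy B: two 32-bit stores, low half at the lower address. */
static void store_halves(uint64_t *dst, uint32_t eax, uint32_t edx)
{
    unsigned char *p = (unsigned char *)dst;
    memcpy(p, &eax, sizeof eax);      /* bytes 0..3 */
    memcpy(p + 4, &edx, sizeof edx);  /* bytes 4..7 */
}
```

Strategy B never looks at the upper halves of the 64-bit registers at all, which is why no clearing, shifting, or OR'ing is needed in that version.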

MarkC_Intel · ‎06-20-2018

Hi John, for the first post, we are working on updating the SDM. Thanks for pointing out the ambiguity.

For the 2nd post, as far as the ICC goes, I spoke w/one of our developers and he said RDPMC is assumed to write 32b values in the implicit regs. So we are not sure where that zeroing idiom you cite is coming from. He suggests it might be from inline asm constraints specified by user program. Guess we'd need more information if you want to pursue that further.

From an architecture perspective, whenever the processor writes 32b values into a general purpose register, implicit or explicit, the upper 32b are zeroed.
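
The zero-extension rule for an explicit 32-bit write can be checked with a small experiment. The sketch below assumes GCC or Clang extended inline assembly on x86-64; the function name `write_low32` is hypothetical, and on other targets the code only models the rule rather than testing it.

```c
#include <assert.h>
#include <stdint.h>

/* Write a 32-bit value into a register that currently holds `reg`
   and return the resulting 64-bit register contents. */
static uint64_t write_low32(uint64_t reg, uint32_t value)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    /* %k0 names the 32-bit form of the register holding reg (e.g. %eax). */
    __asm__("movl %1, %k0" : "+r"(reg) : "r"(value));
#else
    /* Fallback model of the architectural rule for other targets. */
    reg = (uint64_t)value;
#endif
    assert((reg >> 32) == 0);  /* upper 32 bits are expected to be zero */
    return reg;
}
```

For example, `write_low32(0xFFFFFFFFFFFFFFFF, 1)` is expected to return 1, not 0xFFFFFFFF00000001. This only exercises an explicit destination; the statement above is what extends the same rule to implicit outputs such as those of RDPMC.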

McCalpinJohn · ‎06-20-2018

Thanks, Mark!

I am still trying to understand where the extra cruft is being added to the assembly code. Right now my code only does the RDPMC instruction in a single-instruction inline assembly macro, with the results handled in C code. I will try to put the post-processing of %eax and %edx into the assembly code and see if that helps the compiler eliminate the extra instructions. The overhead in cycles is clearly dominated by the microcoded RDPMC instruction, but I would like to reduce the instruction count overhead while I am working on this code....

McCalpinJohn · ‎06-20-2018

I managed to get a working version of the inline assembly that does an RDTSCP, the shift, and the OR in a single block. This eliminates the extraneous MOV instructions used for clearing the upper 32 bits, but (as I expected) does not make very much difference in the overhead -- the average is about 1 cycle faster (~20.5 cycles vs ~21.5 cycles). The RDPMC instructions are a bit faster (on SKX), with an average overhead of about 19 cycles. There is little point in worrying about overheads at this level -- OOO effects muddy the issues much more than overhead at these fine scales....

McCalpinJohn · ‎06-22-2018

Now that I know that RDPMC is guaranteed to generate "clean" 64-bit results in RAX and RDX, I changed the variables I was using in the surrounding C code from uint32_t to uint64_t. This allowed me to eliminate the casts to uint64_t, which turned out to be the source of the "movl %eax,%eax" instructions.
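
A minimal sketch of what such a wrapper might look like, assuming GCC or Clang extended inline assembly on x86-64. The names `merge_counter` and `read_pmc` are hypothetical, and this is not the actual code discussed in this thread. Executing RDPMC in user mode also requires that the OS has enabled it (CR4.PCE); otherwise the instruction faults.

```c
#include <assert.h>
#include <stdint.h>

/* Merge the two halves. With uint64_t operands no cast is needed,
   so the compiler has no reason to emit "movl %eax,%eax". */
static inline uint64_t merge_counter(uint64_t lo, uint64_t hi)
{
    assert((lo >> 32) == 0);  /* relies on the upper half of RAX being zero */
    return (hi << 32) | lo;
}

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
/* Hypothetical wrapper: binds the outputs to RAX and RDX as 64-bit
   variables and passes the counter number in ECX. */
static inline uint64_t read_pmc(uint32_t counter)
{
    uint64_t lo, hi;
    __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(counter));
    return merge_counter(lo, hi);
}
#endif
```

Declaring `lo` and `hi` as uint32_t instead would force a widening conversion before the shift, and that conversion is where the extra zero-extending move comes from.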

Now the assembly code is pretty tight. The Intel compiler still insists on re-loading %ecx on every loop iteration, even though it is not modified in the loop, nor is it in a clobber list. The gcc/6.3.0 compiler hoists that assignment out of the loop. The difference may or may not be measurable; I am still getting a few weird results....
