I am currently writing a C++ class which measures performance using the RDPMC instruction.
Everything works as expected, but I noticed in the manual that some processors support a "fast" mode of the RDPMC instruction, which reads only the lower 32 bits of the counter. When I try to use it on my machine (i.e. by changing the value in ECX), the code segfaults.
This mode is supported on processors with 40-bit counters, whereas the counters on my machine are 48 bits wide. The model name of my processor is "Intel(R) Xeon(R) CPU W3580".
I was wondering whether there is an equivalent of this "fast" mode for other processors and, if not, whether it is possible to reduce the number of cycles this instruction takes (currently ~30 cycles).
The "fast mode" for the RDPMC instruction is only relevant to some old processors -- Pentium 4 era, if I recall correctly.
The ~30 cycles required for the RDPMC instruction appear to be a minimum for the microcode, which executes about 35 uops for each RDPMC instruction.
Because the RDPMC instruction is not ordered with respect to surrounding instructions, it is not clear that making it execute in fewer cycles would result in more accurate measurements.
On the other hand, I typically read 11 performance counters at a time (8 programmable plus 3 fixed-function), which starts getting into non-trivial elapsed time -- about 128 ns at 3 GHz -- so I would certainly not mind if the instruction were faster (or if there were an option to read multiple counters with a single instruction).
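For reference, the 128 ns figure can be reproduced with a short calculation. This is only a sketch: the function name `rdpmc_batch_ns` is made up for illustration, and the value of roughly 35 cycles per read is an assumption inferred from the numbers quoted above (11 reads, 3 GHz, about 128 ns), not a measured result.

```cpp
#include <cassert>

// Estimated elapsed time, in nanoseconds, for `reads` back-to-back RDPMC
// reads. `cycles_per_read` is an assumed per-instruction cost; `clock_ghz`
// is the core clock, i.e. the number of cycles per nanosecond.
inline double rdpmc_batch_ns(int reads, double cycles_per_read, double clock_ghz) {
    assert(reads >= 0 && cycles_per_read >= 0.0 && clock_ghz > 0.0);
    const double total_cycles = reads * cycles_per_read;
    return total_cycles / clock_ghz;  // cycles / (cycles per ns) = ns
}
```

With 11 reads at an assumed 35 cycles each on a 3 GHz clock this gives 385 cycles, or a little over 128 ns. Using the ~30 cycles from the original question instead gives 110 ns.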
Hi John --
Thanks for your warning. Like Georgi, I was tricked by the Intel Reference Manual into thinking that this mode still existed for current Xeon processors:
The Pentium 4 and Intel Xeon processors also support “fast” (32-bit) and “slow” (40-bit) reads on the first 18 performance counters. Selected this option using ECX. If bit 31 is set, RDPMC reads only the low 32 bits of the selected performance counter. If bit 31 is clear, all 40 bits are read. A 32-bit result is returned in EAX and EDX is set to 0. A 32-bit read executes faster on Pentium 4 processors and Intel Xeon processors than a full 40-bit read.
Perhaps someone from Intel could see about making this passage in the manual less of a trap for the unwary?
Yup -- the word "Xeon" is somewhat overloaded in the documentation.... In this case the giveaway is the reference to "the first 18 performance counters", since none of the more recent processors have this many....
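To make the ECX encoding discussed in this thread concrete, here is a minimal C++ sketch. The helper names (`rdpmc_selector`, `combine_counter`, `read_pmc`) are hypothetical. Bit 31 is the "fast" flag from the manual passage quoted above. Using bit 30 to select the fixed-function counters is how I understand the newer processors to work, so treat it as an assumption and check the manual for your model. The inline assembly assumes GCC or Clang on x86.

```cpp
#include <cassert>
#include <cstdint>

// ECX bit 31: "fast" 32-bit read, per the quoted manual text (Pentium 4 era).
constexpr std::uint32_t kFastReadBit = 1u << 31;
// ECX bit 30: fixed-function counter select (assumption, see lead-in).
constexpr std::uint32_t kFixedFunctionBit = 1u << 30;

// Build the value to place in ECX before executing RDPMC.
constexpr std::uint32_t rdpmc_selector(std::uint32_t index,
                                       bool fixed_function = false,
                                       bool fast = false) {
    return index | (fixed_function ? kFixedFunctionBit : 0u)
                 | (fast ? kFastReadBit : 0u);
}

// Combine EDX:EAX into one value and keep only the low `width_bits` bits
// (40 on the old parts in the quote, 48 on the W3580 from the question).
constexpr std::uint64_t combine_counter(std::uint32_t eax, std::uint32_t edx,
                                        unsigned width_bits) {
    return ((static_cast<std::uint64_t>(edx) << 32) | eax) &
           (width_bits >= 64 ? ~0ull : ((1ull << width_bits) - 1));
}

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
// Executes RDPMC. In user mode this faults (seen as a segfault) unless the
// OS has enabled user-mode RDPMC and the selector is valid on this processor.
inline std::uint64_t read_pmc(std::uint32_t selector, unsigned width_bits = 48) {
    assert(width_bits >= 32 && width_bits <= 64);
    std::uint32_t lo, hi;
    __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(selector));
    return combine_counter(lo, hi, width_bits);
}
#endif
```

On a processor without the fast mode, a selector with bit 31 set does not name a valid counter, which would explain the segfault described in the original question.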