Fluctuating FLOP count on Sandy Bridge

styc · ‎03-05-2013

Compile the following program with 'icc -xAVX -std=c99'

[cpp]#define N 20000000
#define M 200000
double a[N / M], b[N / M], c[N / M];
int main()
{
    /* M passes over N / M elements: N multiply-add triplets in total */
    for (int j = 0; j < M; j++)
        for (int i = 0; i < N / M; i++)
            c[i] += a[i] * b[i];
}[/cpp]

and measure the SIMD_FP_256.PACKED_DOUBLE event with 'perf stat -r 100 -e r211' on Sandy Bridge. In theory, there should be 10,000,000 counts of that event (2 FLOP/triplet * 20,000,000 triplets / (4 FLOP/instruction) = 10,000,000 instructions). But the actual numbers I get fluctuate over a wide range depending on the value of M. When M = 200,000, the count is very close to 10,000,000; when M = 5, however, it gets as high as 16,800,000, which is 68% larger than expected. How can I remove this fluctuation?

McCalpinJohn · ‎03-27-2013

The floating point performance counter events on Sandy Bridge cores are incremented every time the instruction(s) are issued, not just when the instruction(s) are retired. Unfortunately for FLOP counters, if one of the operands is from memory and is not in the cache, the instruction will be re-issued multiple times until it finds the data in the cache.

It might be possible to get closer to the desired values by hacking the assembly code to use "MOVE" instructions to load the input data into registers, and then only use register inputs to the floating-point instructions. My first attempt at this failed, but that might have been because I did not use independent registers for the MOVE instructions.

I.e., I changed code like

vmulpd a(offset),%ymm2,%ymm1

into

vmovupd a(offset), %ymm1

vmulpd %ymm1,%ymm2,%ymm1

I can imagine that the assembler might have merged these or the hardware might have merged these, or the hardware might be re-issuing the FP operations anyway, even though it is the preceding load instruction that is actually stalled waiting on memory.

styc · ‎03-29-2013

If the events are counted at issue time rather than at retire time, then there is probably no fundamental solution to the problem. Perhaps it is better not to measure FLOPS at all. Ivy Bridge and Haswell seemingly do not support any FLOP-related events.

Bernard · ‎03-30-2013

I do not know the exact physical implementation of the YMMn registers, but at the hardware level one architectural YMM register can be mapped to more than one physical register (location) through register renaming, so using independent registers may or may not be the issue.