Compile the following program with 'icc -xAVX -std=c99'
[cpp]#define N 20000000
#define M 200000
double a[N / M], b[N / M], c[N / M];
int main()
{
    /* M passes over arrays of N / M elements: N multiply-add triplets in total */
    for (int j = 0; j < M; j++)
        for (int i = 0; i < N / M; i++)
            c[i] += a[i] * b[i];
}[/cpp]
and measure the SIMD_FP_256.PACKED_DOUBLE event with 'perf stat -r 100 -e r211' on Sandy Bridge. In theory, the event should count 10,000,000 (2 FLOP/triplet * 20,000,000 triplets / (4 FLOP/instruction) = 10,000,000 instructions). However, the counts I actually measure vary widely with the value of M. When M = 200,000, the count is very close to 10,000,000; when M = 5, however, it rises as high as 16,800,000, which is 68% above the expected value. How can I remove this variation?
The floating point performance counter events on Sandy Bridge cores are incremented every time the instruction(s) are issued, not just when the instruction(s) are retired. Unfortunately for FLOP counters, if one of the operands is from memory and is not in the cache, the instruction will be re-issued multiple times until it finds the data in the cache.
It might be possible to get closer to the desired values by hacking the assembly code so that explicit "MOVE" (load) instructions bring the input data into registers and the floating-point instructions use only register inputs. My first attempt at this failed, but that might have been because I did not use independent registers for the MOVE instructions.
I.e., I changed code like
vmulpd a(offset),%ymm2,%ymm1
into
vmovupd a(offset), %ymm1
vmulpd %ymm1,%ymm2,%ymm1
I can imagine that the assembler or the hardware might have merged these, or that the hardware might be re-issuing the FP operations anyway, even though it is the preceding load instruction that is actually stalled waiting on memory.
In reply to John D. McCalpin:
If the events are counted at issue time rather than at retirement, then there is probably no fundamental solution to the problem. Perhaps it is better not to measure FLOPS at all. Ivy Bridge and Haswell seemingly do not support any FLOP-related events.