Intel® ISA Extensions

Understanding my Benchmarks

Matthias_Kretz
New Contributor I
Hi,
I wrote a benchmark to compare the possible speedup of SSE vs. scalar execution, but I don't understand the results I get.
The following loop:
loop:
movaps 0x10(%rax),%xmm1
cmpltps %xmm1,%xmm0
movaps 0x20(%rax),%xmm0
cmpltps %xmm0,%xmm1
movaps 0x30(%rax),%xmm1
cmpltps %xmm1,%xmm0
add $0x40,%rax
movaps (%rax),%xmm0
cmpltps %xmm0,%xmm1
cmp %rax,%rbx
ja loop
appears to require ~2 cycles per movaps+cmpltps pair (8 cycles per iteration) on a Nehalem processor. (The buffer it iterates over is smaller than the L1 cache.)

The generated code for the scalar case looks like this:
loop:
movss 0x4(%rax),%xmm1
ucomiss %xmm0,%xmm1
seta %dl
movss 0x8(%rax),%xmm0
ucomiss %xmm1,%xmm0
seta %dl
movss 0xc(%rax),%xmm1
ucomiss %xmm0,%xmm1
seta %dl
add $0x10,%rax
movss (%rax),%xmm0
ucomiss %xmm1,%xmm0
seta %dl
cmp %rax,%rbx
ja loop
This requires ~1.33 cycles per ucomiss (i.e. 5.33 cycles per iteration) on the same processor. (Same memory size, too.)

The result is that to compare N floats with SSE I need N/2 cycles. Without SSE I need 1.33*N cycles. That's a speedup factor of 2.66. I expected something closer to 4 than that...

Now I'm trying to understand where this comes from:
1. The cmpps result is not used, therefore only throughput should count, i.e. I can execute one cmpps per cycle. Do the movaps account for the second cycle? Could the movaps execute in parallel with cmpps if they used different registers?
2. The ucomiss instruction has a latency of 1 cycle. The result of seta is not used, therefore that instruction can run in parallel with everything else. The movss can execute in parallel with the previous ucomiss and seta. So ~1.33 cycles looks plausible, but I can't fully explain where that number comes from.
Question: Does the second seta have to wait for the first one to retire because it writes to the same register?

Anybody that can help me to understand instruction level parallelism better?
5 Replies
Tal_U_Intel
Employee
Quoting - Matthias Kretz

Hello,

Regarding the cycle counts you've mentioned above: did you measure them on an Intel microarchitecture (codename Nehalem) processor, or are they theoretical estimates?

Thanks,
Tal


Matthias_Kretz
New Contributor I

Quoting - Tal_U_Intel
Did you measure the cycle counts you've mentioned above on an Intel microarchitecture (codename Nehalem) processor, or are they theoretical estimates?

They are measured using the rdtsc instruction on a Xeon E5520. I would like to be able to understand the theory, though.
In the meantime I found out that the CPU seems to be limited to 64 bits/cycle of load throughput from L1, which is why the SSE compares cannot execute in one cycle (verified by executing only loads). After reading the Intel docs this is not what I had expected: I understand from the Intel Optimization Reference Manual that the Nehalem architecture can do one 128-bit load plus one 128-bit store per cycle.

Tal_U_Intel
Employee
Quoting - Matthias Kretz

They are measured using the rdtsc instruction on a Xeon E5520. I would like to be able to understand the theory, though.
In the meantime I found out that the CPU seems to be limited to 64 bits/cycle of load throughput from L1, which is why the SSE compares cannot execute in one cycle (verified by executing only loads). After reading the Intel docs this is not what I had expected: I understand from the Intel Optimization Reference Manual that the Nehalem architecture can do one 128-bit load plus one 128-bit store per cycle.


You can use the Intel Architecture Code Analyzer to analyze the two blocks you posted above.
Use -arch nehalem to set the analyzer to Intel microarchitecture (codename Nehalem) analysis.

I hope the analyzer will help you understand the theoretical performance analysis.

I'm not sure why you got these results on the Xeon E5520, but according to the analysis I'd expect both code blocks to have a throughput of 4 cycles per iteration on an Intel microarchitecture (codename Nehalem) processor.

Tal


Max_L
Employee
Matthias - hi,

L1 access is 128-bit load + 128-bit store per cycle since Core2, including Nehalem.
I'm measuring your SSE loop taking exactly 4 cycles/iteration on sub-L1 size data-set, as expected.
Be careful using rdtsc with power management or Turbo mode turned on: RDTSC does not reflect the actual CPU frequency (and thus does not measure cycles precisely), but instead counts bus cycles multiplied by a factory-fused nominal bus multiplier.
What most likely happens is that you are not measuring long enough (<1 sec) to "warm up" the CPU out of the lower-frequency P-state it resides in (~2x lower frequency is normal, and would explain your results) into nominal or Turbo frequency (you don't want Turbo when measuring cycles either), while rdtsc counts as if the frequency were nominal.
So at the least you need to disable EIST and Turbo in the BIOS, and try disabling the power-management features of your OS too. You also need to serialize the execution stream, e.g. with XOR EAX, EAX; CPUID before RDTSC.

Thank you,
-Max

Matthias_Kretz
New Contributor I

Thanks,

I am getting saner results now. I think the biggest problem was that the test loop was too small, so only ~1000 cycles elapsed between un-serialized rdtsc calls. I fixed all of that and changed the BIOS settings on the machine to make it more useful for benchmarking.