
memory I/O performance problems with Cranford & Paxville chips

todd_allen
Beginner
These questions are related to a performance problem on
Cranford and Paxville chips with a program like the
following:

static int it[100000000];

int main()
{
    int i;
    for (i = 0; i < 10000; i++) {
        __asm__ __volatile__("movl $0,%eax");
        __asm__ __volatile__("loop:");
        __asm__ __volatile__("cmpl $100000000,%eax");
        __asm__ __volatile__("jge done");
        __asm__ __volatile__("movl $0,it(%eax)");
        __asm__ __volatile__("addl $10000,%eax");
        __asm__ __volatile__("jmp loop");
        __asm__ __volatile__("done:");
    }
}

The real guts of the code are written in gcc's in-line
assembly to avoid the vagaries of different compilers and
versions. The outer loop over i is there to repeat the
operation enough times that the run is long enough to
measure easily, and so that the cache might come into play.
I didn't bother to write the outer loop in assembly because
it should add negligible overhead.

That code takes the following times on the following
processors:
Xeon 1.4GHz (Foster C0) (stepping 0F11h) 8.1s
Xeon MP 3.66GHz (Cranford A0) (stepping 0F41h) 16.0s
Dual-Core Xeon 2.66GHz (Paxville A0) (stepping 0F48h) 16.6s
I'll be concerning myself only with the Foster C0 and
Cranford A0 from here on.

Q1: (the most important) Why is the code slower on a 3.66GHz
Cranford A0 chip than on a 1.4GHz Foster C0?

Now on to my own thoughts on the subject:

The code defines a 100,000,000-element array of integers and
then writes to 10,000 of its cells, with a stride of 10,000
elements between each one.
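
In plain C, the access pattern described here would look
something like this sketch (10,000 writes, 10,000 elements
apart):

    /* 10,000 strided writes: cells 0, 10000, 20000, ... */
    int j;
    for (j = 0; j < 100000000; j += 10000) {
        it[j] = 0;
    }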

I've considered several possibilities for why this
performance degradation might occur, and my best guess is
that it's related to the L2 cache. But the math just
doesn't seem to work out. The Cranford A0 has a 1 MB 8-way
set associative sectored L2 cache with 64-byte lines (as
deduced from CPUID function 2 code 0x7c). Given the
distance between successive elements (40,000 bytes), each
element would occupy its own cache line. There are
1024*1024/64 = 16384 cache lines available in the L2 cache.
But the above code is writing to only 10,000 of them. That
means it's using only about 60% of the cache.
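
(For anyone wanting to reproduce that deduction, something
like the following quick sketch will dump the CPUID function
2 descriptor bytes; the meaning of each byte, e.g. 0x7c, has
to be looked up in the CPUID documentation.)

    /* Dump the CPUID function 2 cache/TLB descriptor bytes. */
    #include <stdio.h>

    int main(void)
    {
        unsigned int a, b, c, d;
        __asm__ __volatile__("cpuid"
                             : "=a"(a), "=b"(b), "=c"(c), "=d"(d)
                             : "a"(2));
        /* The low byte of EAX is the number of times CPUID must
           be executed; the other non-zero bytes are descriptors. */
        printf("eax=%08x ebx=%08x ecx=%08x edx=%08x\n", a, b, c, d);
        return 0;
    }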

So, I considered that perhaps the stride prefetch or
adjacent cache line prefetch was causing additional cache
lines to be loaded when writing to each element. Because of
the large stride, those extra cache lines would be of no
value, but they would double the cache usage to 20,000 cache
lines. That would mean the cache was being overutilized, at
about 120%. So, as an experiment, I disabled both of those
features by setting the IA32_MISC_ENABLE[9] and
IA32_MISC_ENABLE[19] MSR bits. This had no effect.
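
(For reference, on Linux those bits can be inspected with
something along these lines; a sketch only, assuming the msr
driver is loaded and the program runs as root. IA32_MISC_ENABLE
is MSR 0x1A0.)

    /* Read IA32_MISC_ENABLE (MSR 0x1A0) on CPU 0 via the Linux
       msr driver and report bits 9 and 19. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        unsigned long long val;
        int fd = open("/dev/cpu/0/msr", O_RDONLY);
        if (fd < 0 || pread(fd, &val, sizeof(val), 0x1A0) != sizeof(val)) {
            perror("msr");
            return 1;
        }
        printf("IA32_MISC_ENABLE = 0x%016llx\n", val);
        printf("bit 9  (hardware prefetch disable)      = %d\n",
               (int) ((val >> 9) & 1));
        printf("bit 19 (adjacent line prefetch disable) = %d\n",
               (int) ((val >> 19) & 1));
        return 0;
    }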

But maybe the problem still is related. The IA-32 Intel
Architecture Optimization Reference Manual mentions in its
"Capacity Limits and Aliasing in Caches" section that data
in the L2 cache is loaded using the whole sector (2 cache
lines). But that doesn't account for disabling adjacent
cache line prefetch via IA32_MISC_ENABLE[19], whose
description says that only a single cache line will then be
loaded. So perhaps that comment about always loading the
whole sector was stated too generally, and applies only to
the usual case where IA32_MISC_ENABLE[19] is clear.

I have one guess that might explain the behavior seen above.
Perhaps the L2 cache always reserves whole sectors (2 cache
lines, or 128 bytes), and the IA32_MISC_ENABLE[19] bit
controls only whether both cache lines are filled. If so,
this too would result in the cache being about 120%
utilized: the program would be using 10,000 sectors, but the
cache has only 1024*1024/128 = 8192 sectors available for
reservation.

Q2: Do the sectored caches always reserve whole sectors or,
if adjacent cache line prefetch is disabled, can their
two cache lines be used to store widely-separated data?

But even if my guess is right, it isn't wholly satisfying.
The Foster C0 chip's largest cache is an L3 cache that's
512 KB, 4-way set associative, non-sectored, and with 64-byte
lines (as deduced from CPUID function 2 code 0x22). It has
only 512*1024/64 = 8192 cache lines available. So it, too,
would be overutilizing its cache, at about 120%. So,
although the overutilization of the cache on the Cranford A0
may be a performance problem, how can it account for the
difference?

Q3: Given that the Foster C0's cache is overutilized, can
the overutilization of the Cranford A0's cache (if it
even happens) be the source of the large performance
degradation?

Still, there is empirical evidence that a cache is a factor.
While keeping the stride constant, I varied the size of the
array, and thus the number of elements being written. On
the Cranford A0, as the number of elements rose from about
100 to about 3000, the time remained roughly constant. As
the number of elements rose from 3000 to 8000, there was a
very sharp spike in total time, which ended at about 15
times the original time. Then, from 8000 to 10000, the time
remained roughly constant.

On the Foster C0, there was a similar effect, but not nearly
so pronounced. From about 100 to about 2000 elements, the
time remained roughly constant. As the number of elements
rose from 2000 to about 4000, there was an increase in the
time by about 22% -- much more reasonable and expected. For
numbers of elements above 4000, it remained roughly
constant.
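
(A plain-C sketch of one pass of that sweep, with names of my
own choosing for illustration; the actual measurements used
the inline-assembly version, repeated 10,000 times as before.)

    #define STRIDE 10000   /* stride in elements, held constant */

    static int it[100000000];

    /* One pass: write nelems cells, STRIDE elements apart.
       nelems was varied from about 100 up to 10000. */
    void sweep_pass(int nelems)
    {
        int j;
        for (j = 0; j < nelems; j++)
            it[j * STRIDE] = 0;
    }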

Surely it can't be the case that the memory I/O itself is
dramatically slower on the Cranford A0?

Q4: As the number of elements being written rises, why is
there such a dramatic rise in total time on the Cranford
A0 (1400%), when it's so much less (22%) on the Foster
C0?

Now, here's a surprise. Change the above code to the
following:

static int it[100000000];

int main()
{
    int i;
    for (i = 0; i < 10000; i++) {
        __asm__ __volatile__("movl $0,%eax");
        __asm__ __volatile__("loop:");
        __asm__ __volatile__("cmpl $100000000,%eax");
        __asm__ __volatile__("jge done");
        __asm__ __volatile__("prefetchnta it(%eax)");
        __asm__ __volatile__("movl $0,it(%eax)");
        __asm__ __volatile__("addl $10000,%eax");
        __asm__ __volatile__("jmp loop");
        __asm__ __volatile__("done:");
    }
}

The change is the addition of a prefetchnta instruction
right before the mov of 0 to an array element.
Theoretically, the insertion of a prefetch of a memory
location immediately before a reference to the same memory
location ought to have no effect. Or, if anything, it might
slow down the execution very slightly because of the
additional instruction decode and maybe putting some holes
in the pipeline. And this is exactly the effect it has on
the Foster C0: a very slight degradation. But on the
Cranford A0, it has a totally different effect. The time
goes from 16.0s to 4.0s! I can think of no explanation for
this effect. But it's as if, with the prefetchnta present,
the Cranford A0 is operating as I would have expected in the
first place.

Q5: Why does the insertion of a prefetch immediately before
a write to the same location improve performance? And
why by so much?

Just to round out this topic, I did consider a few other
explanations for the degradation that didn't pan out:

Non-uniform distribution of addresses to L2 cache sets, so
that only a fraction of the sets were being used:
    No, the distribution should be almost perfectly
    uniform.
TLB:
    The stride is so large that each write is in a
    different page. So, there would be 10,000 page table
    entries in play, and that would overpower the TLB on
    both the Foster and the Cranford.
Store buffer:
    Each write would go to a different store buffer, and
    there are only a handful of them, so they would be
    overpowered, too.
Thermal throttling:
    I checked the IA32_THERM_STATUS[0] and
    IA32_THERM_STATUS[1] MSR bits to see if either had
    been turned on, but neither had.

To reiterate the questions:

Q1: (the most important) Why is the code slower on a 3.66GHz
Cranford A0 chip than on a 1.4GHz Foster C0?
Q2: Do the sectored caches always reserve whole sectors or,
if adjacent cache line prefetch is disabled, can their
two cache lines be used to store widely-separated data?
Q3: Given that the Foster C0's cache is overutilized, can
the overutilization of the Cranford A0's cache (if it
even happens) be the source of the large performance
degradation?
Q4: As the number of elements being written rises, why is
there such a dramatic rise in total time on the Cranford
A0 (1400%), when it's so much less (22%) on the Foster
C0?
Q5: Why does the insertion of a prefetch immediately before
a write to the same location improve performance? And
why by so much?

Thanks in advance.
TimP
Honored Contributor III
That's interesting, and you seem to have considered most of the evident issues. At such a large stride, the hardware prefetch is not supposed to have any effect, and only single 64-byte cache lines should be fetched, as your experiment with MSR confirmed. You are testing mainly the performance in resolving DTLB misses, which certainly is an issue for this architecture. If that's what you mean by memory I/O performance, so be it.
todd_allen
Beginner
I don't think the TLB can be the culprit. As I understand things, there are only 64 TLB entries. And with the large stride size, each write should have been to its own page and thus required its own TLB entry. That means, if the TLB were the culprit, the performance degradation should have started at about 64 elements and been full-blown at maybe 2 or 3 times that (depending on the algorithm for retiring TLB entries). And there was some degradation at those element counts. But the main problem didn't start happening until the number of elements being written was around 3000.
jim_dempsey
Beginner

As an experiment, try replacing the prefetch with a test. Or insert a movl to another register, i.e. insert a non-destructive read. What may be happening is that the pipeline tries to compute in advance of the "jge done" as long as the code is non-destructive. The movl to write will perform a read-modify-write, so the additional mov to a register will likely add no additional overhead. Your real code can use the prefetch, as this won't trash a register.

A cmpl %eax,it(%eax) could also be used; this won't clobber an additional register.
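
For concreteness, that would be your original loop with the dummy read dropped in where the prefetchnta went, something like:

    for (i = 0; i < 10000; i++) {
        __asm__ __volatile__("movl $0,%eax");
        __asm__ __volatile__("loop:");
        __asm__ __volatile__("cmpl $100000000,%eax");
        __asm__ __volatile__("jge done");
        __asm__ __volatile__("cmpl %eax,it(%eax)");  /* dummy non-destructive read */
        __asm__ __volatile__("movl $0,it(%eax)");
        __asm__ __volatile__("addl $10000,%eax");
        __asm__ __volatile__("jmp loop");
        __asm__ __volatile__("done:");
    }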

The point of the test is not to provide a workaround, but rather a better understanding of what might be happening.

Jim Dempsey

todd_allen
Beginner
Well, that was a good experiment. When I inserted the cmpl %eax,it(%eax) instruction in the same spot where I previously had inserted the prefetch, the code ran at about the same speed as when I'd inserted no extra instruction.

Unfortunately, it makes the whole prefetch mystery even more mysterious.
jim_dempsey
Beginner

That was what I expected you to see. You would not want to insert the dummy memory read (cmpl) in all instances, e.g. if you were using SSE3 128-bit instructions you wouldn't want a partial read of data you will then access in full. Try the following in your test code:

Code:

static int it[100000000];

int main()
{
    int i;
    for (i = 0; i < 10000; i++) {
        __asm__ __volatile__("movl $0,%eax");
        __asm__ __volatile__("loop:");
        __asm__ __volatile__("prefetchnta it(%eax)");
        __asm__ __volatile__("cmpl $100000000,%eax");
        __asm__ __volatile__("jge done");
        __asm__ __volatile__("movl $0,it(%eax)");
        __asm__ __volatile__("addl $10000,%eax");
        __asm__ __volatile__("jmp loop");
        __asm__ __volatile__("done:");
    }
}


Note that the prefetch is placed earlier.

Also, although at first glance you may think you can place the prefetch at the bottom (before the "jmp loop") to get an earlier prefetch, that would be counterproductive, as eax is in a state of flux pipeline-wise at that point. By placing it at the top of the loop you give the pipeline a little time to complete the addl.

Jim Dempsey


todd_allen
Beginner
Moving the prefetchnta instruction higher like that made the code run at about the same speed as having it inserted where I suggested. Or maybe something on the order of 0.1% - 0.3% faster. I didn't try very hard to pin down exactly how much faster or slower it was at that level of precision.

Can you give some idea of where you're heading with this? What are you suspecting when you propose these experiments?
jim_dempsey
Beginner
Your Question was:
Q5: Why does the insertion of a prefetch immediately before
a write to the same location improve performance? And
why by so much?
The answer appears to be related to how the execution pipeline computes ahead of the code.
The prefetch performance difference confirms that the pipeline did not perform an early
start on the read/modify/write. Note that although your code is performing write-only
accesses, the 32-bit write is narrower than the data path (128-bit) and thus necessitates
the r/m/w. Using the read or test of the same location to be written was simply a way to
confirm that the problem is likely a pipeline coding issue (performance bug).
My guess is that either the microcoder placed too high a cost weight on the
movl immediate, location(index) or, equally likely, the branch prediction logic
could not handle your loop.
To check for a branch prediction problem, recode as follows:

for (i = 0; i < 10000; i++) {
    __asm__ __volatile__("movl $0,%eax");
    __asm__ __volatile__("jmp loopEntry");
    __asm__ __volatile__("loop:");
    __asm__ __volatile__("addl $10000,%eax");
    __asm__ __volatile__("loopEntry:");
    __asm__ __volatile__("cmpl $100000000,%eax");
    __asm__ __volatile__("jge done");
    __asm__ __volatile__("movl $0,it(%eax)");
    __asm__ __volatile__("jmp loop");
    __asm__ __volatile__("done:");
}
The above is without the prefetch. If the above code without the prefetch runs faster
than your first code example without the prefetch, then this might indicate a branch
prediction problem in the microcode.
Jim Dempsey
todd_allen
Beginner
I thought the P4 branch prediction logic was supposed to maintain an internal table of branch instructions and whether each tended to be taken or not taken, and to base its decisions on that. I'm ignoring the fallback mechanism of branch-forward-not-taken, branch-backward-taken, because this is happening in a tight loop, so the branch should be in its table for all but the first iteration.

Anyway, that reworked loop runs at about the same speed as the original with no prefetches.
todd_allen
Beginner
I had a chance to try this on a Potomac C0 chip. It has an 8 MB L3 cache. Even assuming it always allocates cache lines as whole sectors of 128 bytes each, that's 65536 sectors. That should be plenty: it should have no problem holding in cache every one of the 10,000 array cells being written. The original code runs in about 4.87s on this machine. With the prefetchnta added right before the write, it runs in about 2.98s.