We develop cache-aware iterative solvers. This class of algorithms is mainly limited by memory access. With 3D problems we face an increase in data TLB misses. One possible solution seemed to be the use of large (4M) pages.
We observed several strange effects when switching to large pages. Bottom line: D-TLB misses and L2 data cache misses dropped, but runtime roughly doubled.
To pin down the problem I did two tests: 1) I have several assembler implementations of memcpy. One uses software prefetching (a mov into a dummy register). The version with hand-coded prefetching showed exactly the same performance for 4K and 4M pages, while the standard version showed a large slowdown.
2) I checked the vector triad with an SSE2 assembler implementation, so in this case prefetching should play no role. With 4M pages I get roughly half the performance of 4K pages.
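For illustration, the "hand prefetching" variant of the first test can be sketched in C. This is only an approximation of the assembler version: the name memcpy_swpf, the 64-byte chunking, the 256-byte prefetch distance, and the use of GCC's __builtin_prefetch in place of the dummy mov are my assumptions.

```c
#include <stddef.h>
#include <string.h>

/* Sketch of a memcpy with software prefetching. GCC's
 * __builtin_prefetch stands in for the hand-coded "mov into a
 * dummy register" used in the assembler version; distances and
 * chunk size are illustrative, not the original's. */
static void *memcpy_swpf(void *dst, const void *src, size_t n)
{
    char *d = dst;
    const char *s = src;
    size_t i;

    for (i = 0; i + 64 <= n; i += 64) {
        /* Touch data a few cache lines ahead of the copy. */
        __builtin_prefetch(s + i + 256, 0 /* read */, 0 /* low locality */);
        memcpy(d + i, s + i, 64);
    }
    if (i < n)  /* remaining tail */
        memcpy(d + i, s + i, n - i);
    return dst;
}
```

With software prefetching the loads are issued explicitly, which would explain why this variant is insensitive to whatever the hardware prefetcher does (or does not do) with large pages.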
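The vector triad of the second test, a[i] = b[i] + c[i] * d[i], might be sketched with SSE2 intrinsics roughly as follows. This is a simplification of the assembler kernel; the plain store (instead of a non-temporal movntpd, which a streaming benchmark would likely use) is my choice to keep the sketch simple.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Sketch of the vector triad a = b + c * d with SSE2.
 * Assumes n is a multiple of 2 and 16-byte aligned arrays. */
static void triad_sse2(double *a, const double *b,
                       const double *c, const double *d, size_t n)
{
    for (size_t i = 0; i < n; i += 2) {
        __m128d vb = _mm_load_pd(b + i);
        __m128d vc = _mm_load_pd(c + i);
        __m128d vd = _mm_load_pd(d + i);
        /* A streaming kernel would use _mm_stream_pd (movntpd);
         * a plain store keeps this sketch self-contained. */
        _mm_store_pd(a + i, _mm_add_pd(vb, _mm_mul_pd(vc, vd)));
    }
}
```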
I have two questions:
* Is the hardware prefetching disabled for large pages?
* Is there any issue with SSE2 instructions and large pages?
Is there any other point I didn't recognize?
Thanks in advance for your help,
PS: Just for completeness: the codes are exactly the same for both page sizes. We use the mmap call to allocate memory from a hugetlbfs on Linux. To use it, we override malloc and LD_PRELOAD the implementing library.
The operating system is Linux with a 2.6.5 kernel. As all benchmarks are written in assembler, the compiler is no issue, but these effects can also be seen with C code.
I'll try to be brief, as my previous replies have been deleted. I think a page size of 16K or 64K would have a better chance of helping with the concerns you expressed. A 4M page size appears to improve performance only for large-stride access within arrays of over 10MB, at the expense of other factors. You raise an interesting point about hardware prefetch: if the P4 hardware prefetcher's reach is limited to 4K, maybe the hoped-for advantage of larger pages is defeated. I suspect there will be more need for TLB miss mitigation on 64-bit OSes and with larger caches. Wouldn't those systems be more interesting in the future for solving problems such as you mention?
We have large strides. Consider a typical 3D simulation: the 3D arrays used are typically 256x256x256 points, which is around 260MB per array. Stencil-based codes on regular grids access [i±1][j±1][k±1], so every point update touches five different pages, and each point is updated several times. As you may agree, this produces lots of TLB misses.
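A quick page-number calculation illustrates the point. This is a sketch, not our production code, and it assumes 8-byte elements and C row-major a[i][j][k] layout: the i±1 neighbours lie a full plane (512KB) away, so they are guaranteed to sit on different 4K pages, while (away from large-page boundaries) they still fall inside the same 4M page.

```c
#define N 256  /* grid points per dimension, as in the example above */

/* Page number of element a[i][j][k] for a given page size,
 * assuming row-major layout and 8-byte elements. */
static long page_of(long i, long j, long k, long page_size)
{
    long byte_off = ((i * N + j) * N + k) * 8L;
    return byte_off / page_size;
}
```

So with 4K pages every i-neighbour access costs a separate TLB entry per line, while a single 4M page covers eight full i-planes of this grid.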
You speak of the expense of other factors. As I use large pages only for the large arrays, memory fragmentation should be no issue, especially with multiple page size support, as present on Linux. Is there any other factor I'm missing?
So using large pages is sensible and should increase performance.
You didn't answer my questions. The behaviour of the P4 CPU with large pages is not documented. There are some words about 4K restrictions in the optimization handbook, but they are never explicitly related to large pages. Do you think the prefetcher's 4K page boundary limit causes the problem? It would be interesting to try this on the Prescott, as its prefetcher is no longer limited to the 4K boundary.
To answer your last question: TLB misses are a problem in scientific applications now, so this is independent of 64-bit. Whether the 64-bit address space is necessary depends on the problem, but of course there are many problems that are limited by the amount of memory available.
Message Edited by moebiusband on 11-04-2004 06:01 AM
Your stencil-based code should have no difficulty with the DTLB if your inner loop runs over the stride-1 subscript, as it must for any opportunity of vectorization. There should be no problem with having 5 pages active if the same 5 pages stay in use over a large number of inner loop iterations. If you must loop over the largest-stride subscript, you will certainly have DTLB miss problems at any reasonable page size. For looping over the middle subscript I would not be surprised to see trouble, but would hope something could be done with page size. I agree that we have important unanswered questions. I have some evidence it may not be independent of the 64-bit OS; it may become a more important problem there. I don't disagree that it may already be a problem on a 32-bit OS.
So the stride is one for every stream, and I need 7 streams altogether. In the actual update I access 5 different pages, and these 5 pages stay the same for about one line of the 3D grid. There may be around 65536 lines, so every new line touches 5 new pages. This gets even worse if you apply loop blocking techniques, where the data paths become more complicated and the number of pages you access increases.
The above numbers are only a guess. I have performance counter numbers for 2D, where TLB misses decrease by a factor of around 800 when using large pages.
Also, I still want to show that large pages can give you an advantage. I did preliminary experiments with the in-cache vector triad. With 4K pages you can see a smear-out of the performance for sizes larger than 256K: performance does not drop sharply when you fall out of cache, as you might expect, but degrades earlier. This is clearly caused by the small D-TLB. With large pages, performance stays constant over the whole cache size, unfortunately at a lower level.
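The 256K threshold fits the DTLB reach. Assuming a 64-entry data TLB (typical of the P4; an assumption, check your exact model), 4K pages cover just 64 x 4K = 256K, exactly where the smear-out begins, while the same number of 4M entries would cover 256M:

```c
/* TLB reach = number of entries x page size. With an assumed
 * 64-entry data TLB, 4K pages cover 256K of address space
 * (matching the observed smear-out point), 4M pages cover 256M. */
static long tlb_reach(long entries, long page_bytes)
{
    return entries * page_bytes;
}
```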
Anyway, thank you for your help.
If you are interested, I can give you the results of our tests, this time backed by measurements, once we have finished.