We develop cache-aware iterative solvers. This class of algorithms is mainly limited by memory access. With 3D problems we face an increase in data TLB misses. One possible solution seemed to be the use of large (4M) pages.
We observed several strange effects when switching to large pages. Bottom line: D-TLB misses and L2 data cache misses dropped, but runtime roughly doubled.
To pin down the problem I did two tests: 1) I have several assembler implementations of memcpy. One uses software prefetching (a mov into a dummy register). The version with hand-coded prefetching showed exactly the same performance for 4K and 4M pages, while the standard version showed a large slowdown.
2) I checked the vector triad with an SSE2 assembler implementation, so in this case prefetching should play no role. With 4M pages I get roughly half the performance of 4K pages.
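For illustration, the "hand prefetching" variant of the first test can be sketched in C. This is only an approximation of the assembler version: the name memcpy_swpf, the 64-byte chunking, the 256-byte prefetch distance, and the use of GCC's __builtin_prefetch in place of the dummy mov are my assumptions.

```c
#include <stddef.h>
#include <string.h>

/* Sketch of a memcpy with software prefetching. GCC's
 * __builtin_prefetch stands in for the hand-coded "mov into a
 * dummy register" used in the assembler version; distances and
 * chunk size are illustrative, not the original's. */
static void *memcpy_swpf(void *dst, const void *src, size_t n)
{
    char *d = dst;
    const char *s = src;
    size_t i;

    for (i = 0; i + 64 <= n; i += 64) {
        /* Touch data a few cache lines ahead of the copy. */
        __builtin_prefetch(s + i + 256, 0 /* read */, 0 /* low locality */);
        memcpy(d + i, s + i, 64);
    }
    if (i < n)  /* remaining tail */
        memcpy(d + i, s + i, n - i);
    return dst;
}
```

With software prefetching the loads are issued explicitly, which would explain why this variant is insensitive to whatever the hardware prefetcher does (or does not do) with large pages.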
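The vector triad of the second test, a[i] = b[i] + c[i] * d[i], might be sketched with SSE2 intrinsics roughly as follows. This is a simplification of the assembler kernel; the plain store (instead of a non-temporal movntpd, which a streaming benchmark would likely use) is my choice to keep the sketch simple.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Sketch of the vector triad a = b + c * d with SSE2.
 * Assumes n is a multiple of 2 and 16-byte aligned arrays. */
static void triad_sse2(double *a, const double *b,
                       const double *c, const double *d, size_t n)
{
    for (size_t i = 0; i < n; i += 2) {
        __m128d vb = _mm_load_pd(b + i);
        __m128d vc = _mm_load_pd(c + i);
        __m128d vd = _mm_load_pd(d + i);
        /* A streaming kernel would use _mm_stream_pd (movntpd);
         * a plain store keeps this sketch self-contained. */
        _mm_store_pd(a + i, _mm_add_pd(vb, _mm_mul_pd(vc, vd)));
    }
}
```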
I have two questions:
* Is the hardware prefetching disabled for large pages?
* Is there any issue with SSE2 instructions and large pages?
Is there any other point I didn't recognize?
Thanks in advance for your help,
PS: Just for completeness: the codes are exactly the same for both page sizes. We use the mmap call to allocate memory from a hugetlbfs on Linux. To use it, we override malloc and LD_PRELOAD the implementing library.
The operating system is Linux with a 2.6.5 kernel. As all benchmarks are written in assembler, the compiler is no issue, but these effects can also be seen with C code.
I'll try to be brief, as my previous replies have been deleted. I think a page size of 16K or 64K would have a better chance of helping with the concerns you expressed. A 4M page size appears to improve performance only for large-stride access within arrays of over 10MB, at the expense of other factors. You raise an interesting point about hardware prefetch: if the P4 hardware prefetcher's reach is limited to 4K, maybe the hoped-for advantage of larger pages is defeated. I suspect there will be more need for TLB miss mitigation on 64-bit OSes and with larger caches. Wouldn't those systems be more interesting in the future for solving problems such as you mention?
We have large strides. Consider a typical 3D simulation: the 3D arrays used are typically 256x256x256 points, which is around 260MB per array. Stencil-based codes on regular grids access [i±1][j±1][k±1], so every point update touches five different pages, and each point is updated several times. As you may agree, this produces lots of TLB misses.
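A quick page-number calculation illustrates the point. This is a sketch, not our production code, and it assumes 8-byte elements and C row-major a[i][j][k] layout: the i±1 neighbours lie a full plane (512KB) away, so they are guaranteed to sit on different 4K pages, while (away from large-page boundaries) they still fall inside the same 4M page.

```c
#define N 256  /* grid points per dimension, as in the example above */

/* Page number of element a[i][j][k] for a given page size,
 * assuming row-major layout and 8-byte elements. */
static long page_of(long i, long j, long k, long page_size)
{
    long byte_off = ((i * N + j) * N + k) * 8L;
    return byte_off / page_size;
}
```

So with 4K pages every i-neighbour access costs a separate TLB entry per line, while a single 4M page covers eight full i-planes of this grid.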
You speak of the expense of other factors. As I use large pages only for the large arrays, memory fragmentation should be no issue, especially with multiple page size support, as present on Linux. Is there any other factor I'm missing?
So using large pages is sensible and should increase performance.
You didn't answer my questions. The behaviour of the P4 CPU with large pages is not documented. There are some words about 4K restrictions in the optimization handbook, but they are never explicitly related to large pages. Do you think the prefetcher's 4K page boundary limit causes the problem? It would be interesting to try this on the Prescott, as its prefetcher is no longer limited to the 4K boundary.
To answer your last question: TLB misses are a problem in scientific applications now, so this is independent of 64-bit. Whether the 64-bit address space is necessary depends on the problem, but of course there are many problems that are limited by the amount of memory available.
Message Edited by moebiusband on 11-04-2004 06:01 AM
Your stencil-based code should have no difficulty with the DTLB if your inner loop runs over the stride-1 subscript, as it must for any opportunity of vectorization. There should be no problem with having 5 pages active if the same 5 pages stay in use over a large number of inner loop iterations. If you must loop over the largest-stride subscript, you will certainly have DTLB miss problems at any reasonable page size. For looping over the middle subscript I would not be surprised to see trouble, but would hope something could be done with page size. I agree that we have important unanswered questions. I have some evidence it may not be independent of the 64-bit OS; it may become a more important problem there. I don't disagree that it may already be a problem on a 32-bit OS.
So the stride is one for every stream, and I need 7 streams altogether. In the actual update I access 5 different pages, and these 5 pages stay the same for about one line of the 3D grid. There may be around 65536 lines, so every new line touches 5 new pages. This gets even worse if you apply loop blocking techniques, where the data paths become more complicated and the number of pages you access increases.
The above numbers are only a guess. I have performance counter numbers for 2D, where TLB misses decrease by a factor of around 800 when using large pages.
Also, I still want to show that large pages can give you an advantage. I did preliminary experiments with the in-cache vector triad. With 4K pages you can see a smear-out of the performance for sizes larger than 256K: performance does not drop sharply when you fall out of cache, as you might expect, but degrades earlier. This is clearly caused by the small D-TLB. With large pages, performance stays constant over the whole cache size, unfortunately at a lower level.
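The 256K threshold fits the DTLB reach. Assuming a 64-entry data TLB (typical of the P4; an assumption, check your exact model), 4K pages cover just 64 x 4K = 256K, exactly where the smear-out begins, while the same number of 4M entries would cover 256M:

```c
/* TLB reach = number of entries x page size. With an assumed
 * 64-entry data TLB, 4K pages cover 256K of address space
 * (matching the observed smear-out point), 4M pages cover 256M. */
static long tlb_reach(long entries, long page_bytes)
{
    return entries * page_bytes;
}
```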
Anyway, thank you for your help.
If you are interested, I can give you the results of our tests, this time backed by measurements, once we have finished.