Using Large Pages under 64-bit Linux and the MPI stack
I was wondering if there has been any performance
comparison of the various (or any) Intel MPI messaging functions on an Intel64 platform
using large Linux pages (2MiB) vs the "standard" 4KiB pages.
There are several benefits when large pages are used, including the
reduction in TLB miss rates and the larger space for pre-fetching (2MiB
MPI stacks, especially those coupled with an RDMA-based communications stack, have shown improvements on other platforms when large pages are used to transfer large messages, as the VM operations do not get in the way as much.
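To put rough numbers on the page-count argument, here is a minimal sketch that counts how many pages back a message buffer at each page size. The helper name pages_spanned and the sample message sizes are illustrative assumptions, and the buffer is assumed to be contiguous and page-aligned. It says nothing about measured MPI performance; it only shows how many translations are in play.

```python
KIB = 1024
MIB = 1024 * KIB


def pages_spanned(nbytes, page_size):
    """Pages needed to back a contiguous, page-aligned buffer of nbytes."""
    # Ceiling division: a partially filled last page still counts as a page.
    return -(-nbytes // page_size)


if __name__ == "__main__":
    # Hypothetical message sizes, chosen only for illustration.
    for msg in (64 * KIB, 4 * MIB, 64 * MIB):
        small = pages_spanned(msg, 4 * KIB)
        large = pages_spanned(msg, 2 * MIB)
        print(f"{msg // KIB} KiB message: {small} x 4KiB pages, {large} x 2MiB pages")
```

For example, a 64 MiB buffer spans 16384 standard 4KiB pages but only 32 large 2MiB pages, so far fewer TLB entries (and far fewer pages to pin or register for RDMA) are involved.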
Is there any direction in Intel to make large pages more easily accessible to apps vs the Intel MPI stack?
You've beat this to death, and now you're grasping at straws. In the MPI applications we tested, all the benefits and penalties of huge pages came into play without tinkering with message passing. I find it hard to imagine that an IB or other RDMA-capable adapter would be designed so as to depend on huge pages to pass extremely large messages on architectures which primarily support 4KB pages. It seems a stretch to imagine that a weakness in prefetching impacts the performance of RDMA messages of >4KB. If you can persuade someone to support 64K pages on their future architecture, then it might be logical to ensure that MPI takes advantage of it.
As you noticed, I am just asking and not promoting anything. I admit I don't have much experience with large pages on Linux/Intel clusters, and that is exactly why I am looking into this topic. I do have personal experience in UNIX/RISC HPC environments where both compute-bound tasks and RDMA-capable transports gain tangible performance improvements with large pages.
It may be the case that the different h/w prefetchers can make better use of large pages on other platforms. For RDMA and user-level communications stacks, mapping large pages of messaging buffers directly into user space saves the kernel the overhead of mapping multiple tiny pages instead of one large one. At least this is what happens in the UNIX/RISC environment I am familiar with.
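As a rough illustration of what requesting a large-page messaging buffer looks like from user space on Linux, here is a minimal sketch using an anonymous mmap with MAP_HUGETLB and a fallback to ordinary pages. The function name map_message_buffer is hypothetical, the 2MiB page size is assumed, and the numeric fallback value for MAP_HUGETLB is an assumption for Linux on x86-64. This is not how Intel MPI or any particular RDMA stack allocates its buffers.

```python
import mmap
import sys

HUGE_PAGE_SIZE = 2 * 1024 * 1024  # assumed 2MiB large-page size

# Assumption: 0x40000 is the Linux x86-64 value of MAP_HUGETLB. It is used
# only when this Python's mmap module does not export the constant itself.
MAP_HUGETLB = getattr(mmap, "MAP_HUGETLB", 0x40000)


def map_message_buffer(nbytes):
    """Return (buffer, used_huge_pages) for an anonymous buffer >= nbytes."""
    # A huge-page mapping must be a whole number of huge pages long.
    length = -(-nbytes // HUGE_PAGE_SIZE) * HUGE_PAGE_SIZE
    if sys.platform.startswith("linux"):
        try:
            flags = mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS | MAP_HUGETLB
            return mmap.mmap(-1, length, flags=flags), True
        except OSError:
            # Typically means no huge pages are reserved on this system.
            pass
    # Fallback: same length, backed by standard pages.
    return mmap.mmap(-1, length), False


if __name__ == "__main__":
    buf, used_huge = map_message_buffer(1024 * 1024)
    print(f"mapped {len(buf)} bytes, huge pages: {used_huge}")
    buf.close()
```

The huge-page request only succeeds if the administrator has reserved huge pages (for example through /proc/sys/vm/nr_hugepages), which is part of why large pages are not as easily accessible to applications as one might like.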
From my questions to these forums you can see I am a newcomer to the Intel64/Linux world; however, I am not unfamiliar with processor or system-level issues. I have been installing and learning the Intel platform, s/w tools and MPI stacks on our recent iDataPlex Nehalem-EP cluster since spring 2010 (see here "http://sc.tamu.edu/systems/eos"). I do have experience with the UNIX/RISC world ("http://sc.tamu.edu/systems/hydra"), and for me it is great that I can compare these two seemingly different but actually converging approaches head-to-head at all interesting levels.
I am interested in investigating things myself, making comparisons and drawing even-handed conclusions.
I am glad that you are replying to my questions because I feel you have a deeper exposure to Intel platform internals than most people. I am glad because my objective is to stir up intelligent discussion and get some answers not available in Intel's documentation.
Believe me, I have spent a lot of effort to consolidate and correlate relevant information in one coherent place for both Intel64/Nehalem/IB and UNIX/Power5+/HPS.
As a matter of fact, this lack of a single place where the detailed system operations and the COST of operations are presented is very FRUSTRATING to people who try to squeeze the most out of the platform. Please pass this comment on to the Intel people.
Please keep responding to my questions with the usual insightful content. Also feel free to inquire about the equivalent things in the UNIX/RISC world so we can have a mutually useful and informative exchange.
And YES, I would love to see 64KiB pages supported, but I am not sure how much would need to be retrofitted into the VM/TLB design on Intel64.