I was wondering if there has been any performance comparison of the various (or any) MKL functions on an Intel64 platform using large Linux pages (2MiB) vs the "standard" 4KiB pages.
There are several benefits when large pages are used, including a reduction in TLB miss rates and a larger contiguous region for prefetching (2MiB vs 4KiB).
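To put rough numbers on the TLB argument, here is a minimal C sketch that counts how many pages (and hence how many distinct address translations) a buffer spans at a given page size. The helper names pages_spanned and square_matrix_bytes are hypothetical and only illustrate the arithmetic; they say nothing about measured MKL performance.

```c
#include <assert.h>
#include <stdint.h>

/* Number of pages needed to cover `bytes` bytes, rounding up.
 * Hypothetical helper for illustration; page_size must be non-zero. */
static uint64_t pages_spanned(uint64_t bytes, uint64_t page_size)
{
    assert(page_size > 0);
    return (bytes + page_size - 1) / page_size;
}

/* Bytes occupied by an n x n matrix of doubles. */
static uint64_t square_matrix_bytes(uint64_t n)
{
    return n * n * (uint64_t)sizeof(double);
}
```

For a 4096 x 4096 matrix of doubles (128 MiB), pages_spanned gives 32768 pages at 4KiB versus 64 pages at 2MiB, which is why far fewer TLB entries are needed to cover the same data.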
I was wondering if there is any other way to utilize large pages besides the hugeTLBfs library.
In other systems one can specify the page size per segment and obtain tangible speedups in applications suffering from TLB shortages. On that note, a smaller page size (e.g., 64 to 128 KiB) would be preferable to the 2MiB (or 1GiB) page size.
Is there any direction within Intel to make large pages more easily accessible to apps?
The MKL team has done some internal testing with huge pages. And you are correct: there are benefits to using larger pages for some problems. I have no knowledge, however, of any performance comparisons in external charts. I'm sorry, but I'm uncertain how to assist further in that regard.
You also ask if there is any way to utilize large pages besides the Huge TLBfs library. I'm not familiar with that library, but after some web research, it appears that it provides a convenient mechanism that does much of the huge page work for you. The use of different sized pages must be controlled by the operating system. The library you describe appears to be a simple method for use on many recent Linux kernels. While you can no doubt create your own mechanisms, I'm not sure that they will be any easier to wield than Huge TLBfs.
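Since the page sizes are controlled by the operating system, a first step is usually to ask the kernel what it offers. The following is a minimal C sketch, assuming a Linux system where /proc/meminfo carries a "Hugepagesize:" line; the function names are hypothetical and not part of MKL or of the Huge TLBfs library.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Parse one /proc/meminfo line. Returns the huge page size in KiB,
 * or -1 if the line is not a "Hugepagesize:" entry. */
static long parse_hugepagesize_kib(const char *line)
{
    long kib = -1;
    assert(line != NULL);
    if (sscanf(line, "Hugepagesize: %ld kB", &kib) == 1)
        return kib;
    return -1;
}

/* Default huge page size in KiB as reported by the kernel, or -1 if
 * /proc/meminfo is unreadable or has no Hugepagesize entry. */
static long default_hugepage_kib(void)
{
    char line[256];
    long kib = -1;
    FILE *f = fopen("/proc/meminfo", "r");
    if (f == NULL)
        return -1;
    while (kib < 0 && fgets(line, sizeof line, f) != NULL)
        kib = parse_hugepagesize_kib(line);
    fclose(f);
    return kib;
}

/* Base page size in bytes, from POSIX sysconf. */
static long base_page_bytes(void)
{
    return sysconf(_SC_PAGESIZE);
}
```

On a typical Intel64 Linux box one would expect base_page_bytes() to report 4096 and default_hugepage_kib() to report 2048, but both depend on the kernel configuration, so treat the result as something to check rather than assume.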
I suspect that the optimal page size is sensitive to both the problem size and the particular library call.
Of course, if you place your data in huge pages and make an MKL call, the call will gain any advantages of having that data in huge pages. However, we do NOT query the system to see if huge pages are available, nor do we use huge pages for any temporary buffers. Nor do we have any utilities that give you such a broad range of functionality as Huge TLBfs.
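As an illustration of placing your own data in huge pages before handing it to a library, here is a minimal C sketch that uses the Linux mmap flag MAP_HUGETLB directly and falls back to regular pages when no huge pages are available. This is not something discussed above or provided by MKL: it assumes a kernel that supports MAP_HUGETLB, a 2MiB huge page size, and an administrator who has reserved huge pages. The names alloc_prefer_huge, round_up_huge and HUGE_PAGE_BYTES are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#define HUGE_PAGE_BYTES (2UL * 1024 * 1024) /* assumed 2 MiB huge pages */

/* Round `bytes` up to a multiple of the assumed huge page size. */
static size_t round_up_huge(size_t bytes)
{
    return (bytes + HUGE_PAGE_BYTES - 1) / HUGE_PAGE_BYTES * HUGE_PAGE_BYTES;
}

/* Map an anonymous buffer, preferring huge pages and falling back to
 * regular pages. *used_huge is set to 1 or 0 accordingly, and
 * *mapped_len receives the length to pass to munmap().
 * Returns NULL if both attempts fail. */
static void *alloc_prefer_huge(size_t bytes, size_t *mapped_len, int *used_huge)
{
    void *p = MAP_FAILED;
    assert(bytes > 0 && mapped_len != NULL && used_huge != NULL);
    *mapped_len = round_up_huge(bytes);
    *used_huge = 0;
#ifdef MAP_HUGETLB
    p = mmap(NULL, *mapped_len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED) {
        *used_huge = 1;
        return p;
    }
#endif
    /* No huge pages available (or flag unsupported): use regular pages. */
    p = mmap(NULL, *mapped_len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

A buffer obtained this way could then be filled and passed to an MKL routine like any other pointer; checking used_huge tells you whether a timing run actually exercised huge pages or silently fell back.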
I realize I haven't answered all of your questions, but I hope something here helps.
Thanks for the reply. For some reason I missed your answer to my Q on this forum.
My inquiry into large vs regular page performance on the Intel64 platform was motivated by the fact that on other platforms (e.g., Power5+/AIX) large pages (namely 64KiB) provide tangible performance improvements for various reasons, but I couldn't find any discussion of this for the Intel64 architecture.
Given the limited h/w address translation resources (e.g., TLBs), larger pages should benefit Intel64, for instance by making the multi-level data prefetching more efficient, lowering TLB miss penalties, etc. I think that as the number of cores in a ccNUMA architecture increases (which is the trend), TLB miss penalties will become more of an issue. In the high-speed RDMA-enabled interconnect domain (IB, iWARP, etc.) large pages are convenient for sharing buffers between device driver and user.
I know that Nehalem supports 2 (or 4) MiB "huge" pages, but those are usually unwieldy to manage and may easily become burdensome to users. Moderate-size pages of 64KiB or so are more convenient from both the kernel and the user point of view.
Is there any discussion within Intel to make large pages more readily available to users?