Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Working with Memory Mapped Files?

ronag89
Beginner
1,178 Views
I have a function that takes a pointer to some memory that is memory-mapped, and I want to perform some computation on this memory using a parallel_for loop. The problem is that accessing this pointer can block if the specific page has not yet been transferred into system memory. My question is: how should I handle such a scenario?
Would something like the following work properly?
[cpp]void foo(char* dest, char* source, int size)
{
    assert(size < 128 * 1000);

    // "source" could be a memory-mapped file.
    VirtualLock(source, size); // Will this block until the requested data is in memory?

    tbb::parallel_for(tbb::blocked_range<int>(0, size), [dest, source](const tbb::blocked_range<int>& r)
    {
        // Read from "source" without blocking the task-scheduler?  ...
    });

    VirtualUnlock(source, size);
}[/cpp]
0 Kudos
20 Replies
jimdempseyatthecove
Honored Contributor III
1,162 Views
There is no distinction (latency-wise) between a memory-mapped file and an array that has been allocated but not touched (or an array allocated and paged out). If you have sufficient RAM, then consider adding code (immediately following the call that maps the file) which walks the file page by page (4KB/4MB), reading a byte/word/dword through the length of the file. This moves the latency into the file open (map) step. You could also open/map the file and spawn a non-TBB thread that walks the file and exits. Your VirtualLock then may or may not block, depending on which thread touches the virtual memory first (assuming the entire app, with the mapped file, fits in physical RAM).
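A minimal sketch of that second variant, assuming C++11 and a char* view returned by MapViewOfFile (warm_up_async is an illustrative name, not part of any API):

[cpp]#include <cstddef>
#include <thread>

// Illustrative sketch only: a detached non-TBB thread touches one byte per
// 4 KB page so the page-in latency overlaps the file-open/map step rather
// than the parallel computation.
void warm_up_async(const char* base, std::size_t size)
{
    std::thread([base, size]
    {
        volatile char sink = 0;
        for (std::size_t n = 0; n < size; n += 4096) // one read per page
            sink |= base[n];                         // faults the page in
    }).detach();
}[/cpp]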

Jim Dempsey
0 Kudos
ronag89
Beginner
1,162 Views
So if I understood you correctly something like this?

[cpp]auto source = static_cast<char*>(MapViewOfFile(file_source, FILE_MAP_READ, high, low, size));

// How do I avoid having the compiler optimize this away?
volatile char dummy;
for(int n = 0; n < size; n += 4096)
    dummy = source[n]; // touch one byte per 4 KB page

tbb::parallel_for(tbb::blocked_range<int>(0, size), [dest, source](const tbb::blocked_range<int>& r)
{
    // Read from "source" without blocking the task-scheduler?  ...
});[/cpp]

0 Kudos
RafSchietekat
Valued Contributor III
1,162 Views
How about benchmarking a few combinations to find out empirically whether any of that is better than doing nothing special? That is especially difficult to predict without our knowing whether the application has other work to do while this is going on.

(Added) Also, a single thread striding across that file may imply a lot of accumulated latency. VirtualLock may or may not be able to send out multiple read requests in parallel all by itself, assuming that the disk driver can service those quicker than serialised requests (Jim?).
0 Kudos
jimdempseyatthecove
Honored Contributor III
1,162 Views
  // How do I avoid having the compiler optimize this away?
  volatile char dummy;
  for(int n = 0; n < size; n += 4096)
      dummy = source[n];

    You have to outfox the compiler.

    Try replacing the dummy fetch with _mm_prefetch(source + n, _MM_HINT_T0); (declared in <xmmintrin.h>)

    or use dummy as

    volatile char dummy = 0;
    for(int n = 0; n < size; n += 4096)
        dummy |= source[n];
    if(dummy == 0) printf("Not possible");
    // assuming you know the above is not possible

    Jim Dempsey
0 Kudos
SergeyKostrov
Valued Contributor II
1,162 Views
Quoting ronag89
...
          VirtualLock( source, size ); // Will this block until the requested data is in memory?
...


This is what MSDN says:

     The VirtualLock function locks the specified region of the process's virtual address space into physical memory, ensuring that
     subsequent access to the region will not incur a page fault.

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,162 Views
>>VirtualLock

Good point Sergey,

This assumes that the code or other data is not shoved out, and that the size of the VirtualLock(s) does not exceed the process's in-memory limit.

Jim Dempsey

0 Kudos
ronag89
Beginner
1,162 Views
MSDN is incorrect according to Raymond Chen (http://blogs.msdn.com/b/oldnewthing/archive/2007/11/06/5924058.aspx): VirtualLock only locks the pages into the working set. However, I would guess that should be good enough in this case.
I'm doing both VirtualLock and touching each page just to be sure; it seems to work well.
0 Kudos
RafSchietekat
Valued Contributor III
1,162 Views
"I'm doing both VirtualLock and touch each page just to be sure, seems to work well."
As I wrote earlier, could you indicate exactly how well compared to no special measures? It is still not clear to me that incurring a big initial delay to avoid any stalling later on is the best approach. If performance is that critical, and the file does not get loaded into memory quickly enough anyway without VirtualLock, I can easily imagine a better approach that detects which parts are currently present and processes them first even as other parts are still being loaded. On Linux that could be done with mmap() and mincore().
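For example, a minimal sketch of that Linux approach (untested; assumes a page-aligned address straight from mmap(), and resident_pages is an illustrative name):

[cpp]#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>
#include <vector>

// Sketch of the mmap()/mincore() idea: ask the kernel which pages of a
// mapping are currently resident, so resident regions can be processed
// first while the rest is still being read from disk.
std::vector<unsigned char> resident_pages(void* addr, std::size_t length)
{
    const std::size_t page = (std::size_t)sysconf(_SC_PAGESIZE);
    std::vector<unsigned char> vec((length + page - 1) / page);
    mincore(addr, length, vec.data()); // bit 0 of each entry: page resident?
    return vec;
}[/cpp]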
0 Kudos
SergeyKostrov
Valued Contributor II
1,162 Views
Hi everybody,

Quoting Raf Schietekat
How about benchmarking a few combinations to find out empirically whether any of that is better than

     [SergeyK] That's a good statement!

doing nothing special...

Also, Ronag89's data set is only 125KB ( 128 * 1000 = 128000 bytes ), which is a ~357x357 matrix of 8-bit elements ( type 'char' ).

Wouldn't it be better to use a 'Prefetch' approach? ( See Post #4 by Jim )

Another question is what kind of processing is going to be done?

Best regards,
Sergey
0 Kudos
RafSchietekat
Valued Contributor III
1,162 Views
Oh yes, it's just a small file... So much depends on what prefetch strategy is implied by the mapping itself (anybody?), and the initial delay of VirtualLock should be small enough that it isn't worth trying to start computing while the data is still arriving from disk. Again, it's hard to say without empirical data and more knowledge about what the program is doing and wants to achieve; there may not even be enough computational work, compared to data loading, to bother with parallelisation here (especially given the initial latency of finding the file and starting to load it). The prudent (pedantic?) advice would of course be to always have the data in memory first (strictly speaking, the entire mapping and locking should occur outside of TBB, which may require special attention to avoid any use of TBB at all, now that explicit use of task_scheduler_init is no longer required to associate a thread with the scheduler), but we don't know yet whether it will make an actual difference.

OK, now let's finally have those benchmark numbers, please! :-)
0 Kudos
ronag89
Beginner
1,162 Views
Sorry for the late reply.
The entire dataset will be several gigabytes. However, I am sequentially mapping regions (128KB) of the file and then processing the regions in parallel using TBB; tbb::task::enqueue instead of tbb::parallel_for would probably have been a better example.
It goes something like this:
for(int64_t bytes = 0; bytes < file_size; bytes += 128 * 1024)
{
       1. data = map 128 KB;
       2. make sure data access doesn't block
       3. tbb::task::enqueue(/*...*/);
}
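A more concrete, hedged sketch of those three steps using the tbb::task::enqueue API (DecodeTask, decode_file and the huffman_decode comment are illustrative names; error handling omitted; 128 KB is conveniently a multiple of the 64 KB offset granularity that MapViewOfFile requires):

[cpp]#include <windows.h>
#include <tbb/task.h>

// Illustrative sketch only, not a definitive implementation.
class DecodeTask : public tbb::task
{
    char*  view;
    SIZE_T bytes;
public:
    DecodeTask(char* v, SIZE_T b) : view(v), bytes(b) {}
    tbb::task* execute()
    {
        // huffman_decode(view, bytes);  // hypothetical decoder
        UnmapViewOfFile(view);           // each task unmaps its own chunk
        return NULL;
    }
};

void decode_file(HANDLE mapping, unsigned long long file_size)
{
    const unsigned long long CHUNK = 128 * 1024;
    for (unsigned long long offset = 0; offset < file_size; offset += CHUNK)
    {
        SIZE_T bytes = (SIZE_T)(file_size - offset < CHUNK ? file_size - offset : CHUNK);
        char* view = (char*)MapViewOfFile(mapping, FILE_MAP_READ,
                                          (DWORD)(offset >> 32), (DWORD)offset, bytes);
        volatile char sink = 0;                  // step 2: touch each page up front
        for (SIZE_T n = 0; n < bytes; n += 4096)
            sink |= view[n];
        tbb::task::enqueue(*new(tbb::task::allocate_root()) DecodeTask(view, bytes)); // step 3
    }
}[/cpp]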
As for how much processing: I am performing Huffman decoding of the data.
Will the prefetch approach hint that the data should be loaded into the L2 cache? Loading into the L1 cache would be wasteful, since there is no guarantee that the data will be processed on the same core where the prefetching is performed.
Regarding benchmarks, I have had some problems getting good results, since as soon as whatever I'm reading is in the Windows cache the results are pretty much useless. I will continue to try to get some form of benchmark.
0 Kudos
ronag89
Beginner
1,162 Views
It is difficult to get good results from a benchmark, since as soon as the file is in the Windows cache any results are pretty much useless. I'll continue to try to get empirical data.
The file itself is on the order of several GB; however, I am sequentially mapping regions of the file and then using TBB to process the regions in parallel. tbb::task::enqueue would have been a better choice for this example than parallel_for. The processing performs Huffman decoding on the data.
So basically what I'm doing is:
for every 128KB in file
    map 128KB region
    ensure region is in memory
    tbb::task::enqueue region processing
0 Kudos
RafSchietekat
Valued Contributor III
1,162 Views
"It is difficult to get good results from a benchmark, since as soon as the file is in windows cache any results are pretty much useless. I'll continue to try to get emperical data."
Sorry, wrong choice of words. Of course you should only scale down if you know that the measurement will remain representative, or perhaps be related in a predictable way.

"The file itself is in the order of several GB, however, I am sequentially mapping regions of the file and then using TBB to process the regions i parallel. tbb::enqueue would have been a better choice for this example than parallel_for. The processing performs huffman decoding on the data.
So basically what I'm doing
for every 128KB in file
    map 128KB region
    ensure region is in memory
    tbb::enqueue region processing"
Now that's the context we needed.

I wouldn't bother with locking anything in memory, but you should have a separate non-TBB scouting thread that keeps track of where processing is currently occurring and strides ahead over the mapped file, touching every memory page up to a distance that you may have to tune, perhaps based on the underlying storage (disk or solid state), because you want that latency absorbed somewhere other than inside the main processing loop. Perhaps that strategy could include locking, but I don't think it would add anything, as long as you don't let the scout get too far ahead.
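A minimal sketch of such a scout (assuming C++11; processed would be advanced by the decoding tasks as they finish, and max_ahead is the distance to tune):

[cpp]#include <atomic>
#include <cstddef>
#include <thread>

// Illustrative sketch: the scout touches one byte per 4 KB page, staying at
// most max_ahead bytes in front of the consumer, so the page-in latency is
// absorbed outside the main processing loop.
std::atomic<std::size_t> processed(0); // advanced by the consumer

void scout(const char* base, std::size_t size, std::size_t max_ahead)
{
    volatile char sink = 0;
    std::size_t touched = 0;
    while (touched < size)
    {
        std::size_t limit = processed.load(std::memory_order_relaxed) + max_ahead;
        if (limit > size) limit = size;
        for (; touched < limit; touched += 4096)
            sink |= base[touched];
        std::this_thread::yield(); // reined in: wait for the consumer to catch up
    }
}[/cpp]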

(Added) Hmm, why is this formatted differently...
0 Kudos
SergeyKostrov
Valued Contributor II
1,162 Views
Hi everybody,

Quoting ronag89
It is difficult to get good results from a benchmark, since as soon as the file is in the Windows cache any results are pretty much useless...

Raf is asking you for some real performance numbers, and I would also be glad to see results. I think you can do very simple,
one-line modifications to get these tests done:

>> Test #1 <<
     ...
     VirtualLock( ... )                           // With VirtualLock
     ...

>> Test #2 <<
     ...
//   VirtualLock( ... )                           // Without VirtualLock
     ...

>> Test #3 <<
     ...
     Prefetch( ... )                               // With Prefetch as Jim suggested
     ...

This is it.
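To time each variant, something as small as this would do (run_test is a hypothetical stand-in for the map/lock/process sequence; remember to start each run from a cold file cache, as noted earlier in the thread):

[cpp]#include <windows.h>
#include <cstdio>

// Minimal QueryPerformanceCounter harness for the three tests above.
double time_one(void (*run_test)())
{
    LARGE_INTEGER f, t0, t1;
    QueryPerformanceFrequency(&f);
    QueryPerformanceCounter(&t0);
    run_test();
    QueryPerformanceCounter(&t1);
    return double(t1.QuadPart - t0.QuadPart) / double(f.QuadPart); // seconds
}[/cpp]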

Best regards,
Sergey
0 Kudos
RafSchietekat
Valued Contributor III
1,162 Views
"This is it."
The solution may be a combination of ideas. If we may assume that processing is quick enough compared with I/O, the original idea may still suffice (lock, process, unlock, repeat). Otherwise, Jim first suggested having another thread absorb the latency. After additional information was provided (about the significant length of the total file), I added the idea of keeping the scout thread reined in to a maximum distance ahead of where processing is occurring. It may still help to have multiple scout threads, or perhaps one that instead lays out up to 3 simultaneous VirtualLock()'ed regions (one where processing is currently occurring, one acquired and ready, one being acquired), whose size would then merit tuning. Only by trying will you discover which is best. I will now try to rein myself in... :-)
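A deliberately simplified sketch of that rotation (process() and REGION are illustrative; here the look-ahead VirtualLock is issued synchronously for brevity, whereas the full version would issue it from a helper thread so it overlaps process(), giving the third, "being acquired" region):

[cpp]#include <windows.h>

void process(char* data, size_t len); // hypothetical worker, defined elsewhere

// Illustrative sketch only: while region i is processed, region i+1 has
// already been locked; finished regions are unlocked behind us.
void process_rotating(char* base, size_t size, size_t REGION)
{
    VirtualLock(base, size < REGION ? size : REGION);      // region 0: ready
    for (size_t off = 0; off < size; off += REGION)
    {
        size_t len  = size - off < REGION ? size - off : REGION;
        size_t next = off + REGION;
        if (next < size)                                   // acquire the next region ahead
            VirtualLock(base + next, size - next < REGION ? size - next : REGION);
        process(base + off, len);
        VirtualUnlock(base + off, len);                    // release the finished region
    }
}[/cpp]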
0 Kudos
SergeyKostrov
Valued Contributor II
1,162 Views
"This is it."
The solution may be a combination of ideas.
If we may assume that processing is quick enough compared with I/O... 

'Ronag89' already told us that the application does Huffman decoding on a file several GBs in size. So, in that case, a complete
scan of the data set has to be done in order to build a binary tree of nodes, each with a character, its frequency, and a link to a parent node
if needed.

PS: Not related to the primary subject of the thread: a very simple test could be done:

- Find a folder on your system that has a couple of GBs of different files
- Compress the folder with Winzip or Winrar
- How long did it take?

On my system it took a couple of minutes ( CPU was busy at 100% ). As far as I know, Winzip doesn't use a pure Huffman algorithm;
it uses a hybrid algorithm based on LZ77 and Huffman coding. Anyway, it gives you a general idea.

It gets very interesting, and I'll try to do tests with VirtualLock, without VirtualLock, and with Prefetch on a different algorithm
that uses lots of memory: Strassen's Heap-Based algorithm for matrix multiplication. Currently it uses the Prefetch approach, and
it will be very easy to test with and without the VirtualLock Win32 API function.

Best regards,
Sergey
0 Kudos
RafSchietekat
Valued Contributor III
1,162 Views
A useful piece of information would be to measure the fraction of time spent processing in the initial implementation idea (lock, compute, unlock, repeat), because this indicates a limit to any optimisation (it should be neither too low nor too high).
0 Kudos
SergeyKostrov
Valued Contributor II
1,162 Views
...I'll try to do tests with VirtualLock, without VirtualLock and with Prefetch on a different algorithm
that uses lots of memory: Strassen's Heap-Based algorithm for matrix multiplication...
 

After a number of tests, my conclusion is that it depends on the algorithm (!):

   Strassen HBI with Prefetch is ~0.5% slower than with VirtualLock
          ( the algorithm doesn't use much memory )

   Strassen HBC with Prefetch is ~5.9% faster than with VirtualLock
          ( the algorithm uses a significant amount of memory )

Here are the test results:

>> Prefetch used <<

   Strassen HBI
   Matrix Size          : 1024 x 1024
   Matrix Size Threshold: 512 x 512
   Matrix Partitions    : 1
   ResultSets Reflection: N/A
   Calculating...
   Strassen HBI - Pass  1 - Completed: 15.43800 secs
   Strassen HBI - Pass  2 - Completed: 15.35900 secs
   Strassen HBI - Pass  3 - Completed: 15.35900 secs
   Strassen HBI - Pass  4 - Completed: 15.34400 secs
   Strassen HBI - Pass  5 - Completed: 15.36000 secs
   ALGORITHM_STRASSEN_HBI - Passed

   Strassen HBC
   Matrix Size          : 1024 x 1024
   Matrix Size Threshold: 64 x 64
   Matrix Partitions    : 2801
   ResultSets Reflection: Enabled
   Calculating...
   Strassen HBC - Pass  1 - Completed: 5.67200 secs
   Strassen HBC - Pass  2 - Completed: 4.42200 secs
   Strassen HBC - Pass  3 - Completed: 4.42200 secs
   Strassen HBC - Pass  4 - Completed: 4.42200 secs
   Strassen HBC - Pass  5 - Completed: 4.42200 secs
   ALGORITHM_STRASSEN_HBC - Passed

>> VirtualLock used <<

   Strassen HBI
   Matrix Size          : 1024 x 1024
   Matrix Size Threshold: 512 x 512
   Matrix Partitions    : 1
   ResultSets Reflection: N/A
   Calculating...
   Strassen HBI - Pass  1 - Completed: 15.37500 secs
   Strassen HBI - Pass  2 - Completed: 15.28100 secs
   Strassen HBI - Pass  3 - Completed: 15.28200 secs
   Strassen HBI - Pass  4 - Completed: 15.29700 secs
   Strassen HBI - Pass  5 - Completed: 15.28100 secs
   ALGORITHM_STRASSEN_HBI - Passed

   Strassen HBC
   Matrix Size          : 1024 x 1024
   Matrix Size Threshold: 64 x 64
   Matrix Partitions    : 2801
   ResultSets Reflection: Enabled
   Calculating...
   Strassen HBC - Pass  1 - Completed: 5.93700 secs
   Strassen HBC - Pass  2 - Completed: 4.68800 secs
   Strassen HBC - Pass  3 - Completed: 4.68700 secs
   Strassen HBC - Pass  4 - Completed: 4.70300 secs
   Strassen HBC - Pass  5 - Completed: 4.68800 secs
   ALGORITHM_STRASSEN_HBC - Passed

Notes:

   Strassen HBI - Strassen Heap-Based Incomplete algorithm for matrix multiplication
   Strassen HBC - Strassen Heap-Based Complete algorithm for matrix multiplication

0 Kudos
SergeyKostrov
Valued Contributor II
1,162 Views
This is a follow up...

Hi everybody,

I'd like to stress that Microsoft's Virtual Memory Manager ( VMM ) will take care of all memory-related issues; it's a big question whether you need to do anything else. Take a look at the results of a matrix multiplication test ( 2048x2048 / single-precision type ) on a system with constrained memory resources:

   Strassen HBC
   Matrix Size          : 2048 x 2048
   Matrix Size Threshold: 128 x 128
   Matrix Partitions    : 2801
   ResultSets Reflection: Enabled
   Calculating...
   Strassen HBC - Pass  1 - Completed: 343.42200 secs
   Strassen HBC - Pass  2 - Completed:  82.95300 secs
   Strassen HBC - Pass  3 - Completed:  76.42200 secs
   Strassen HBC - Pass  4 - Completed:  76.39000 secs
   Strassen HBC - Pass  5 - Completed:  76.14100 secs
   ALGORITHM_STRASSEN_HBC - Passed

In that test the Strassen HBC algorithm uses ~1.9GB of memory on a 32-bit Windows platform. During Pass 1 the VMM pre-created all the pages, which took ~263 secs in total ( ~343 - ~80 = ~263 ). For the following Passes 2, 3, 4 and 5 there is no need to do that again, and the VMM loads pages into memory only when they are needed; a significant number of pages could also already be cached. Overall, this improves the performance of the calculations by ~4.5x for Passes 2 through 5.

My point of view is as follows:

- Test your algorithm as thoroughly as possible in order to see whether a page-locking technique with the VirtualLock Win32 API function works and increases the performance of computations
- The call overhead of Intel CPU prefetchtx instructions is significantly smaller than that of the VirtualLock Win32 API function
- Consider a complete pre-scan of your Memory Mapped File in order to 'warm up' the data:

   ...
   WarmUpData( hMemMappedFile );
   ...
   // Do Parallel Computations ( without a call to VirtualLock but with a call to Prefetch )
   parallel_for( ... )
   {
      ...
   }
   ...

  and verify that it increases the performance of computations.

Best regards,
Sergey
0 Kudos
jimdempseyatthecove
Honored Contributor III
989 Views
I suggest that you do not use parallel_for for the outer level of your processing (gulp by gulp of the memory mapped file). Instead, consider using parallel_pipeline *** together *** with a prefetcher thread.

Reason: use of parallel_for will partition the iteration space and therefore introduce additional seek latencies. parallel_pipeline would result in each thread processing (gulps) sequentially, thus reducing seek latencies. The pre-fetch thread would hopefully assure that data is present in RAM prior to a TBB task taking the next node for processing.

If you do not want to expend the effort of writing a pre-fetch thread, then use parallel_pipeline, but also oversubscribe the number of threads, e.g. number of HW threads + 1 or 2. Don't oversubscribe too far. When a thread stalls for a read, a full complement of threads can continue processing. Much will depend on the ratio of computational latency versus I/O latency.

Jim Dempsey
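As a rough illustration of that structure (Chunk, map_next_chunk, huffman_decode and release_chunk are all hypothetical names; the serial_in_order input filter hands out chunks sequentially, preserving sequential disk access, while the parallel filter decodes):

[cpp]#include <cstddef>
#include <tbb/pipeline.h>
#include <tbb/task_scheduler_init.h>

struct Chunk { char* data; size_t bytes; };

// Hypothetical helpers, defined elsewhere: map + touch the next 128 KB,
// decode it, and unmap it.
Chunk map_next_chunk(size_t offset);
void  huffman_decode(const char* data, size_t bytes);
void  release_chunk(Chunk c);

void run(size_t file_size)
{
    // The oversubscription suggestion: a thread or two beyond the HW count.
    tbb::task_scheduler_init init(tbb::task_scheduler_init::default_num_threads() + 2);

    size_t offset = 0;
    tbb::parallel_pipeline(8, // max chunks in flight
        tbb::make_filter<void, Chunk>(tbb::filter::serial_in_order,
            [&](tbb::flow_control& fc) -> Chunk
            {
                if (offset >= file_size) { fc.stop(); return Chunk(); }
                Chunk c = map_next_chunk(offset); // sequential reads preserve disk order
                offset += c.bytes;
                return c;
            })
        &
        tbb::make_filter<Chunk, void>(tbb::filter::parallel,
            [](Chunk c)
            {
                huffman_decode(c.data, c.bytes);
                release_chunk(c);
            }));
}[/cpp]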
0 Kudos
Reply