[cpp]
void foo(char* dest, char* source, int size)
{
    assert(size < 128 * 1000);

    // "source" could be a memory-mapped file.
    VirtualLock(source, size); // Will this block until the requested data is in memory?

    tbb::parallel_for(tbb::blocked_range<int>(0, size),
        [dest, source](const tbb::blocked_range<int>& r) {
            // Read from "source" without blocking the task scheduler?
            ...
        });

    VirtualUnlock(source, size);
}
[/cpp]
Jim Dempsey
[cpp]
auto source = (const char*)MapViewOfFile(file_source, FILE_MAP_READ, high, low, size);

// How do I avoid having the compiler optimize away this?
volatile char dummy;
for(int n = 0; n < size; n += 4096)
    dummy = source[n];

tbb::parallel_for(tbb::blocked_range<int>(0, size),
    [dest, source](const tbb::blocked_range<int>& r) {
        // Read from "source" without blocking the task scheduler?
        ...
    });
[/cpp]
(Added) Also, using just one thread to stride across that file may imply a lot of accumulated latency. VirtualLock may or may not be able to issue multiple read requests in parallel all by itself, assuming the disk driver can service those more quickly than serialised requests (Jim?).
// How do I avoid having the compiler optimize away this?
volatile char dummy;
for(int n = 0; n < size; n += 4096)
    dummy = source[n];
You have to outfox the compiler.
Try replacing the dummy fetch with _mm_prefetch(source + n, _MM_HINT_T0);
or use dummy as:
volatile char dummy = 0;
for(int n = 0; n < size; n += 4096)
    dummy |= source[n];
if(dummy == 0) printf("Not possible"); // assuming you know the above is not possible
Jim Dempsey
VirtualLock( source, size ); // Will this block until the requested data is in memory?
...
This is what MSDN says:
The VirtualLock function locks the specified region of the process's virtual address space into physical memory, ensuring that subsequent access to the region will not incur a page fault.
Good point Sergey,
This assumes that the code or other data is not shoved out, and that the size of the VirtualLock(s) does not exceed the process's in-memory limit.
Jim Dempsey
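For reference, MSDN documents that a process can lock at most roughly its minimum working set size minus a small overhead, so a large VirtualLock() would typically be paired with SetProcessWorkingSetSize(). A minimal Windows sketch (the lock_region() helper and its slack policy are illustrative, not from this thread):

```cpp
#include <windows.h>

// Sketch only: grow the working set before locking a large region.
// Without this, VirtualLock() fails (ERROR_WORKING_SET_QUOTA) once the
// request exceeds the process's minimum working set minus a small overhead.
bool lock_region(void* ptr, SIZE_T size)
{
    SIZE_T minWS = 0, maxWS = 0;
    if (!GetProcessWorkingSetSize(GetCurrentProcess(), &minWS, &maxWS))
        return false;
    // Make room for the region, leaving the existing minimum as slack
    // for code and other data (the exact slack policy needs tuning).
    if (!SetProcessWorkingSetSize(GetCurrentProcess(),
                                  minWS + size, maxWS + size))
        return false;
    return VirtualLock(ptr, size) != 0;
}
```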
As I wrote earlier, could you indicate exactly how well this performs compared to taking no special measures? It is still not clear to me that incurring a big initial delay to avoid any stalling later on is the best approach. If performance is that critical, and the file does not get loaded into memory quickly enough anyway without VirtualLock, I can easily imagine a better approach that detects which parts are currently resident and processes them first, even as other parts are still being loaded. On Linux that could be done with mmap() and mincore().
[SergeyK] That's a good statement!
...doing nothing special...
Also, Ronag89's data set is only 125 KB ( 128 * 1000 = 128,000 bytes ), roughly a 357x357 matrix of 8-bit elements ( type 'char' ).
Wouldn't it be better to use a 'Prefetch' approach? ( See Post #4 by Jim )
Another question is what kind of processing is going to be done?
Best regards,
Sergey
OK, now let's finally have those benchmark numbers, please! :-)
Sorry, wrong choice of words. Of course you should only scale down if you know that the measurement will remain representative, or perhaps be related in a predictable way.
Now that's the context we needed.
I wouldn't bother with locking anything in memory. Instead, have a separate non-TBB scouting thread that keeps track of where processing is currently occurring and strides ahead over the mapped file, touching every memory page up to a distance that you may have to tune, perhaps based on the underlying storage (disk or solid state). The goal is to have that latency absorbed somewhere other than inside the main processing loop. Perhaps that strategy could include locking, but I don't think it would add anything, as long as you don't let the scout get too far ahead.
Raf is asking you to get some real performance numbers, and I also would be glad to see results. I think you can make very simple,
one-line modifications to get these tests done:
>> Test #1 <<
...
VirtualLock( ... ) // With VirtualLock
...
>> Test #2 <<
...
// VirtualLock( ... ) // Without VirtualLock
...
>> Test #3 <<
...
Prefetch( ... ) // With Prefetch as Jim suggested
...
This is it.
Best regards,
Sergey
The solution may be a combination of ideas. If we may assume that processing is quick enough compared with I/O, the original idea may still suffice (lock, process, unlock, repeat). Otherwise, Jim first suggested to have another thread absorb the latency. After additional information was provided (about the significant length of the total file) I added the idea to keep the scout thread reined in to a maximum distance ahead of where processing is occurring. Maybe it may still help to have multiple scout threads, or perhaps one that instead lays out up to 3 simultaneous VirtualLock()'ed regions (one where processing is currently occurring, one acquired and ready, one being acquired) whose size then merits tuning. Only by trying will you discover which is best. I will now try to rein myself in... :-)
The solution may be a combination of ideas. If we may assume that processing is quick enough compared with I/O...
'Ronag89' already mentioned that the application does Huffman decoding on a file several GBs in size. So, in that case, a complete scan of the data set has to be done in order to build a binary tree of nodes, each with a character, its frequency, and a link to a parent node if needed.
PS: Not related to the primary subject of the thread: a very simple test could be done:
- Find a folder on your system that has a couple of GBs of different files
- Compress the folder with WinZip or WinRAR
- How long did it take?
On my system it took a couple of minutes ( the CPU was 100% busy ). As far as I know, WinZip doesn't use a pure Huffman algorithm;
it uses a hybrid algorithm based on LZ77 and Huffman coding. Anyway, it gives you a general idea.
It is getting very interesting, and I'll try to do tests with VirtualLock, without VirtualLock, and with Prefetch on a different algorithm
that uses lots of memory: Strassen's heap-based algorithm for matrix multiplication. Currently it uses the Prefetch approach, and
it will be very easy to test with and without the VirtualLock Win32 API functions.
Best regards,
Sergey
that uses lots of memory: Strassen's Heap Based algorithm for matrix multiplication...
After a number of tests, my conclusion is that it depends on the algorithm (!):
Strassen HBI with Prefetch is ~0.5% slower than with VirtualLock
( the algorithm doesn't use much memory )
Strassen HBC with Prefetch is ~5.9% faster than with VirtualLock
( the algorithm uses a significant amount of memory )
Here are the test results:
>> Prefetch used <<
Strassen HBI
Matrix Size : 1024 x 1024
Matrix Size Threshold: 512 x 512
Matrix Partitions : 1
ResultSets Reflection: N/A
Calculating...
Strassen HBI - Pass 1 - Completed: 15.43800 secs
Strassen HBI - Pass 2 - Completed: 15.35900 secs
Strassen HBI - Pass 3 - Completed: 15.35900 secs
Strassen HBI - Pass 4 - Completed: 15.34400 secs
Strassen HBI - Pass 5 - Completed: 15.36000 secs
ALGORITHM_STRASSEN_HBI - Passed
Strassen HBC
Matrix Size : 1024 x 1024
Matrix Size Threshold: 64 x 64
Matrix Partitions : 2801
ResultSets Reflection: Enabled
Calculating...
Strassen HBC - Pass 1 - Completed: 5.67200 secs
Strassen HBC - Pass 2 - Completed: 4.42200 secs
Strassen HBC - Pass 3 - Completed: 4.42200 secs
Strassen HBC - Pass 4 - Completed: 4.42200 secs
Strassen HBC - Pass 5 - Completed: 4.42200 secs
ALGORITHM_STRASSEN_HBC - Passed
>> VirtualLock used <<
Strassen HBI
Matrix Size : 1024 x 1024
Matrix Size Threshold: 512 x 512
Matrix Partitions : 1
ResultSets Reflection: N/A
Calculating...
Strassen HBI - Pass 1 - Completed: 15.37500 secs
Strassen HBI - Pass 2 - Completed: 15.28100 secs
Strassen HBI - Pass 3 - Completed: 15.28200 secs
Strassen HBI - Pass 4 - Completed: 15.29700 secs
Strassen HBI - Pass 5 - Completed: 15.28100 secs
ALGORITHM_STRASSEN_HBI - Passed
Strassen HBC
Matrix Size : 1024 x 1024
Matrix Size Threshold: 64 x 64
Matrix Partitions : 2801
ResultSets Reflection: Enabled
Calculating...
Strassen HBC - Pass 1 - Completed: 5.93700 secs
Strassen HBC - Pass 2 - Completed: 4.68800 secs
Strassen HBC - Pass 3 - Completed: 4.68700 secs
Strassen HBC - Pass 4 - Completed: 4.70300 secs
Strassen HBC - Pass 5 - Completed: 4.68800 secs
ALGORITHM_STRASSEN_HBC - Passed
Notes:
Strassen HBI - Strassen Heap-Based Incomplete algorithm for matrix multiplication
Strassen HBC - Strassen Heap-Based Complete algorithm for matrix multiplication