Roman,
Looking at your chart, at the 1.02 line: if you had peaked the memory bandwidth, you would have expected the line from 3-4 cores to be flat (or near flat), i.e. closer to the 1.17 line. The precipitous drop indicates cache evictions are taking place. As to the cause, you may have to spend some time using a profiler. However, in doing so you will likely alter the results.
You haven't said much about the application. From looking at the chart, one potential cause is your threading model.
Would you by chance be compressing the data using your scaled 1, 2, 3, 4 threads while using an additional thread in each configuration to write the results? If so, you would expect to see the lines as you have produced them. As you increase the compression ratio, the work of the extra thread decreases. Just a wild guess on my part.
Jim Dempsey
Let's talk about the 4-thread scenario.
Each of the 4 threads has a circular buffer, right?
What fills each buffer? What empties each buffer?
Is the thread filling the buffer an auxiliary thread supplying data (or a volunteer from one or all of the 4 worker threads)?
Is the thread emptying the buffer an auxiliary thread (the same as, or different from, the thread filling data), or a volunteer from one or all of the 4 worker threads?
If another thread or threads are involved in filling and/or emptying these buffers, then this is the cause of the drop in your curve (this is what I tried to explain in the prior post).
These threads (if present), or the volunteers from the 4 worker threads, are consuming time (more importantly, processor cycles) that is not factored _out_ of your scalability curve. The less "compression", the more data must be exported from the circular buffers. The processor "bandwidth" required to export this data is inversely proportional to the compression ratio, and this processor "bandwidth" is lost from the bandwidth available to perform the compression. When using extra threads to export data, these threads work more (consume more of the available processor bandwidth) for less-compressed data than for more-compressed data.
As a quick test, null out the routine that exports the data from your circular buffer. See what your curves look like.
Jim Dempsey
Roman,
>>If the OS needs to do something and it interrupts a thread for a millisecond then that is enough time to fill all the buffers and potentially block the computation.
After reading your paper and thinking about it over a cup of coffee I have come up with a recommendation. Perhaps you have already tried this technique. If not, and it provides the solution, I would appreciate a mention of my contribution in your publications.
You've (we've) postulated that the drop in the last thread's scaling factor is due to a process outside your application (an interloper thread) being run by the O/S, delaying one of your threads from completing a task within a time window.
The suggested workaround won't be free of overhead, but the overhead expense can be tuned and traded off against worst-case (or average-case) latency.
Since inserting LOCKs is an expensive operation, we will construct the workaround using "lazy writes" (write instructions without a LOCK prefix or memory fence).
Observation 1) "Lazy writes" by one thread will eventually be seen by other threads.
Observation 2) "Lazy writes" may be seen by thread(s) sharing the L2 sooner than by threads not sharing the L2.
Observation 3) Data in the L2 of one thread is visible to the thread(s) sharing that L2.
Observation 4) The interloper thread may interrupt your "computation" phase, your local heap merge phase, or your global heap merge phase.
Observation 5) A merge operation has a finite number of inputs and outputs and, for given sets of input data, will produce the same set of output data.
Observation 6) For any given merge, either thread sharing the same L2 should have near-identical performance (with a small caveat for L1).
Observation 7) For any given merge, and given a reconciled starting point, multiple threads can perform part of the merge over the same section of input/output data without disturbing the results (but at an interaction expense).
Observation 8) Any merge may be suspended (by the interloper thread).
Observation 9) Using observation 1), any thread A sharing an L2 with thread B (A' sharing another L2 with B', etc.) can monitor the other L2-sharing thread's progress: not exactly, due to lazy writes, but to the extent that it can observe that the other thread has a) completed or b) been blocked (by the interloper thread).
Now, to work the observations to your advantage:
For the local heap merge:
Use indexes as opposed to pointers. Place your two input indexes into a word-aligned structure. On each iteration (or, as a tuning tweak, every n iterations), perform a lazy write of the two input indexes as one word to memory (non-fencing, non-atomic). The purpose of this write is to permit the other L2-sharing thread, should it be interested, to monitor your merge progress.
Also, as part of your merge, on each iteration (or every n iterations), read from RAM the paired indexes you wrote on the prior iteration and compare them with what is expected to be in the indexes.
For the local heap merge, all four threads begin their merge. On each iteration (or every n iterations), the indices of the prior (or nth prior) iteration are read from RAM and compared against the expected indices. When no interloper thread ran in your space, the indices you read will equal the expected indices, in which case you continue with the merge.
In your merge, after selecting which of the two input candidates to take, on each iteration (or nth iteration) you write the selected candidate indices out to RAM (using a lazy write). When you perform the read of the stored indices but observe they are != the expected indices, assume your thread's merge operation was interrupted by the interloper thread. In that case you enter observer mode. In observer mode you monitor the state of the RAM-saved indices, checking for indices indicating completion or indices that are not advancing. The observation loop will have to determine which is most beneficial: just burning CPU, _mm_pause, a fence, SwitchToThread, etc. Your merge phase, once in observer mode, stays in observer mode until termination is detected or a non-advancing condition is detected. When the non-advancing situation is detected, flip back to merge mode.
Any local heap merge thread is interruptible by the interloper thread, and any interrupted merge operation is capable of being resumed by the other thread(s) sharing the L2.
For the global heap merge, use a similar technique, except that the pool of threads permitted to resume an interrupted merge can be any thread sharing the L3 (an optimization for multi-socketed systems).
Sorry for not writing a pseudo-code example (I'm not on your payroll).
Jim Dempsey
Think of the method I suggested as a "wait-free" technique whereby, when one thread stalls due to O/S intervention, another thread can resume the unfinished task.
Although Sleep may work some of the time, it will not work all (or nearly all) of the time.
Consider using SwitchToThread() over Sleep. Most systems provide Sleep(milliseconds), not microseconds.
If average throughput is what you are interested in, then Sleep or SwitchToThread may be sufficient.
If minimum worst-case timing is important, then consider a "wait-free" technique such as the one I proposed.
The wait-free code would be relatively simple to implement. Maybe tomorrow I will write some pseudo code.
Your chart looks very good now by the way.
Jim Dempsey
Roman,
Untested, un-optimized code sample follows.
Jim Dempsey
[cpp]// buffer.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"

const int sizeData = 123;
struct Item
{
    int key;
    float data[sizeData];
};

const int BufferMax = 456;
struct Buffer
{
    int count;
    Item items[BufferMax];
};

struct CopyControl
{
    // high 16 bits: index into inA, low 16 bits: index into inB
    // volatile so the "lazy" reads/writes are not optimized away
    volatile unsigned __int32 indexPair; // may need to expand to __int64
    Buffer outC;
    Buffer inA;
    Buffer inB;
};

void MergeBuffer(CopyControl& aCopyControl)
{
    unsigned __int32 indexPairCopy1 = 0;
    unsigned __int32 indexPairCopy2 = 0;
    unsigned __int32 MergeDone = (aCopyControl.inA.count << 16) + aCopyControl.inB.count;
    for(;;)
    {
        indexPairCopy1 = aCopyControl.indexPair;
        if(indexPairCopy1 == MergeDone)
            return;
        // see if expected value
        while(indexPairCopy1 != indexPairCopy2)
        {
            // other thread took over control (we stalled)
            indexPairCopy2 = indexPairCopy1;
            SwitchToThread(); // or _mm_pause();
            indexPairCopy1 = aCopyControl.indexPair;
            if(indexPairCopy1 == MergeDone)
                return;
        }
        // here when we are in copy mode
        unsigned int indexA = indexPairCopy1 >> 16;    // union would be faster/better
        unsigned int indexB = indexPairCopy1 & 0xFFFF;
        unsigned int indexC = indexA + indexB;
        if(indexA < aCopyControl.inA.count) // A not exhausted (test was reversed)
        {
            if( (aCopyControl.inB.count == indexB)
             || (aCopyControl.inA.items[indexA].key < aCopyControl.inB.items[indexB].key) )
            {
                // pick A (copy from indexA, not indexB)
                aCopyControl.outC.items[indexC] = aCopyControl.inA.items[indexA];
                indexPairCopy2 = indexPairCopy1 + 0x00010000; // union would be faster/better
                aCopyControl.indexPair = indexPairCopy2;      // lazy write: publish progress
                continue;
            }
        }
        // pick B
        aCopyControl.outC.items[indexC] = aCopyControl.inB.items[indexB];
        indexPairCopy2 = indexPairCopy1 + 0x00000001; // union would be faster/better
        aCopyControl.indexPair = indexPairCopy2;      // lazy write: publish progress
    }
}

// prior to call by either thread:
// myThread.indexPair = 0;
// otherThread.indexPair = 0;
void MergeBuffers(CopyControl& myThread, CopyControl& otherThread)
{
    MergeBuffer(myThread);    // my primary concern
    MergeBuffer(otherThread); // my secondary concern
    myThread.indexPair = 0;   // reset only myThread.indexPair for next time
}

int _tmain(int argc, _TCHAR* argv[])
{
    return 0;
}
[/cpp]