Solved: Core i7 memory bandwidth

roman_pearce · ‎02-13-2009

I'm writing a parallel program that is designed for shared-cache multicore processors. This problem is memory-bound, but I have a high performance program that runs in the cache, etc, etc. Generally speaking, performance on the Core i7 is fantastic, however I encountered some strange results. This graph

http://www.cecm.sfu.ca/~rpearcea/corei7_speedup.pdf

shows parallel speedup on the core i7 as a function of the amount of data produced. The program creates 1 GB of data, and W(f,g) is the factor by which this data is reduced by processing (in the cache) before being written to memory. For example, when W(f,g)=1 the result is 1 GB, and when W(f,g)=1024 the result is 1MB. The program is designed to achieve linear speedup, and it often gets superlinear speedup because of the dedicated caches on the Core i7.

The graph is strange because when lots of data is produced the speedup on 4 cores goes way down but with 3 cores it is ok. This is a Core i7 920 2.66GHz with 6GB DDR3 with speedstep and hyperthreading disabled. Now this is obviously using a lot of memory bandwidth, but the bandwidth seems to be nowhere near the published limitations of the Core i7 or DDR3. It seems I am unable to break 330 MB/sec. I tried prefetching but it didn't help.

Does anyone have any suggestions ?

By the way, the Core i7 is a great chip for parallel programs. Thanks!

jimdempseyatthecove · ‎02-13-2009

Roman,

In looking at your chart, the 1.02 line, if you had peaked the memory band width you would have expected the line from 3-4 cores to have been flat (near flat). i.e. closer to the 1.17 line. The precipitous drop indicate cache evictions are taking place. As to the cause, you may have to spend some time using a profiler. However, in doing so you will likely alter the results.

You haven't said much about the application. From looking at the chart, one potential cause is your threading model.

Would you by chance be compressing the datausing your scaling 1, 2, 3, 4 threads and using an additional thread in each configuration to write the results? If so, you would expect to see the lines as you have produced. As you increase the compression ratio, the work of the extra thread reduces. Just a wild guess on my part.

Jim Dempsey

View solution in original post

jimdempseyatthecove · ‎02-13-2009

Roman,

In looking at your chart, the 1.02 line, if you had peaked the memory band width you would have expected the line from 3-4 cores to have been flat (near flat). i.e. closer to the 1.17 line. The precipitous drop indicate cache evictions are taking place. As to the cause, you may have to spend some time using a profiler. However, in doing so you will likely alter the results.

You haven't said much about the application. From looking at the chart, one potential cause is your threading model.

Would you by chance be compressing the datausing your scaling 1, 2, 3, 4 threads and using an additional thread in each configuration to write the results? If so, you would expect to see the lines as you have produced. As you increase the compression ratio, the work of the extra thread reduces. Just a wild guess on my part.

Jim Dempsey

roman_pearce · ‎02-13-2009

Quoting - jimdempseyatthecove

Roman,

In looking at your chart, the 1.02 line, if you had peaked the memory band width you would have expected the line from 3-4 cores to have been flat (near flat). i.e. closer to the 1.17 line. The precipitous drop indicate cache evictions are taking place. As to the cause, you may have to spend some time using a profiler. However, in doing so you will likely alter the results.

You haven't said much about the application. From looking at the chart, one potential cause is your threading model.

Would you by chance be compressing the datausing your scaling 1, 2, 3, 4 threads and using an additional thread in each configuration to write the results? If so, you would expect to see the lines as you have produced. As you increase the compression ratio, the work of the extra thread reduces. Just a wild guess on my part.

Jim Dempsey

Thank you for your comment, especially about memory bandwidth. I may have to do some profiling. The application multiplies sparse polynomials, and the data is reduced by adding up like terms. I am using finite circular buffers for each thread, and so far it looks like the buffers are filling up or running out whenever four threads are sending a lot of data. For 2-3 threads the size of the buffers directly affects the amount of spinning (typically none) but with four threads it's bad no matter what.

So I need to fix my program. Thanks for your suggestions!

jimdempseyatthecove · ‎02-14-2009

Lets talk about the 4 thread scenario

Each of the 4 threads has a circular buffer, right?

What fills each buffer? - what empties each buffer?

Is the filling buffer thread an auxilliary thread supplying data? (or volunteer from one or all of the 4 worker threads?)

Is the emptying buffer threadan auxilliary thread (same or other thread as thread filling data)? (or volunteer from one or all of the 4 worker threads?)

If another thread or threads are involved in filling and/or emptying these buffers, then this is the cause for the drop in your curve (this is what I tried to explain in prior post).

These threads (if present), or volunteer from one or all of the 4 worker threads, is consuming time (more importantly processor cycles) that are not factored _out_ of your scale-ability curve. The less "compression" the more data must be exported from the circular buffers. The processor "bandwidth" required to export this data is inversely proportional to the compression ratio. And this processor "bandwidth" is lost from the available bandwidt to perform the compression. When using extra threads to export data, these threads work more (consume available processor bandwidth) for lesser compressed data than for more compressed data.

As a quick test, null out the routine that exports the data from your circular buffer. See what your curves look like.

Jim Dempsey

roman_pearce · ‎02-15-2009

Quoting - jimdempseyatthecove

Lets talk about the 4 thread scenario
Each of the 4 threads has a circular buffer, right?
What fills each buffer? - what empties each buffer?

These threads (if present), or volunteer from one or all of the 4 worker threads, is consuming time (more importantly processor cycles) that are not factored _out_ of your scale-ability curve. The less "compression" the more data must be exported from the circular buffers. The processor "bandwidth" required to export this data is inversely proportional to the compression ratio. And this processor "bandwidth" is lost from the available bandwidt to perform the compression. When using extra threads to export data, these threads work more (consume available processor bandwidth) for lesser compressed data than for more compressed data.

As a quick test, null out the routine that exports the data from your circular buffer. See what your curves look like.

The four threads each fill their own buffer and take turns merging the contents of all the buffers (essentially using trylock). I understand that transmitting data from one core to another is consuming processor bandwidth and adding overhead, that's ok. If I turn off buffer consumption and the threads run independently then the algorithm has super linear speedup. I think some problems are caused by scheduling. The buffers have to fit in L3, which limits their size and leaves very little margin for error. If the OS needs to do something and it interrupts a thread for a millisecond then that is enough time to fill all the buffers and potentially block the computation. When the data is reduced 100:1 then this is not a problem. I need to address this problem in a more general way (i.e. get good performance when some cpus are in use), so hopefully I can take care of both problems at once.

If you're interested, the algorithm is described here:

http://www.cecm.sfu.ca/~rpearcea/sdmp/sdmp_pmul.pdf

Thanks again for your help.

jimdempseyatthecove · ‎02-16-2009

Roman,

>>If the OS needs to do something and it interrupts a thread for a millisecond then that is enough time to fill all the buffers and potentially block the computation.

After reading your paper and thinking about it over a cup of coffee I have come up with a recommendation. Perhaps you have already tried this technique. If not, and it provides the solution, I would appreciate a mention of my contribution in your publications.

You've (we've)postulated the drop in the last thread scaling factor is due to a non-your-application process (interloper thread)being run by the O/S that delays one of your threads from completing a task within a time window.

The suggested work around won't be free of overhead but the overhead expense can be tuned and overhead traded off against worst case (or average case)latency.

Since inserting Locks is an expensive operation we will construct the work around using "lazy writes" (non-LOCK prefixed write instructions and/or memory fenced write instructions).

Observation 1) "lazy writes" by one thread will eventually be seen by other threads.
Observation 2) "lazy writes" may be seen by thread(s) sharing L2 sooner than by threads not sharing L2
Observation 3) Data in L2 of one thread is visible to thread(s) sharing L2.
Observation 4) The interloper thread may intervene your"computation" phase, your local heap merge phase, or global heap merge phase.
Observation 5) A merge operation has a finite number of inputs and outputs and forgiven sets of input data will produce the same set of output data.
Observation 6) For any given merge, either thread sharing the same L2 should have near-same performance (small issue with L1)
Observation 7) For any given merge, and given a reconciled starting point, multiple threads can perform part of the merge over the same section of input/output data without disturbing the results (but at an interaction expense)
Observation 8) For any merge, a merge may be suspended (by interloper thread)
Observation 9) Using using observation 1) any thread A sharing L2 with thread B (A' sharing other L2 with B', etc...) can monitor the other L2 sharing thread's progress: not exactly, due to lazy writes, but to the extent that it can observe the other thread has a) completed or b) is blocked (by interloper thread)

Now, to work the observations to your advantage:

For the local heap merge:

Use indexes as opposed to pointers. Place your two input indexes into a word aligned structure. On each iteration, or as part of tweak, on each n iterations, perform lazy write of the two input indexes as word to memory (non-fencing, non-atomic). The purpose of this write is to permit the other L2 sharing thread, should it be interested, in monitoring your merge progress.

Also, as part of your merge, on each iteration, or as part of tweak. on each n iterations, read from RAM the paired indexes you wrote on prior iteratonand compare with what is expected to have been in the indexes.

For local heap merge all four threads begin their merge, on each iteration or as part of tweak, on each n iterations, the indices of the prior iteration or nth prior iteration is read from RAM and compared against the expected indices. When no interloper thread ran in your space, your read indices will == expected indices and in which case you continue with the merge.

In your merge, after selecting which of the two input candidates and each iteration or n'th iteration you write your selected candidate indices out to RAM (using lazy write). When after you perform the read of the stored indices but observe != to expected indices then assume your thread merge operation was interrupted by the interloper thread. In this case you enter observer mode. In observer mode you monitor the RAM saved indices state to check for indices indicating completion or indices not advancing. The observation loop will have to determine if it is beneficial to use:just burn CPU, _mm_pause, (fence), SwitchToThread, etc...). Your merge phase, once in observation mode, stays in observation mode until: termination detected, or until non-advancing condition detected. When non-advancing situation is detected, flip back to merge phase.

Any local heap merge thread is interruptible by interloper thread, any interrupted thread merge operation is capable of being resumed by other thread(s) sharing L2.

For Global heap merge use similar technique, except that the pool of threads permitted to resume an interrupted merge can be any thread sharing the L3. (optimization for multi-socket-ed systems).

Sorry for not writing s pseudo code example (not on your payroll).

Jim Dempsey

roman_pearce · ‎02-16-2009

I came up with a simpler solution: sleep for 4 microseconds whenever a thread has to spin. This gives the OS some breathing room to run. The result:

http://www.cecm.sfu.ca/~rpearcea/corei7_speedup_fixed.pdf

Which is basically ok. Your idea sounds interesting, but complicated. Of course I'll give you credit if I use parts of it, but I'm happy my simple solution (for now).

jimdempseyatthecove · ‎02-16-2009

Think of the method I suggested as a "Wait-Free" technique whereby when one thread stalls due to O/S intervention another thread can resume the unfinished task.

Although sleep may work some of the times, it will not work all (or near all) of the times.
Consider using SwitchToThread() over sleep. Most systems provide Sleep(miliseconds) not microseconds.
If the average throughput is what you are interested in then the Sleep or SwitchToThread may be sufficient.
If minimum worst case timing is important then consider a "wait-free" technique such as I proposed.

The wait-free code would be relatively simple to implement. Maybe tomorrow I will write some pseudo code.

Your chart looks very good now by the way.

Jim Dempsey

jimdempseyatthecove · ‎02-17-2009

Quoting - roman_pearce

I came up with a simpler solution: sleep for 4 microseconds whenever a thread has to spin. This gives the OS some breathing room to run. The result:

http://www.cecm.sfu.ca/~rpearcea/corei7_speedup_fixed.pdf

Which is basically ok. Your idea sounds interesting, but complicated. Of course I'll give you credit if I use parts of it, but I'm happy my simple solution (for now).

Roman,

Untested, un-optimized code sampel follows.

Jim Dempsey

[cpp]// buffer.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"


int _tmain(int argc, _TCHAR* argv[])
{
	return 0;
}

const int sizeData = 123;
struct Item
{
	int	key;
	float	data[sizeData];
};

const int BufferMax = 456;
struct Buffer
{
	int	count;
	Item	items[BufferMax];
};

struct CopyControl
{
	unsigned __int32 indexPair;
	Buffer outC;
	Buffer inA;
	Buffer inB;
};

void MergeBuffer(
	CopyControl& aCopyControl)
{
	// may need to expand to __int64
	unsigned __int32 indexPairCopy1 = 0;
	unsigned __int32 indexPairCopy2 = 0;
	unsigned __int32 MergeDone = (aCopyControl.inA.count << 16) + aCopyControl.inB.count;
	for(;;)
	{
		indexPairCopy1 = aCopyControl.indexPair;
		if(indexPairCopy1 == MergeDone)
			return;
		// see if expected value
		while(indexPairCopy1 != indexPairCopy2)
		{
			// other thread took over control (we stalled)
			indexPairCopy2 = indexPairCopy1;
			SwitchToThread();	// or _mm_pause();
			indexPairCopy1 = aCopyControl.indexPair;
			if(indexPairCopy1 == MergeDone)
				return;
		}
		// here when we are in copy mode
		unsigned int indexA = indexPairCopy1 >> 16;		// union would be faster/better
		unsigned int indexB = indexPairCopy1 & 0xFFFF;
		unsigned int indexC = indexA + indexB;
		if(aCopyControl.inA.count < indexA)
		{
			if(
				(aCopyControl.inB.count == indexB)
					||
				(aCopyControl.inA.items[indexA].key < aCopyControl.inB.items[indexB].key)
			)
			{
				// pick A
				aCopyControl.outC.items[indexC] = aCopyControl.inA.items[indexB];
				indexPairCopy2 += 0x00010000;			// union would be faster/better
				continue;
			}
		}
		// pick B
		aCopyControl.outC.items[indexC] = aCopyControl.inB.items[indexB];
		indexPairCopy2 += 0x00000001;			// union would be faster/better
	}
}

// prior to call by either thread
// myThread.indexPair = 0;
// otherThread.indexPair = 0;
void MergeBuffers(
	CopyControl& myThread,
	CopyControl& otherThread)
{
	MergeBuffer(myThread);		// my primary concern 
	MergeBuffer(otherThread);	// my secondary concern
	myThread.indexPair = 0;		// reset only myThread.indexPair for next time
}
[/cpp]