Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Help with poor performance on Mac OS

mikedeskevich
Beginner
I was wondering if anyone could help me with compiler options or something that would help increase the performance of TBB on Mac OS.

I have a very simple test case that I've been using as a benchmark for my cross platform thread development. I have a large array (can be as large as 128MB) that I loop through and take the square root of each element and store it in a separate array.

Serially:
for (i = 0; i < size; i++) dest[i] = sqrt(src[i]);

I have tested this on an AMD dual core machine running Windows (with MSVC as my compiler) and I get nearly 2x speedup over serial for 2 (or more) threads, which is what I'd expect. The same code on a dual processor Linux/gcc system gives me less speedup, maybe 1.5x or so. But the disaster occurs on Mac OS/gcc. I have both a dual core and a quad core machine to test on. And on both systems, I see maybe only a 10% speedup over serial if I'm lucky, it's often less than that.

I assume that I've got to be doing something wrong. I can't imagine that the Linux or especially the Mac system is so bad at multithreading that the overhead takes away all the advantage. Any hints on what I should be doing to fix this?

Thanks!
Mike
ARCH_R_Intel
Employee

Can you post the details of the parallel_for call site and how you wrote the "body" function object? There are a few possibilities:

  1. Too small a grainsize can be an issue, though I doubt that is the case here, since you're not seeing the problem on AMD.
  2. Compiler optimizers can be sensitive to the way the loop body is written. I usually load all the loop-invariant fields of interest into local variables before entering the loop; that helps some optimizers (see the sketch after this list).
  3. In your example, if sqrt is inlined as an SSE instruction, then there is only one floating-point operation for every read and write, i.e., the ratio of computation to memory references is low. Perversely, a good optimizer can give worse speedup results if it makes the serial baseline run faster. Optimization of sqrt can vary widely (e.g., whether it is inlined, whether it uses SSE instead of the x87 FPU, and whether it is vectorized). Do you have a sense of whether the Mac speedup is lower because of lower multi-core performance or faster serial performance?
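
For instance, here is a rough sketch of what I mean by hoisting in point 2. The class name and the extra scale field are just illustrative, not taken from your code:

#include "tbb/blocked_range.h"
#include <math.h>

// Illustrative body class: copy the member fields the loop uses into locals
// before the loop, so the optimizer can keep them in registers instead of
// reloading them through the "this" pointer on every iteration.
class ScaledSqrtBody {
public:
    double* src;
    double* dest;
    double scale;   // a hypothetical extra loop-invariant field
    ScaledSqrtBody(double* s, double* d, double k): src(s), dest(d), scale(k) {}
    void operator()(const tbb::blocked_range<int>& r) const
    {
        double* s = src;     // hoisted loop invariants
        double* d = dest;
        double  k = scale;
        for (int i = r.begin(); i != r.end(); ++i)
            d[i] = k * sqrt(s[i]);
    }
};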

(3) is often the problem, at least for me. In most modern CPUs, the computational engine far outstrips memory bandwidth. The solution is to restructure the computation. For example, feed the square-roots immediately into the next stage of the computation instead of saving them to memory. Or reorder the computation to "block" it in cache. Can you say more about the nature of the overall computation? Someone might be able to suggest a way to restructure it.
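
For instance, instead of writing the square roots out to memory and reading them back in a second pass, a fused loop keeps the intermediate value in a register (the second operation here is just a placeholder):

#include <math.h>

// Two passes: the sqrt results go out to memory and come back in,
// so the loop pays for memory bandwidth twice.
void two_pass(const double* a, const double* c, double* tmp, double* out, int n)
{
    for (int i = 0; i < n; ++i) tmp[i] = sqrt(a[i]);
    for (int i = 0; i < n; ++i) out[i] = tmp[i] * c[i];
}

// Fused: each square root feeds the next stage immediately, halving the
// memory traffic and raising the ratio of computation to memory references.
void fused(const double* a, const double* c, double* out, int n)
{
    for (int i = 0; i < n; ++i) out[i] = sqrt(a[i]) * c[i];
}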

mikedeskevich
Beginner
Here's the body of the function (it's just for benchmarking, not production code or anything):

#include <math.h>
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"

// TBB_THREADS and TBB_GRAINSIZE are configuration values defined elsewhere in the benchmark.

void serial_loop(double* src, double* dest, int size)
{
    while (size--) *dest++ = sqrt(*src++);
}

template <class T> class parallel_loop
{
public:
    T* src;
    T* dest;
    parallel_loop(T* _src, T* _dest): src(_src), dest(_dest) {}
    void operator()(const tbb::blocked_range<int>& r) const
    {
        int offset = r.begin();
        int size = r.end() - r.begin();

        serial_loop(src + offset, dest + offset, size);
    }
};

void loop(double* src, double* dest, int size)
{
    if (TBB_THREADS > 0)
    {
        if (TBB_GRAINSIZE > 0)
        {
            // explicit grainsize, default simple_partitioner
            tbb::parallel_for(tbb::blocked_range<int>(0, size, TBB_GRAINSIZE), parallel_loop<double>(src, dest));
        }
        else
        {
            // let auto_partitioner pick the chunk size
            tbb::parallel_for(tbb::blocked_range<int>(0, size), parallel_loop<double>(src, dest), tbb::auto_partitioner());
        }
    }
    else
    {
        serial_loop(src, dest, size);
    }
}

I think you're right about the problem being issue #3. I just created a more computationally expensive function for the loop and I'm getting perfect scalability up to 4 cores. So I'm guessing that memory bandwidth is the problem.

Unfortunately, I cannot predict what the overall computation is going to be. What I'm doing is creating a library that does fast math on arrays of data. It's generally for scientists who get data and then need to process it. They'll take their large arrays and want to apply many different transforms to them, so I can't do clever things like reordering operations; it's up to the user to pick the operations and their order.

Thanks for your help. I guess what I'm going to need to do is add some kind of tuning procedure to the library that picks the thresholds for parallelization separately for each function on each CPU it runs on.
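
Something like this, reusing serial_loop and parallel_loop from above, where the cutoff would come from a calibration run for each function on the target machine (the value here is made up):

// Hypothetical per-function cutoff chosen by a calibration run on the
// target CPU; below it, the parallel overhead isn't worth paying.
static const int SQRT_PARALLEL_THRESHOLD = 100000;  // made-up value

void tuned_loop(double* src, double* dest, int size)
{
    if (size < SQRT_PARALLEL_THRESHOLD)
    {
        serial_loop(src, dest, size);   // too little work to split
    }
    else
    {
        tbb::parallel_for(tbb::blocked_range<int>(0, size),
                          parallel_loop<double>(src, dest),
                          tbb::auto_partitioner());
    }
}
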
ARCH_R_Intel
Employee

If your library has a lot of element-wise functions, perhaps you could have it reorder those operations. E.g., if f and g are known to be element-wise operations and the user applies f and then g to an array, divide the array into ~100,000-byte chunks and do f on a chunk followed by g on that chunk before moving on to the next chunk.
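
A rough sketch of that chunked approach (the chunk size and the function signatures here are just placeholders):

#include <stddef.h>

// Placeholder for an element-wise, in-place operation; in the real library
// this would be whatever transform the user selected.
typedef void (*elementwise_fn)(double* data, size_t n);

// Apply f and then g in cache-sized chunks, so each chunk is still hot in
// cache when g reads back what f just wrote, instead of streaming the whole
// array through memory twice.
void apply_chunked(double* data, size_t n, elementwise_fn f, elementwise_fn g)
{
    const size_t CHUNK = 100000 / sizeof(double);  // ~100,000 bytes per chunk
    for (size_t start = 0; start < n; start += CHUNK)
    {
        size_t len = (n - start < CHUNK) ? (n - start) : CHUNK;
        f(data + start, len);
        g(data + start, len);
    }
}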

Sometimes this is done by having the library calls not perform the operations immediately, but instead build a parse tree of the intended computation that is evaluated later. An extreme compile-time example of this is "expression templates". Another approach, used by RapidMind, is a just-in-time compiler that compiles the parse tree (that's a lot of work).
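
As a very small taste of the expression-template idea (everything below is illustrative, not taken from any particular library): each operation returns a lightweight expression object, and the chain is evaluated element by element in a single pass only when it is assigned into a destination array, so no intermediate arrays are written.

#include <math.h>
#include <stddef.h>

// Wraps a raw array so it can participate in expressions.
struct ArrayRef {
    const double* data;
    ArrayRef(const double* d): data(d) {}
    double operator[](size_t i) const { return data[i]; }
};

// Expression node: square root of a sub-expression.
template <class E> struct SqrtExpr {
    E e;
    SqrtExpr(E e_): e(e_) {}
    double operator[](size_t i) const { return sqrt(e[i]); }
};

// Expression node: sub-expression scaled by a constant.
template <class E> struct ScaleExpr {
    E e;
    double factor;
    ScaleExpr(E e_, double f): e(e_), factor(f) {}
    double operator[](size_t i) const { return factor * e[i]; }
};

template <class E> SqrtExpr<E> sqrt_of(E e) { return SqrtExpr<E>(e); }
template <class E> ScaleExpr<E> scaled(E e, double f) { return ScaleExpr<E>(e, f); }

// Evaluation happens here, fused into one sweep over the data.
template <class E> void assign(double* dest, size_t n, E expr)
{
    for (size_t i = 0; i < n; ++i) dest[i] = expr[i];
}

// Usage: assign(dest, n, scaled(sqrt_of(ArrayRef(src)), 2.0));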

Unfortunately, the bandwidth problem is just getting worse as processors evolve, as shown in the depressing graph on http://www.cs.virginia.edu/stream/.
