
How expensive is shared data?

Hi all,

I had some trouble parallelizing a routine. Finally I found the solution, but now I am having difficulties in understanding the effects. It boils down to this simple sample:

[cpp]static int count = 0;
static CRITICAL_SECTION cs;   // initialized once elsewhere via InitializeCriticalSection(&cs)

static void routine()
{
    if (++count > 100000)
    {
        count = 0;

        EnterCriticalSection(&cs);
        // from time to time do some exclusive, costly tests
        LeaveCriticalSection(&cs);
    }

    // do the real work
}
[/cpp]

The original program ran the routine 16 million times, so it is an ideal candidate for parallelization. Maintaining the counter in parallel did not seem ideal, but at least the routine would still run the tests roughly every 100,000th call, which is perfectly sufficient, so I did not bother to change anything.

When I ran this code sequentially, it took 3 seconds to complete. In parallel on 8 CPUs it still took 2.5 seconds.

Then I removed the tests but left the counter in place. Since the tests run exclusively, I suspected them of preventing the parallel code from scaling. But the execution times did not change, i.e. the tests were already effectively serialized by the counter.

Only when I finally removed the counter did the execution time drop to 0.8 seconds, which corresponds roughly to the sequential part of my program that provides the input for this routine.

Now I have two questions:
1. I understand that reading and writing shared data from different threads running on different CPUs inhibits proper caching, but is it really plausible that incrementing a counter 16 million times in parallel on 8 CPUs takes 2.5 - 0.8 = 1.7 seconds? This seems incredibly expensive. Or does this trigger additional effects beyond cache invalidation?

2. Is there any way of employing Intel tools to identify such programming flaws? Intel's Amplifier showed 'optimal' parallelization; only the amount of work seemed to increase dramatically when run in parallel.

Thanks!
5 Replies

Quoting - racker
1. I understand that reading and writing shared data from different threads running on different CPUs inhibits proper caching, but is it really plausible that incrementing a counter 16 million times in parallel on 8 CPUs takes 2.5 - 0.8 = 1.7 seconds? This seems incredibly expensive. Or does this trigger additional effects beyond cache invalidation?

This is entirely to be expected. First off, you're lucky that your counter increments correctly, since ++count is not an atomic operation on a plain "int". Secondly, you're right that it will "inhibit proper caching": every increment invalidates the cache line containing the counter on all other CPUs, so the line has to be transferred again to every CPU that touches it next. This is, as you note, incredibly expensive. You are better off having each thread maintain a separate counter and only folding it into the shared value periodically, e.g. every 10000 iterations.

Contention on mutable shared state is the cause of pretty much all concurrency performance problems. Avoid it where you can.


I suggest you create a thread-private variable to hold a cardinal thread number (as opposed to an OpenMP thread team member number). Then test this thread number to see if it is the number you designate as the count manager (e.g. 0). If it is the designated thread, then it alone performs the increment and the critical-section code. Note: if the resulting test frequency is too low, consider dividing your interval by the number of cores.

Note: if your critical section only protects itself, then you can remove the critical section (since now only one thread can enter the increment section). However, if the critical section is shared with other sections, then you must keep it in your code.

Jim Dempsey

Thank you for your replies and suggestions on how to remove the counter. But I have the feeling that this discussion is about to take a wrong turn. Probably I did not formulate my question clearly enough. Please let me rephrase:

I already successfully eliminated the counter from my code, so that's not really the point anymore. But I am currently in the process of parallelizing computation intensive parts of an existing application. Therefore I am wondering:

1. Is it plausible that shared data causes such a dramatic performance degradation? The application takes 3 seconds for 16 million calls to the original function, doing the real work on one single CPU, including incrementing that old counter. When I parallelize clumsily, counting alone occupies 8 CPUs for 1.7 seconds. Should I simply accept that, or should I worry about some other dependency that is hidden right now but might surface later?

2. Do Intel's tools provide any guidance in identifying such data dependencies as I parallelize this application?

Thanks again.


Quoting - racker
Should I simply accept that or should I worry about some other dependency that is hidden right now but might surface later?

You have to live with that :p
Shared data has to be avoided at any cost.

If you have 100000 things to do, and if those things are not tightly related (each of them can be done independently), you should separate them into 4 independent sets of data (if you have 4 cores) and assign each set to a thread. This will reduce shared data...
Can you tell us more about your application?

Quoting - fraggy
Shared data has to be avoided at any cost.

Sharing data isn't always terrible, but you need to take the time to understand the details of the activity you are parallelizing. From this thread, there appear to be four things happening:

(1) When you have a critical region then, as everyone has mentioned, you increase the cost of that operation. In your original code you went from a counter that stayed in cache (and maybe even in a register) to a value that required continual cache-line transfers between the processors. I haven't measured the difference, but the increase in time can easily be a factor of 100, so this factor alone would considerably increase your original 3-second single-core time. As mentioned in several posts, using a private variable is the usual way this sharing is avoided. On the rare occasions that the total is required at the end of the run, a reduction is performed after the last synchronization point. Global reduction operations are the preferred way to gather state across parallel environments, rather than building up the state one iteration at a time.

(2) There is overhead required for the system to create, manage, and communicate between the additional threads. For longer runtimes this overhead often becomes negligible. For a 3-second task, it can be a non-trivial contributor to the total time.

(3) Once you get past the additional time that multi-threading adds to the original time, you need to identify any place where data needs to be in multiple places concurrently. Sharing data, as in a shared array, isn't necessarily bad, but requiring data to be in multiple places concurrently is very bad. As fraggy mentions, you want large chunks of work; hidden sharing of data is a time killer. A first attempt at parallelism often has data from the same cache line going to two different cores. The same data may not be required on both cores, but merely having data shared in a cache line has a dramatic impact on performance. This false sharing is probably the biggest hidden performance impact with a shared array.

(4) After you have minimized the overhead, gathered state with global reduction operations, and arranged all of the data so that it is only required in a single place at a time, then you are ready to see how fast your application may run in parallel. Amdahl's Law gives you an upper bound on the performance improvement. Your statement about "mostly" parallel is one of the difficult areas. Here is a small example: if you start at three seconds, and parallelizing increases the single-CPU time to six seconds (which isn't uncommon), then with four threads you need approximately 70% parallelism just to get back to the original 3 seconds. You might be tempted to use all 8 cores; after you reach 90% parallelism, the resulting time would be 1.3 seconds. You have now spent 8x the hardware to get 2.35x the performance. To be effective, the "mostly" has to be a very strong "almost all".

Based upon my experience, 3 seconds isn't a very long time, and it is difficult to hide the parallelization overhead. If possible, you might look at larger chunks of work for the parallelism. Additionally, you will likely get better gains from the IPP and MKL libraries if you can use them for your processing. Some of those routines can take advantage of multiple cores themselves, so this will probably give you more bang for your buck.


